Reddit vs Perplexity: The Opening Shots In The AI Data Arms Race?
Reddit filed a lawsuit against Perplexity AI earlier this month, accusing the AI company of unlawfully scraping its platform through backdoor methods. Which raises a pretty important question: where’s the line between open, fair data access and unauthorised scraping? At first glance, this might sound like just two tech companies at each other’s throats. But beneath the surface, it’s a fight over the value of online content and who deserves to benefit from it. In fact, the court’s ruling in the Reddit vs Perplexity case could shape how sites and Web Hosting providers manage and protect content, not to mention how AI models access and use data.
KEY TAKEAWAYS
- Reddit claims Perplexity knowingly bypassed technical defences to scrape data indirectly via Google search results.
- Traditional crawlers index content to drive website traffic; AI crawlers gather content to generate answers, often not directing visitors to websites.
- The ruling could redefine what “publicly available” means and whether the same restrictions that apply to web scrapers also apply to AI summarisation tools.
- The lawsuit is making the case that access permissions and restrictions are becoming legal boundaries.
- As things develop, expect more litigation, tighter data controls, and AI models that may push the limits of fair use.
What Sparked the Reddit vs Perplexity Lawsuit
Historically, the web has relied on the principle of open access: information was shared freely, indexed by search engines, and available to the public. That seems to be changing thanks to generative AI models that aren’t just indexing content but consuming data at a previously unheard-of scale.
Reddit claims Perplexity accessed data that was never meant to be freely scraped, even if it appeared in Google search results. Perplexity, meanwhile, insists it’s being unfairly targeted for summarising public content, much like a next-generation search engine would. Which kind of sounds like the type of excuse used when one gets caught with one’s hand in the cookie jar. But I digress.
In the court documents, Reddit alleges that Perplexity’s AI crawlers bypassed its access controls, using Google search results as a backdoor to obtain content for its search-and-answer service without permission. Reddit also named three data-scraping services (Oxylabs, AWMProxy, and SerpApi) as co-defendants.

To further back up its accusation of deliberate, large-scale scraping, Reddit says citations of its content in Perplexity’s answers jumped nearly fortyfold even after it sent a cease-and-desist letter.
This brings us to the heart of the complaint: the circumvention. Reddit maintains that, unlike licensed partners such as OpenAI, which pay for access to its content, Perplexity and the three co-defendants allegedly masked their identities and locations, among other things, to get around anti-scraping features.
In a nutshell, since the AI scrapers couldn’t access data directly on Reddit’s site, they got in through the back door and indirectly scraped content from Google Search results.
The Test Post
To prove this claim, Reddit laid a trap: a deliberately hidden test post configured to be visible only to Google’s crawler for indexing, and nowhere else.
According to the lawsuit, this supposedly “hidden” content appeared in Perplexity’s AI-generated summaries within hours, allegedly proving that Perplexity used data pulled from Google search results. Reddit argued it had found clear evidence of circumvention.
Perplexity’s Defence
Perplexity AI, however, paints a very different picture. It denies that it scrapes or stores Reddit content, or that it trains AI models on it; instead, it says it merely summarises publicly available information while crediting the original sources, much as search engines show snippets in results pages.
The company then posted a response on Reddit, saying:
“We summarize Reddit discussions, and we cite Reddit threads in answers, just like people share links to posts here all the time.”
In other words, Perplexity sees itself as an AI-powered search summariser, not a data harvester. The company also framed Reddit’s lawsuit as an attempt to gain leverage in broader negotiations over how platforms will charge AI developers for data access.
“We do not train on Reddit data. We cite it like a search engine would cite a webpage,” the company said in its public response.
This also isn’t the first time Perplexity has faced accusations like these; it has previously been accused of plagiarising and republishing exclusive content from news sites like Forbes.
Forbes Chief Content Officer Randall Lane wrote about Aravind Srinivas, CEO and cofounder of Perplexity, in a blog dated June 11, 2024: “I’m an AI bull, and in the right hands, productivity and advances and prosperity await, but in the hands of the likes of Srinivas — who has the reputation as being great at the PhD tech stuff and less-than-great at the basic human stuff — amorality poses existential risk.”
More recently, the Content Delivery Network (CDN) provider Cloudflare found evidence of “stealth crawling behaviour” after receiving complaints from customers who had disallowed and blocked the public (declared) PerplexityBot crawler, yet found their content was still being accessed.
Without getting too technical, Cloudflare observed that Perplexity was instead using undeclared crawlers that rotated through multiple IP addresses to hide their identity and get around the blocks.
Make of that what you will.
The New Breed: AI Crawlers vs. Traditional Search Bots
To give you a better understanding of what’s going on in this case, it helps to know how the way AI crawlers collect data differs from how classic search engine bots crawl and index websites.
The main difference between traditional search crawlers like Googlebot and the new generation of AI crawlers (GPTBot, PerplexityBot) is what they’re made for and the methods they use.
Traditional bots are designed for indexing: they read a page’s content so they can send visitors to your website via Search Engine Results Pages (SERPs). They operate transparently, identifying themselves and generally respecting robots.txt files, which specify which of your pages can and cannot be crawled.
AI crawlers, on the other hand, are made to gather as much information as possible, pulling massive amounts of text, code, and structured data to train Large Language Models (LLMs). The models then use Natural Language Processing to generate answers directly in the chat when you ask a question, meaning you don’t necessarily have to visit the source website.
A Wikipedia report shows a steady decline in its human page views, which it attributes to generative AI providing answers directly rather than sending visitors to websites.
Traditional bots make many small visits and look for incremental updates to keep the indexed content fresh and relevant. They have a “lighter” footprint and are designed not to strain the website’s server resources.
AI crawlers have a heavier footprint, gathering far more data per request to train LLMs and power AI search answers. They make fewer, but much larger, requests (a request being each time the bot asks the website’s server for a page or file). They often download entire pages and related elements in bulk, because their goal is to understand the content’s context and substance, and this consumes more of a website’s bandwidth and server resources.
As you can imagine, this super-aggressive crawling by AI bots puts major strain on a site, causing slow load times or even complete crashes.
Lastly, while search bots obey robots.txt files and crawl directives, many AI bots deliberately ignore or circumvent them, hiding their identities by rotating IP addresses through proxies, to get at the data needed for training LLMs.
This technically unethical behaviour (depending on who you ask) is where the problem comes in.
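For context, here’s what opting out looks like in practice. Below is a minimal robots.txt sketch (the bot names are the publicly declared user agents mentioned above; the exact list you block is your choice) that welcomes search indexing while disallowing known AI crawlers:

```
# Allow Google's search crawler to index the whole site
User-agent: Googlebot
Allow: /

# Opt out of known, declared AI crawlers (example list)
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

The catch, as described above, is that compliance is voluntary: a well-behaved bot honours these directives, while a stealth crawler simply ignores them.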
The Legal and Technical Questions at Play
This lawsuit is heading into a legal grey area where the regulations are only beginning to be defined. It intersects copyright law, anti-circumvention rules, and emerging AI data-use standards. Some of the burning questions on data privacy and security that now need to be addressed are:
- Can accessing data via Google’s search-indexed pages still count as “unauthorised”?
- Does summarising publicly available content, with citations, actually differ legally from gathering it to use in an AI model answer?
- How much control over their content do website owners truly have once their pages are indexed by search engines, given what is happening with AI crawling?
From a technical standpoint, the case revolves around access controls and the boundaries defining what is fair game for data retrieval. It also has the knock-on effect of potentially jeopardising content creators’ monetisation opportunities, since their intellectual property can be used without consent or compensation.
Not to mention, it could set a dangerous precedent for data-gathering ethics. In this author’s opinion, this case isn’t just about who scrapes content and to what end; it’s a warning.

The Implications for Web Hosting and Content Creators
Based on all of this, if you run a popular blog or a successful public forum, your user-generated content (UGC) is now a valuable asset. Reddit’s vast trove of UGC, for example, is a goldmine for LLMs, which rely on human-generated content to understand and summarise information on the web.
This means high-quality web pages, articles, and comments are not just for getting visibility in online searches; they are an increasingly valuable form of training data that (honest) AI companies are willing to pay for or, as alleged, steal.
That distinction between summarising versus training on data is now the centre of the debate and could potentially dictate how LLMs operate in the (very) near future. For online businesses, this case has direct implications.
If Reddit’s complaint succeeds, it could pave the way for tighter enforcement of access rights, meaning stricter limits on what third parties can use from your site without explicit consent. That might also increase demand for structured data licensing.
If Perplexity wins, however, the decision could encourage a more open interpretation of web content as “fair-use” for AI models and search tools.
From a practical standpoint, website owners and SMEs should start looking at:
- Whether your robots.txt files and access permissions are configured to reflect exactly how you want your data to be used (see the robots.txt sketch earlier in this post).
- How search engine snippets and site feeds expose your content to certain AI tools.
- The terms of service covering user-generated content and its use by third parties.
Data Protection with Domains.co.za
As you can see, relying solely on a basic robots.txt file is no longer enough to protect your site’s content. The first line of defence against AI crawler scraping is your Web Hosting. At Domains.co.za, we have you covered.
Our Web and WordPress Hosting plans include advanced server-level and Web Application Firewalls (WAFs) that can identify and block harmful bot traffic and DDoS attacks.
Analysis tools monitor traffic patterns and bandwidth consumption for suspicious activity, flagging and blocking it. Remember, legitimate search engine crawlers have predictable behaviour, while AI bots make rapid, bulk requests.
Our servers automatically limit the number of requests an IP address can make per second. If you want even more control, you can also add .htaccess rules to block or allow specific IP addresses or user agents, as sketched below. This helps prevent bulk data extraction while maintaining a smooth user experience.
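As an illustration only, here is a minimal .htaccess sketch for an Apache server (the bot names and IP range are placeholder examples, not a definitive blocklist):

```
# Return 403 Forbidden to requests whose User-Agent matches known
# AI crawlers (example names; a stealth crawler can spoof this header,
# so combine with IP-level rules and your host's bot protection)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|PerplexityBot|CCBot) [NC]
RewriteRule .* - [F,L]

# Deny a specific IP range outright (203.0.113.0/24 is a documentation
# range used here purely as a placeholder)
<RequireAll>
    Require all granted
    Require not ip 203.0.113.0/24
</RequireAll>
```

Note that user-agent matching only stops bots that declare themselves honestly, which is exactly why the server-level rate limiting and WAF rules described above matter.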
With Domains.co.za, you get the dedicated resources your site needs for maximum performance and stability, with a setup tailored to your specifications and expert management, so you can focus on growing your business.
Having a VPS (Virtual Private Server) means your site gets its own virtual environment. Our VPS Hosting plans, available for Windows and Linux, let you configure custom access rules and install specialised software to protect your site from unwanted crawling. You can also fine-tune your settings to optimise performance for your specific needs, such as increasing bandwidth limits or adjusting processing power.
Alternatively, our Managed cPanel Web Hosting also helps mitigate the impact of aggressive AI crawlers by providing you with your own virtual environment, similar to our VPS Hosting, with the added benefit of being managed by our expert support team.
What Happens Next
Reddit is seeking damages and an injunction to block Perplexity, as well as Oxylabs, SerpApi, and AWMProxy, from further scraping, and to permanently stop them from using or selling any previously scraped Reddit data.
The stakes are high, as they encompass defining what constitutes unfair competition, copyright infringement, and profiting from other people’s work when it comes to AI. A ruling in Reddit’s favour would cement content owners’ ability to dictate the terms under which their data is accessed and used, and could even force other companies to the negotiating table.
Conversely, a loss could essentially open the floodgates to even more sophisticated, potentially underhanded, scraping of publicly accessible content, regardless of restrictions, and let’s not forget ethics. Although based on what we’ve seen so far from some companies, ethics doesn’t even play a walk-on role in this drama.
The lawsuit is still in its early stages, but its outcome could shape how platforms and AI companies negotiate data access from here on out.
Courts will need to decide whether accessing Reddit via Google’s search results counts as circumvention and whether summarisation constitutes derivative use, while defining the boundaries between “fair use” and “theft” in the context of AI.
Regardless of the verdict, this case adds fuel to what’s most likely to become a legal dumpster fire: how to balance the open fair use of web content with compensation and credit to its creators.
FAQS
What is the Reddit vs. Perplexity AI lawsuit about?
Reddit is suing Perplexity AI, claiming the company illegally accessed and reused Reddit content without permission. The lawsuit alleges Perplexity bypassed Reddit’s access controls by scraping data through Google search results.
What does Perplexity do?
Perplexity is an AI-powered search and summarisation tool that provides direct answers to user queries.
What’s the difference between scraping and summarising?
Scraping involves automatically collecting data from a website, often in bulk. Summarising means processing or paraphrasing visible content while citing its source.
How can I protect my content from AI crawlers?
Review your robots.txt file and access control settings, set clear rules about data usage, and use web hosting with features that block unwanted crawlers.
How does AI training differ from summarising?
AI summarisation generates answers to questions on the fly using existing data. AI training, however, involves gathering massive datasets from the web to teach LLMs to generate text.
Other Blogs of Interest
- How AI can help your small business thrive
- Leveraging Generative AI for your small business: Uses and Benefits
- Domains.co.za Introduces South Africa’s First AI Domain Name Generator
- AI Crawlers Slowing Down Websites: What You Need To Know
- AI Cyber Attacks: The Halloween Edition
Rhett isn’t just a writer at Domains.co.za – he’s our resident WordPress content guru, with over 8 years of experience as a content writer, a background in copywriting, journalism, research, and SEO, and a passion for websites.
Rhett authors informative blogs and articles that simplify the complexities of WordPress, website builders, domains, and cPanel hosting. Rhett’s clear explanations and practical tips provide valuable resources for anyone wanting to own and build a website. Just don’t ask him about coding before he’s had coffee.