Reddit Challenges Anthropic’s Scraping to Create Generative AI Models (Guest Blog Post)

by guest blogger Kieran McCarthy

With as much scraping as is happening for AI training and enhancement these days, it’s amazing to me that there aren’t more lawsuits happening over scraping. The legal headlines are more of a trickle than the flood I expected.

But on June 4th, Reddit sued Anthropic in the Superior Court of California for breach of contract, unjust enrichment, trespass to chattels, tortious interference, and unfair competition.

The facts are as follows: Anthropic’s models were trained, at least in part, on Reddit data. In fact, the complaint quotes Anthropic’s model Claude admitting that it was trained, in part, on Reddit data.

I asked ChatGPT (model o4) the same question. It replied:

Yes, ChatGPT has been trained, at least in part, on Reddit data.

OpenAI has utilized Reddit content in multiple ways throughout the development of its language models:

  1. Early Training with Reddit-Linked Content: For GPT-2, OpenAI created a dataset called WebText, which consisted of web pages linked from Reddit posts that had received at least 3 upvotes. This approach was intended to gather high-quality and diverse content from the internet.
  2. Fine-Tuning with Reddit Data: In 2019, Microsoft Research developed DialoGPT, a conversational AI model fine-tuned on 147 million Reddit conversation exchanges from 2005 to 2017. This fine-tuning aimed to enhance the model’s ability to generate human-like dialogue.
  3. Recent Licensing Agreement: In May 2024, OpenAI entered into a formal partnership with Reddit, granting OpenAI access to Reddit’s Data API. This agreement allows OpenAI to use Reddit’s real-time, structured, and unique content to train and improve its AI models, including ChatGPT.

And that last part is why this Reddit-Anthropic legal dispute has arisen. OpenAI and Google have each entered into formal licensing agreements with Reddit. Anthropic has not. OpenAI and Google are paying Reddit to access Reddit content. At least according to the complaint, Anthropic is not.

Pay to access our (user-generated) content, or we’ll sue. That’s the rub. The legal question, then, is whether Reddit can restrict access to public content that is not proprietary to Reddit but was created by its users, and, if so, under what legal theories.

One of the things that’s interesting to me here is that the suit has been filed in California Superior Court, rather than in the Northern District of California. Almost all of the major legal scraping precedents happened in the Northern District, and it is definitely unusual that this was filed in state court.

I have no idea what Anthropic’s defense to this will be, but if I were counsel for Anthropic, I would start with copyright preemption arguments. This is a content use legal dispute, at its core. And there is a legal regime dedicated to that issue, and it’s called copyright law.

I think there are very strong arguments post ML Genius and related cases that the breach of contract, unjust enrichment, and unfair competition claims should be preempted by copyright.

I think Reddit’s strongest argument here is the tortious interference claim, namely that Anthropic’s failure to follow the official protocols with scraping potentially impacts its ability to comply with its own terms of service with its users (to the extent that Anthropic is not following those protocols). That would likely not be preempted by copyright, and if proven, could lead to a successful claim.

I hate that we’re still doing trespass to chattels claims in 2025. Reddit’s allegations boil down to this paragraph: “Anthropic’s acts have diminished the server capacity and functioning that Reddit can devote to its legitimate users and thereby injured Reddit by depriving it of the ability to use its personal property.”

More than two decades ago, the Cal. Supreme Court in Hamidi said “The tort does not encompass, and should not be extended to encompass, an electronic communication that neither damages the recipient computer system nor impairs its functioning.”

This complaint alleges neither damage to the servers nor impaired functioning. Merely “diminished capacity” without any attempt to quantify whether that’s a 20% diminished capacity or a .00002% diminished capacity. De minimis diminished server capacity when you’re probably using AWS shouldn’t be a tort.

We need to go back to Hamidi on that one.

Either way, this should be another interesting and important case to follow, assuming Anthropic decides to fight rather than just pay up.

* * *

Eric’s Comments

Reddit’s centerpiece claim against Anthropic is breach of contract. So how did it form a contract with Anthropic?

You’re not going to believe this, but Reddit is trying to enforce a “browsewrap.” The relevant allegations from the complaint:

Reddit prominently displayed a link to the User Agreement on its platform. The use of Reddit’s platform is governed by the User Agreement. The User Agreement states: “By accessing or using [Reddit’s] Services, you agree to be bound by these Terms. If you do not agree to these Terms, you may not access or use our Services.”

Anthropic accepted the terms of the User Agreement every time it or its agents—including ClaudeBot, Dario Amodei, or the other authors who found Reddit data to be of the highest quality and well-suited for fine-tuning AI models—accessed or logged on to Reddit’s platform.

Dafuq? Seriously?

I decided to check out this purported placement (I didn’t see a screenshot in the complaint’s body). I couldn’t immediately find the referenced link on Reddit’s landing page. To find it (I had to do a word search), you have to scroll down and look at the bottom of the third column (under the “popular communities” widget). Here’s what I saw on June 6 in Firefox after scrolling down some:

Do you see the words “user agreement” in the bottom right? That is the foundation of the breach of contract claim. Virtually invisible link. No call to action. No action button. Nothing to ensure that parties have notice or manifest their assent. FFS.

It’s even more shocking because courts recently have been dramatically raising the bar on contract formation expectations. “Sign-in-wrap” formations that historically were just fine are failing with alarming frequency. Yet, Reddit thinks it’s going to win without even so much as a sign-in-wrap???

To put it another way: if Reddit wins the breach of contract claim based on this allegation, it will completely blow up online contract formation law as we currently know it. For more on this, see my slide deck from my May presentation on online contract formation.

Now, Reddit has more arguments it could possibly make to bind Anthropic to contract terms. It could try the Register.com v. Verio/Restatement § 69 contract formation workaround. It could argue that Anthropic has publicly admitted that the TOS terms bind it. It could argue that Anthropic employees created Reddit accounts and thus learned about the restrictive terms during the account formation process. It could argue that Anthropic knew about the robots.txt restrictions and somehow that turned the robots.txt instructions into a contract. It could argue that Anthropic’s robots clicked on the user agreement link and assented as Anthropic’s legal agents. These are all theoretical arguments because Reddit doesn’t appear to be arguing any of this yet (some of these arguments would require an amended complaint).

What Reddit cannot do is successfully argue that its home page link to “user agreement” creates a binding contract. Browsewraps without a call-to-action are not a contract. Claiming otherwise puts Reddit–and its very capable and expensive lawyers–at risk of massive and unrelenting public and judicial derision.

Reddit’s trespass to chattels claim isn’t much better. Here’s how Reddit pleads Anthropic’s knowledge of the server delimitations: “Anthropic knowingly exceeded the permission granted by Reddit to access Reddit’s personal property, including its technological infrastructure and servers.” This is a threadbare allegation of knowledge without specifying any supporting facts. The complaint’s recitation of facts only marginally improves this. Further, as Kieran notes, the TTC harm statement is also pretty weak in light of the Hamidi standard.

__

Two other noteworthy points about this lawsuit.

A real party-in-interest to this lawsuit is OpenAI. Now that OpenAI is paying a license fee to Reddit, it needs all of its rivals to bear the same costs. If Reddit can’t impose license fees on Anthropic, assume that OpenAI will look for other ways to jack up Anthropic’s cost structure to more closely mirror OpenAI’s. I explain this dynamic in my Generative AI is Doomed paper.

Also, Reddit frames itself as the champion of its users’ interests, but Reddit is walking a fine (and awkward) line. Sure, it’s nominally defending its users from rapacious scraping malefactors. But in practice, even Reddit’s values have a price tag. Reddit isn’t opposed to third parties profiting from its users’ content; it just needs its vig.

* * *

Kieran’s Supplement

Following up on Eric’s comments.

Under any reasonable current interpretation of California law, the breach of contract claim should be gone. There’s no evidence of actual or constructive knowledge of the online agreement in the complaint. There is case law in other jurisdictions that says that a sophisticated business may be held liable for breach of contract even when there is no proof of actual notice in the record (CouponCabin LLC v. Savings.com, Inc., 2017 WL 83337 (N.D. Ind. Jan. 10, 2017); Int’l Council of Shopping Ctrs., Inc. v. Info Quarter, LLC, No. 17-5526 (S.D.N.Y. May 7, 2019)), and when a business also has an online agreement that is similar to the one that is being enforced. DHI Group, Inc. v. Kent, No. 16-1670 (S.D. Tex. Oct. 26, 2017). But that reasoning has never been applied in California.

Perhaps Reddit might argue in a reply that the semi-omniscient “bots” have or should have knowledge of the terms and that knowledge should be imputed to Anthropic. But that argument is also without precedent in California. If Reddit’s allegations pass muster for “actual or constructive knowledge,” then this would be a complete evisceration of the current standard for a sophisticated business’s actual or constructive knowledge of an online agreement in California.