How Can AI Models Legally Obtain Training Data?–Doe 1 v. GitHub (Guest Blog Post)

by guest blogger Kieran McCarthy

Doe 1 v. GitHub, Inc. is one of the first major class-action lawsuits to dive into questions of online collection of “public data” and generative AI training data sets. Given the importance of generative AI, the implications for other generative AI projects, and the number of legal issues involved, many legal observers were keen to follow the first few breadcrumbs in this case.

On May 11th, the court ruled on the Defendants’ Motion to Dismiss, granting in part and denying in part. Specifically, the court granted Defendants’ Motion to Dismiss on the CCPA, tortious interference, false designation of origin, fraud, breach of terms of service, unfair competition, negligence, and civil conspiracy claims (only the last of these was dismissed with prejudice); the court denied Defendants’ Motion to Dismiss on the DMCA Section 1202(b)(1), Section 1202(b)(3), and breach of license agreement claims. The court also held that the coders did not have standing to seek damages, but they did have standing to pursue injunctive relief. The court also held that plaintiffs were permitted to proceed pseudonymously.

What does all that mean for companies looking to develop generative AI, and for the online sources of training data that might be looking to stop them?


We can infer from this opinion that treatment of Copyright Management Information (“CMI”) will be tricky for generative AI developers. Also, ignoring copyright licenses is at least arguably copyright infringement, and your fair use claim probably won’t get you out of the lawsuit at the motion to dismiss stage.

But most of the other conclusions in this opinion involve arcane issues of civil procedure that are unlikely to recur with other fact patterns.

This lawsuit features a group of coders who filed a class-action lawsuit pseudonymously against GitHub, OpenAI, Microsoft, and various affiliates for allegedly “ignor[ing], violat[ing], and remov[ing] the [open-source software licenses] offered by thousands—possibly millions—of software developers, thereby accomplishing software piracy on an unprecedented scale.” Complaint at 2.

OpenAI, creator of the ChatGPT, GPT-3 and GPT-4, Codex, and Copilot AI systems, is the consensus leader in the race to create AI that may take all of our jobs and destroy the human race (or, more charitably, be the most disruptive technology since the invention of the printing press). OpenAI is an affiliate of Microsoft, which also owns GitHub, the popular online code repository. According to the complaint, these separate entities are just one big data-sharing family, leveraging their combined resources in non-standard ways, such as Microsoft sharing hardware and cloud infrastructure resources in exchange for an ownership interest in OpenAI. According to the complaint, Microsoft’s ongoing relationship with OpenAI has led some to describe Microsoft as “the unofficial owner of OpenAI.” Complaint at 31.

The crux of the complaint is that OpenAI took code that was stored on GitHub and used it as training data to build out AI systems called Codex and Copilot. And while most of the code stored in those repositories was likely marked “public” and was also open source, that code was subject to certain licensing and attribution requirements that were allegedly ignored when the code was used as training data to create OpenAI’s Codex and Copilot systems.

For machine learning and artificial intelligence systems to do what they do, they need training data. Training data is the initial data set that allows a machine learning system to learn to do whatever someone is trying to teach it to do. If your system is a large language model like ChatGPT, you’re using training data to teach your system to communicate effectively and intelligently and not hallucinate court cases that do not exist. If it’s an AI system that can manipulate images through text, it’s the set of pictures you use to teach that system to do what it does. If it’s a system that helps people code, like Codex and Copilot, it’s the billions of lines of code you use to teach the system how coding typically works.
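To make that concrete, here is a deliberately tiny sketch of the idea (a toy model and corpus of my own invention, implying nothing about how Codex or Copilot are actually built): a “model” that learns which token tends to follow which from a handful of code snippets, then uses those learned statistics to make suggestions.

```python
from collections import Counter, defaultdict

# Toy training corpus: a few lines of "public" code standing in for
# the billions of lines a real system trains on.
training_data = [
    "def add(a, b): return a + b",
    "def sub(a, b): return a - b",
    "def mul(a, b): return a * b",
]

# "Training": count which token follows each token across the corpus.
follows = defaultdict(Counter)
for snippet in training_data:
    tokens = snippet.split()
    for cur, nxt in zip(tokens, tokens[1:]):
        follows[cur][nxt] += 1

def predict_next(token):
    """Suggest the most common continuation seen during training."""
    candidates = follows.get(token)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("return"))  # suggests "a", learned from the corpus
```

The point of the sketch: everything the system can suggest comes from the statistics of its training data, which is why the provenance of that data matters so much in this case.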

Simply put: Training data is the data-fuel that makes AI systems go. And the best way to get it is to collect massive amounts of data from some “public” source. But as we have seen in other contexts, the consequences of taking “public data” might be more complicated than they initially seem.

Plaintiffs brought twelve separate claims against Defendants. And the case had lots of procedural wrinkles. In this post, I’m only going to focus on the findings that I think may be replicated in other cases going forward.

Failure to State a Claim for Breach of GitHub’s Terms of Service

The parties agree that all code uploaded to GitHub is subject to its Terms of Service.

Users retain ownership of content they upload to GitHub, but grant GitHub:

the “right to store, archive, parse, and display [the content], and make incidental copies, as necessary to provide the Service, including improving the Service over time.” No. 22-cv-7074-JST, ECF No. 1-2 at 27. This “includes the right to do things like copy [the code] to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; [and] share it with other users.” Id. at 27-28. Further, the Terms of Service provide that users who set their repositories to be viewed publicly “grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform [the content] through the GitHub Service and to reproduce [the content] solely on GitHub as permitted through GitHub’s functionality.” Id. at 28.

Doe 1 v. GitHub, Inc. 2023 WL 3449131 at *1 (N.D. Cal. May 11, 2023).

On its face, taking users’ code, giving it to an affiliate, and letting the affiliate use it to create a training data set to compete with the creators of the code would seem like a violation of those terms.

But not so, says the court.

Plaintiffs have not met their burden to allege facts demonstrating an injury-in-fact sufficient to confer standing for their privacy-based claims. Plaintiffs’ claims for breach of the GitHub Privacy Policy and Terms of Service, violation of the CCPA, and negligence are dismissed with leave to amend.

Id. at *5.

At first, I was shocked by this conclusion since it seems obviously wrong. But I think this might have more to do with the way the lawyers pleaded this issue rather than the quality of the potential breach of contract claim here.

Plaintiffs’ lawyers focused their breach of terms of service arguments in the complaint entirely on misuse of personal data. According to paragraph 216 of the complaint:

GitHub has substantially and materially breached GitHub’s Policies in the following ways:

a. Sharing Plaintiffs’ and the Class’s personal data with unauthorized third parties in violation of the GitHub Privacy Statement;
b. Selling and distributing Plaintiffs’ and the Class’s personal data in contravention of the GitHub Policies;
c. Use of Plaintiffs’ and the Class’s personal data after the GitHub Privacy Statement explicitly claims it will be deleted;
d. Use and distribution of Plaintiffs’ and the Class’s personal data outside the limitations set forth in the GitHub Privacy Statement.

At first blush, for sub-paragraphs a-c, I think if the plaintiffs had replaced “personal data” with “code,” they might have prevailed on this claim. It’s not that GitHub gave their personal data to an affiliate; they gave their code to an affiliate.

The court said that the plaintiffs failed to identify any instance of personal data that was shared in violation of the terms, and so dismissed the claims.

Plaintiffs just amended their complaint to clarify:

GitHub’s Privacy Statement defines “personal data” to include “any . . . documents, or other files”, a definition that necessarily comprises source code, and hence the Licensed Materials. (As of May 2023, GitHub has updated this provision on its website to explicitly read “any code, text, … documents, or other files”). Elsewhere, the Privacy Statement provides “We do not sell your personal information,” “No selling of personal data,” “We do not sell your personal data for monetary or other consideration.” (Emphasis in original). By making the Licensed Materials available through Copilot in violation of the Suggested Licenses, and charging subscription fees, GitHub has been selling Licensed Materials. By selling the Licensed Materials, GitHub has breached these provisions in GitHub’s Policies against selling user data.

Amended Complaint at 56, paragraphs 234 and 235.

Based on my read of these facts, that should be enough to survive another motion to dismiss. We’ll see if the court agrees.

Plaintiffs Lack Standing to Bring Damages Claim But Have Standing to Seek Injunctive Relief

One thing that makes these generative AI cases so difficult from the plaintiffs’ lawyers’ perspective is that, even though there is often obvious copying and use of copyrighted materials without permission, the nature of generative AI means the final product often includes no clear indication of the source of the data. That will make it hard to prove damages. And this case was no exception.

The court found:

[W]hile Plaintiffs identify several instances in which Copilot’s output matched licensed code written by a Github user, Compl. ¶¶ 56, 71, 74, 87-89, none of these instances involve licensed code published to GitHub by Plaintiffs. Because Plaintiffs do not allege that they themselves have suffered the injury they describe, they do not have standing to seek retrospective relief for that injury.

Id. at *5.

Not all was lost, however. Plaintiffs argued that, given the popularity of Copilot, it is a near certainty that their code will be used with copyright notices removed or in violation of their open-source licenses. The court said that “While Plaintiffs have failed to establish an injury-in-fact sufficient to confer standing for their claims for damages based on injury to property rights, they have standing to pursue injunctive relief on such claims.” Id. at *7.

Plaintiffs have since amended their complaint to identify many alleged instances where Plaintiffs’ code was used verbatim, or nearly so, without proper attribution. Again, my read of this record is that this should be enough to put damages back in play and survive another motion to dismiss.

DMCA and the CMI Hornets’ Nest

The part of this case that seems the most likely to stick is the Plaintiffs’ DMCA claims.

According to the court:

“Copyright law restricts the removal or alteration of copyright management information (“CMI”) – information such as the title, the author, the copyright owner, the terms and conditions for use of the work, and other identifying information set forth in a copyright notice or conveyed in connection with the work.” Stevens v. Corelogic, Inc., 899 F.3d 666, 671 (9th Cir. 2018). Section 1202(b) of the DMCA provides that one cannot, without authority, (1) “intentionally remove or alter any” CMI, (2) “distribute … [CMI] knowing that the [CMI] has been removed or altered,” or (3) “distribute … copies of works … knowing that [CMI] has been removed or altered” while “knowing, or … having reasonable grounds to know, that it will induce, enable, facilitate, or conceal” infringement. 17 U.S.C. § 1202(b).

Doe 1 at *11.

According to the record, Copilot reproduces code without the accompanying CMI because it was designed not to include it. Defendants argued that this passive non-inclusion of CMI is different from active removal of CMI, but the court was not persuaded. Plaintiffs therefore stated a claim for violation of Sections 1202(b)(1) and 1202(b)(3) of the DMCA.
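To make the CMI concept concrete: in source code, CMI is often just the license header comment. A minimal sketch (hypothetical code of my own, not a claim about how Copilot actually works) of what code “with CMI removed” looks like:

```python
# A licensed snippet as it might appear in a public GitHub repo.
# The header comment is the CMI: author, copyright notice, license terms.
licensed_code = """\
# Copyright (c) 2021 Jane Coder
# Licensed under the MIT License; see LICENSE for terms.
def greet(name):
    return f"Hello, {name}!"
"""

def strip_cmi(source: str) -> str:
    """Drop leading comment lines -- the attribution, notice, and terms."""
    lines = source.splitlines()
    while lines and lines[0].lstrip().startswith("#"):
        lines.pop(0)
    return "\n".join(lines)

output = strip_cmi(licensed_code)
print(output)
# The functional code survives verbatim; the attribution and license
# terms do not -- which is the core of the Section 1202(b) theory.
```

Whether the CMI is actively stripped like this or simply never carried through into the output, the end result the plaintiffs complain of is the same: working code distributed without its identifying information.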

While what generative AI developers do with copyrighted materials is very different from “ripping CDs” and the other technologies that originally motivated the DMCA, the DMCA might end up as the single biggest legal obstacle to the development of generative AI.

Violating Licensing Terms Might Constitute Breach

Plaintiffs’ logic on the breach of license claims is similar to that of the DMCA claims. As the court summarized the allegations:

Plaintiffs advance claims for breach of the eleven suggested licenses GitHub presents to users that require (1) attribution to the owner, (2) inclusion of a copyright notice, and (3) inclusion of the license terms. Compl. ¶ 34 n.4. Plaintiffs attach each of these licenses to the complaint. Plaintiffs allege that use of licensed code “is allowed only pursuant to the terms of the applicable Suggested License,” and that each such license requires that any derivative work or copy include attribution, a copyright notice, and the license terms.

Id. at *13.

Plaintiffs alleged that Defendants reproduced code as output without attribution, copyright notice, or license terms. That was good enough for the court to allow the breach of license claim to proceed.

I’ve only touched on a fraction of the issues in this case, and this post is already well over 2,000 words. And this case is just beginning. In terms of the implications for other businesses doing similar things, the real tension point would seem to be whether Plaintiffs are able to get an injunction to stop use of output derived from training data that includes Plaintiffs’ code. Since it would likely be impossible to strip out the impact of some training data from other training data, depending on the scope of an injunction, there is a possibility that an injunction could shut down the entire enterprise of these AI systems. I imagine this case would settle before we get to that point.

I’m not sure there’s enough in this early-stage opinion to allow us to predict with any confidence how these issues will shake out. But here’s one opinion I feel confident about: regardless of whether you think generative AI will put millions of lawyers out of work in the long term, in the short and intermediate term, it’s going to keep lots of lawyers busy.