Web Scraping for Me, But Not for Thee (Guest Blog Post)
by guest blogger Kieran McCarthy
There are few, if any, legal domains where hypocrisy is as baked into the ecosystem as it is with web scraping.
Some of the biggest companies on earth—including Meta and Microsoft—take aggressive, litigious approaches to prohibiting web scraping on their own properties, while taking liberal approaches to scraping data on other companies’ properties.
When we talk about web scraping, what we’re really talking about is data access. All the world’s knowledge is available for the taking on the Internet, and web scraping is how companies acquire it at scale. But the question of who can access and use that data, and for what purposes, is a tricky legal question, which gets trickier the deeper you dig.
Some forms of data are protected by copyright, trademark, or another cognizable form of intellectual property. But most of the data on the Internet isn’t easily protectible as intellectual property by those who might have an incentive to protect it.
For example, the most aggressive companies in pursuing web-scraping litigation are the social media companies. LinkedIn and Facebook, most notably, have done as much as anyone to shape the law of web scraping. But the content that they’re trying to protect isn’t theirs—it belongs to their users. It’s user-generated content. And while their terms of use provide the social media companies a license to use that user-generated content, it is their users who typically have a copyright interest in their content. The social media companies have no cognizable property right to assert in this content/data.
But make no mistake, these companies view this data, generated by their users on their platforms, as their property. This is true even though the law does not recognize that they have a property interest in it, and even though they expressly disclaim any property rights in that data in their terms of use.
Since the law does not give them a cognizable property interest in this data, they must resort to other legal theories to prevent others from taking it and using it.
In the early days of the Internet, the primary legal theory that companies used to stop scrapers was something called trespass to chattels. This is why Eric—who has been covering this issue for a good while now—tags all scraping posts as “Trespass to Chattels.”
The idea behind this legal theory is that web scraping (often high-volume, unwanted data requests) is a form of trespass on private tangible property, namely computer servers. But the thing about trespass to chattels is that it requires both a trespass to private tangible property and an element of damages. In the early days of the Internet, in the era of dial-up connections, it didn’t take a lot of extra traffic to damage someone’s server or impair its ability to provide a functioning website. Many web scrapers were clumsy and didn’t realize the impact of their additional requests on servers. In the late 1990s and early 2000s, web scraping often did burden or shut down websites.
But as technology improved, this legal theory stopped making as much sense. Server capacity improved by many orders of magnitude, and most scrapers became savvy enough to limit their requests so that they were imperceptible or at least inconsequential to the host servers. Now, one of the elements of the trespass to chattels claim, damage to the servers or other tangible property of the host, very rarely happens.
Next, from the early 2000s until 2017, the primary legal theory that was used to deter web scraping was the Computer Fraud and Abuse Act or the CFAA. The CFAA prohibits accessing a “protected computer” without authorization. In the context of web scraping, the question is whether, once a web scraper gets its authorization revoked (usually via cease-and-desist letter, but often in the form of various anti-bot protections), any further scraping and use of a website’s data is “without authorization” within the meaning of the CFAA.
From 2001 to 2017, the simplistic answer was yes: any form of revocation of authorization was typically sufficient to trigger CFAA liability if the scraper continued to access the site without permission. Then, in 2017, the famous hiQ Labs, Inc. v. LinkedIn Corp. decision came out, in which the district court granted a preliminary injunction protecting a plaintiff web scraper’s access to public LinkedIn data under the CFAA. The Ninth Circuit affirmed, holding:
We agree with the district court that giving companies like LinkedIn free rein to decide, on any basis, who can collect and use data—data that the companies do not own, that they otherwise make publicly available to viewers, and that the companies themselves collect and use—risks the possible creation of information monopolies that would disserve the public interest.
Many interpreted this as allowing an affirmative right to scrape public data, even if that was not the correct reading of the law and the reality was always more nuanced.
In the end, it was a Pyrrhic victory. hiQ Labs ultimately lost the case: at summary judgment, the district court held that “LinkedIn’s User Agreement unambiguously prohibits scraping and the unauthorized use of scraped data.” LinkedIn obtained a permanent injunction and damages against hiQ Labs on that basis.
Now, the primary vehicle for stopping web scraping is the breach of contract claim.
For example, in just the last few weeks, Twitter/X Corp. has filed multiple lawsuits against web scrapers, including against Bright Data, which is perhaps the biggest web-scraping company in the world.
Ten years ago, you’d typically see plaintiffs in web-scraping cases file 10-15 legal claims, with law firms exploring any legal theory that might stick. Now, in its case against Bright Data, Twitter’s lawyers filed three claims: breach of contract, tortious interference with a contract, and unjust enrichment. Lawyers are increasingly confident that courts will enforce breach of contract claims against scrapers and grant the relief they want. They don’t need or seek alternative legal theories.
And it is this legal reality, the enforcement of anti-scraping rules through breach of contract claims, that allows companies to assert property rights over how people access and use data, all through the domain of contract law.
Mark Lemley observed this happening nearly 20 years ago, in his prescient, seminal article, “Terms of Use.”
The problem is that the shift from property law to contract law takes the job of defining the Web site owner’s rights out of the hands of the law and into the hands of the site owner. Property law may or may not prohibit a particular “intrusion” on a Web site, but it is the law that determines the answer to that question. The reason my “no-trespassing” sign is effective in the real world is not because there is any sort of agreement to abide by it, but because the law already protects my land against intrusion by another. If the sign read “no walking on the road outside my property,” no one would think of it as an enforceable agreement. If we make the conceptual leap to assuming that refusing to act in the way the site owner wants is also a breach of contract, it becomes the site owner rather than the law that determines what actions are forbidden. The law then enforces that private decision. [citations omitted]
Mark Lemley, 2006 Minnesota Law Review, Terms of Use at 471.
With the breach-of-contract-as-property legal regime, host websites are free to define their rights in online data however they want, in the form of online terms of use agreements.
Rather than creating a new intellectual property regime with general rules for data use, or, even simpler, deciding cases using existing intellectual property rules, courts have allowed host websites to create their own intellectual property rights in website data through the mere act of declaring such data to be property in an online contract. Companies have almost complete liberty to declare data that is not entitled to intellectual property protection to be “proprietary,” and courts allow them to enforce this ad hoc intellectual property regime through breach of contract claims (as long as they aren’t so foolish as to do it in a way that is coterminous with copyright protections).
—
And this is where the hypocrisy comes in: the breach-of-contract-as-property legal regime has no legal requirement for intellectual honesty or consistency. It has no requirement, akin to those in trademark or patent law, that you respect others’ IP in the same way that you respect your own. Companies are free to press their advantage on what is deemed “proprietary” on their own sites while simultaneously asserting that what is on others’ sites is free for the taking. It is easy to criticize this, but this is what smart lawyers and legal teams do.
—
Let’s look at what Microsoft is doing right now, as an example.
In the last couple of weeks, Microsoft updated its general terms of use to prohibit scraping, harvesting, or similar extraction of data from its AI services.
Also in the last couple of weeks, Microsoft affiliate OpenAI released a product called GPTBot, which is designed to scrape the entire Internet.
And while it doesn’t admit this publicly, OpenAI has almost certainly already scraped the entire non-authwalled Internet and used it as training data for GPT-3, ChatGPT, and GPT-4.
Nonetheless, without any obvious hint of irony, OpenAI’s own terms of use prohibit scraping.
Last year, Microsoft subsidiary LinkedIn loudly and proudly declared victory in the most high-profile web-scraping litigation in US history, imposing a permanent injunction on a former business rival to prevent it from scraping and accessing its private and public data forever. VP of Legal Sarah Wright declared, “The Court’s ruling helps us better protect everyone in our professional community from unauthorized use of profile data, and it establishes important precedent to stop this kind of abuse in the future.”
—
I’m picking on Microsoft, as it is the most flagrant offender here. But I could pick on hundreds of others who are also hypocritical on this issue. Notably, Meta is also famously suing a company right now for scraping and selling its public content, even though Meta once paid the same scraper to scrape public data for them.
As I said at the start of this post, hypocrisy is endemic to this legal regime.
—
I, for one, don’t blame Microsoft or Meta or any of the other companies that take hypocritical stances on scraping. That’s what smart legal teams do when courts allow them to do it.
I blame the courts.
I blame the court in Register.com v. Verio, Inc. that paved the way for contracts of adhesion in the absence of assent. I blame the Northern District of Texas for enabling Southwest Airlines to sue anyone who publishes public information about its flights. I blame the court in the hiQ Labs case that made no attempt to explain why hiQ Labs was entitled to a preliminary injunction on its CFAA claim, but LinkedIn was entitled to a permanent injunction on its breach of contract claim on the exact same facts a few years later.
Courts need to realize that if you allow private companies to invent intellectual property rights through online contracts of adhesion, courts will be at the mercy of private decision-makers on questions that should be questions of public interest.
But given the fact that contracts, even online contracts, are a state-law issue, it’s hard to imagine a simple resolution to this problem. One possible solution might be a more all-encompassing interpretation of the copyright preemption doctrine, but the current law of copyright preemption is a muddled mess of a circuit split and the Supreme Court just declined an opportunity to resolve it.
—
But regardless of what you and I think about this legal regime, that is the current state of the law.
The next testing ground for it will be with these generative AI cases.
I’ve long said we have not yet reached a stable equilibrium on these issues, because this kind of inconsistency in the law cannot be sustained. That means we are likely to see plenty of fireworks on these issues in the next few years.