Brightspot has identified a surge in site scraper traffic: automated processes accessing websites to copy their content for use elsewhere. While not a new phenomenon, the recent explosion correlates with increased worldwide interest in artificial intelligence. Services operating in legal “gray areas” facilitate the misappropriation of copyrighted material by exploiting technical practices designed for other purposes, and by offering tools explicitly built to defeat traditional protections such as firewalls.
Starting in early July of this year, Brightspot’s Information Security and Managed Services teams began to observe a novel traffic pattern across multiple customer sites. This traffic looked like neither the usual ebb and flow of daily website usage nor the more dramatic spikes of breaking news or denial-of-service attacks; instead, sites were seeing concentrated bursts of otherwise normal-looking traffic every few hours.
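As a rough illustration, this kind of burst pattern can be picked out of request timestamps by counting requests in fixed time windows and flagging windows that far exceed the typical rate. This is a minimal sketch, not Brightspot’s actual monitoring; the window size and threshold multiplier are illustrative assumptions.

```python
# Sketch: flag time windows whose request count far exceeds the typical rate.
# Window size and threshold multiplier are illustrative assumptions.
from collections import Counter

def burst_windows(timestamps, window_seconds=300, multiplier=5):
    """Return start times of windows whose request count exceeds
    multiplier x the median per-window count."""
    counts = Counter(int(t) // window_seconds for t in timestamps)
    ordered = sorted(counts.values())
    median = ordered[len(ordered) // 2]
    return sorted(w * window_seconds for w, c in counts.items()
                  if c > multiplier * max(median, 1))

# Steady background traffic: one request every 30 seconds for a day.
steady = [i * 30 for i in range(2880)]
# A concentrated midday burst of 500 requests in a few minutes.
burst = [43_200 + i for i in range(500)]

print(burst_windows(steady))           # [] -- nothing unusual
print(burst_windows(steady + burst))   # [43200, 43500] -- the burst stands out
```

Real traffic would also be grouped by client IP or user agent before counting, but even this crude windowing makes the every-few-hours burst shape visible.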
With every website, some pages are more popular than others (maybe the homepage, the sports page, or the jobs page), while other pages are less popular. This creates a “long tail” of pages that are accessed infrequently. Content Delivery Networks (CDNs) operate on this principle and are optimized for delivering popular content quickly. When visitors are overwhelmingly looking for long-tail content (for example, news articles from 2017), the CDN doesn’t have a local copy of those articles in its cache and has to forward each request to the origin servers. This lowers what’s called the “cache hit ratio,” and a sudden drop in that ratio is a telltale sign of site scraping. The new traffic pattern we are seeing fits this mold.
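The effect on the cache hit ratio is easy to see with a toy calculation. The log format below (path plus a HIT/MISS cache status, as most CDNs record) and the sample traffic are illustrative assumptions, not data from any real site.

```python
# Minimal sketch: estimate a cache hit ratio from (path, cache_status) log
# entries. The log shape and sample traffic are illustrative assumptions.
def cache_hit_ratio(entries):
    """entries: iterable of (path, status) tuples, where status is 'HIT' or 'MISS'."""
    entries = list(entries)
    if not entries:
        return 0.0
    hits = sum(1 for _, status in entries if status == "HIT")
    return hits / len(entries)

# Normal traffic: mostly popular pages, mostly served from the CDN cache.
normal = [("/", "HIT"), ("/sports", "HIT"), ("/jobs", "HIT"),
          ("/news/2024/story", "MISS")]

# Scraper burst: long-tail archive pages the CDN has never cached.
scrape = [(f"/news/2017/article-{i}", "MISS") for i in range(20)]

print(cache_hit_ratio(normal))           # 0.75
print(cache_hit_ratio(normal + scrape))  # 0.125 -- the ratio collapses
```

A scraper walking the archive turns almost every request into a cache miss, so the ratio collapses even though each individual request looks perfectly normal.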
Site scrapers take advantage of a few of the less-visible parts of how popular websites, and popular search engines, work. Search engines like Google operate specialized processes, called “crawlers,” to discover web pages and make them findable. Websites want to be found by search engines, so they do two things: they put up a set of instructions for those crawlers (called robots.txt) that says things like “look here but don’t bother looking here,” and they put up a map of the whole site (called, unsurprisingly, the sitemap) so that the crawler doesn’t miss anything. Site scrapers take advantage of the sitemap, but ignore the robots.txt.
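Python’s standard library includes a robots.txt parser that shows what honoring those instructions looks like in practice. The rules below are illustrative; a well-behaved crawler runs this check before every fetch, while a scraper reads the sitemap and skips the check entirely.

```python
# Sketch of how a well-behaved crawler consults robots.txt before fetching
# a page. The example rules below are illustrative assumptions.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Allow: /news/
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A legitimate crawler checks each URL before requesting it:
print(parser.can_fetch("*", "https://example.com/news/2017/article"))  # True
print(parser.can_fetch("*", "https://example.com/admin/login"))        # False

# The same file advertises the sitemap -- the part scrapers do use:
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```

Nothing enforces the `can_fetch` check: robots.txt is a convention, not a control, which is exactly the gap scrapers exploit.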
Until recently, site scrapers, even completely illegitimate ones, were only a minor concern. Monetizing stolen content on the web is very difficult, given the strict protections in place with Google and other search engines, so aside from the occasional inconvenience, it wasn’t something we worried much about. However, with the popular emergence of AI, good content in large doses takes on new value, primarily because of something called Retrieval-Augmented Generation (RAG). RAG is a way to supplement a Large Language Model with more specialized content retrieved at query time, improving the relevance and accuracy of its output.
And so now we’re primed for a kind of gold rush for “free” content — and it’s happening.
Brightspot traced the offending traffic (which arrived at uncharacteristically high volumes, putting stress on site infrastructure) to one of a handful of commercial scraping companies that openly advertise their services online. These companies are doing their best to capitalize on the new information gold rush, with simple instructions, clear pricing, sophisticated anti-detection tools, and some unconvincing language about “ethics.”
Unfortunately, finding the source is not the same as automatically blocking it. While we were able to mitigate the problem-causing traffic, doing so required careful information-gathering and analysis, rather than simply tweaking a firewall.
Making publicly available content only selectively available is challenging, and getting that selection wrong (for example, accidentally blocking a legitimate search engine crawler or a real reader) carries potentially severe reputational consequences for our customer. These scrapers stay inside the lines of technical security while violating those of copyright, and the suite of tools we have at our disposal for preventing and responding to attacks is simply not yet up to the task of catching them.
At Brightspot, we’re optimistic about the opportunities that AI will bring to our customers and our partners. However, we know that, like every big step the internet has taken, this one comes with a corresponding wave of opportunistic abuse. We’re working with our engineers, our partners and the community to improve the tools and monitoring needed to meet this new kind of piracy. Part of that work is raising awareness, so please feel free to share this and join the discussion.