More and more websites report suffering from AI attacks. To be precise, LLMs are the problem, again. Uberspace, an independent self-hosting provider, is running into trouble with its data and network usage. Read the Docs is dealing with terabytes of unwanted traffic, and Cloudflare is offering help. And the source of it all? AI scrapers.
Half of the traffic from AI crawlers
There are days when 30 to 50% of the requests reaching Uberspace come from bots gathering data for training large language models, most of them from Facebook and Anthropic.
Needless to say, Uberspace already has a robots.txt that denies crawlers permission. Sadly, corporate AI will not respect that. Why would they? Technology companies that don't mind killing the planet, jobs, the arts, and whole societies cannot be expected to respect a small web server's configuration.
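For illustration only: a robots.txt that asks AI crawlers to stay away could look like the sketch below. The user-agent strings are assumptions based on what these companies publicly document, they change over time, and the file is a polite request, not an enforcement mechanism.

```
# Hypothetical example; bot names are assumptions and need regular updating
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: FacebookBot
User-agent: meta-externalagent
User-agent: CCBot
Disallow: /
```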
73 Terabytes of abusive content
Nevertheless, the Uberspace story is worth reading, although it is only available in German. And they point out that they are not alone: only four weeks ago, Read the Docs reported similar problems, claiming to have suffered "73 TeraByte from one crawler" … in only one month (May 2024)! "This cost us over $5,000 in bandwidth charges, and we had to block the crawler." Uberspace took similar action.
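Blocking at the web-server level is the blunter tool. As a sketch only, and not Uberspace's or Read the Docs' actual configuration (the user-agent patterns are the same assumptions as above), an nginx setup could refuse such requests outright:

```
# Sketch: refuse requests from a few assumed AI crawler user agents.
# Both blocks belong inside the http {} context of nginx.conf.
map $http_user_agent $is_ai_bot {
    default                 0;
    "~*GPTBot"              1;
    "~*ClaudeBot"           1;
    "~*anthropic-ai"        1;
    "~*meta-externalagent"  1;
    "~*CCBot"               1;
}

server {
    listen 80;
    server_name example.org;

    # Unlike robots.txt, this is enforced, not merely requested.
    if ($is_ai_bot) {
        return 403;
    }
}
```

The obvious weakness: a crawler that lies about its user agent sails right through, which is exactly the dishonest behaviour Cloudflare complains about below.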
Customers don't want AI
And if you think the problem is exotic: it is not. It is so common that Cloudflare already launched an AI blocker "product" (really more a setting than a feature) back in late 2023.
Since more and more AI bots are misbehaving and ignoring robots.txt, Cloudflare has extended that offer: "To help, we've added a brand new one-click to block all AI bots," because "We hear clearly that customers don't want AI bots visiting their websites, and especially those that do so dishonestly."
Seems like AI's image is getting worse. It's about time to ditch the rich man's toy. Remember: LLM is not AI, and AI is not LLM. All of it is only IT. Modern IT. Some of it is ethical; LLMs are not.