Cloudflare Blocks AI Bots: A “Too Little, Too Late” Effort to Curb Scraping

Cloudflare will now block AI crawlers from accessing website content without permission, a move impacting AI model training. This policy, effective immediately for new domains, allows website owners to choose whether to allow AI scraping. Furthermore, Cloudflare will introduce a “pay per crawl” model, allowing publishers to charge AI crawlers for content access. As a major content delivery network, Cloudflare’s influence on roughly 16% of global internet traffic means this policy could significantly alter how AI companies obtain data.


Web giant Cloudflare’s decision to block AI bots from scraping content by default is a big move, and honestly, a little late to the party. But hey, better late than never, right? It’s a bit like trying to close the barn door after the horses have bolted. Still, I’m glad they’re doing something. They’re trying to stop AI bots from grabbing content from websites without permission.

The motivation, at least in my estimation, isn’t solely to protect creators. Creators have been optimizing content for search engines for years, churning out SEO-driven content to get to the top of search results. The reality is, search results are often curated, designed to push you toward a purchase or manipulate you in some way. It’s all geared towards the SEO and PPC industries, which are trying to protect the existing system from being disrupted by a more efficient tool. So, Cloudflare’s decision, I think, is primarily about preserving the existing search ecosystem.

Here’s the deal. Right now, it seems like 99% of the new content on the internet is generated by AI algorithms. The challenge for Cloudflare is how the sites it serves will stay relevant and keep their content showing up in search results. If your medical research or other vital content isn’t showing up in searches, then you will be in trouble.

Cloudflare is doing this by blocking specific AI bots. They’re classifying bots based on their user-agent signatures and behavior, separating them into AI and search engine categories. The “Block AI Bots” rule specifically targets bots from OpenAI (like ChatGPT and GPTBot), Amazon, Apple, ByteDance (TikTok), Anthropic (ClaudeBot), DuckDuckGo, Google, Meta, Huawei, and Common Crawl. Search indexers such as Googlebot will still be allowed.
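To make the user-agent part of that classification concrete, here is a minimal sketch in Python. It is not Cloudflare’s actual rule: the token lists, the category names, and the functions `classify_user_agent` and `should_block` are assumptions made up for illustration, and the behavioral side of the classification is left out entirely.

```python
# Illustrative token lists only. These are assumptions for the sketch,
# not the contents of Cloudflare's "Block AI Bots" rule.
AI_CRAWLER_TOKENS = ("gptbot", "claudebot", "ccbot", "bytespider", "amazonbot")
SEARCH_INDEXER_TOKENS = ("googlebot", "bingbot")


def classify_user_agent(user_agent: str) -> str:
    """Return 'ai_crawler', 'search_indexer', or 'other' (hypothetical labels)."""
    ua = user_agent.lower()
    # Substring match on the self-declared User-Agent header.
    if any(token in ua for token in AI_CRAWLER_TOKENS):
        return "ai_crawler"
    if any(token in ua for token in SEARCH_INDEXER_TOKENS):
        return "search_indexer"
    return "other"


def should_block(user_agent: str) -> bool:
    """Block AI crawlers; let search indexers and everything else through."""
    return classify_user_agent(user_agent) == "ai_crawler"


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
    ]
    for ua in samples:
        print(classify_user_agent(ua), should_block(ua))
```

A matcher like this only sees what a bot chooses to declare about itself. A crawler that sends an ordinary browser string falls into the “other” bucket, which is presumably why behavior is part of the classification too.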

However, the list of blocked bots isn’t perfect, and there are some notable omissions. Some of the Google bots, for example, are categorized incorrectly, allowing a major AI crawler to remain unblocked. They’re also leaving out PerplexityBot. It’s tough to say why, but perhaps they are strategically targeting certain players in the AI space.

Why is all of this necessary? Well, for site owners, these bots can be a real pain. They can cause slow servers because they’re constantly scraping sites. Plus, they can be expensive for small companies, especially if they’re using services like Amazon S3 to host their media. The bots are also tricky because many of them are disguised as regular users, using common user agent information like an Apple device or Chrome.

It’s a fast-moving world, and information needs to be constantly updated. Cloudflare’s moves matter here because information that isn’t kept up to date can quickly become meaningless.

And here’s where it gets even more interesting. The same bots that pull in content for the newest model updates are also pulling in information for AI responses. So now, AI tools depend on these crawlers to get their “internet content.”

Regarding concerns about the data used to train AI models, like information from government sources, that data isn’t necessarily in an easily digestible format. However, raw data can be valuable.

Some website owners have employed a clever workaround: adding a human check for browsers that land directly on the search page. This solves some of the problems that the bots cause.
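The human check for direct landings on the search page could be decided by something like the sketch below. It assumes “landing directly” means the request arrives without a Referer header from the site itself. The path `/search`, the host `example.com`, and the function `needs_human_check` are hypothetical names, and the challenge itself (a CAPTCHA or similar) is out of scope.

```python
from typing import Optional
from urllib.parse import urlparse

SEARCH_PATH = "/search"    # hypothetical path of the site's search page
SITE_HOST = "example.com"  # hypothetical hostname of the site


def needs_human_check(request_target: str, referer: Optional[str]) -> bool:
    """Decide whether a request should be shown a human check.

    Only requests for the search page are considered. A visitor who
    navigated there from another page on the same site is let through;
    a request with no Referer, or one from elsewhere, is challenged.
    """
    if urlparse(request_target).path != SEARCH_PATH:
        return False
    if not referer:
        return True
    return urlparse(referer).hostname != SITE_HOST


if __name__ == "__main__":
    print(needs_human_check("/search?q=cats", None))
    print(needs_human_check("/search?q=cats", "https://example.com/"))
    print(needs_human_check("/about", None))
```

The Referer header can be forged as easily as a user agent, so this is a speed bump rather than a barrier. It mainly filters out crawlers that hit search URLs blindly.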

The bigger picture here is how AI has changed things. Google search results are full of AI-generated blog posts now. It’s easier and better for people to either do a “Reddit” search or use something like Le Chat or ChatGPT.