How to Block AI Crawlers from Scraping Your Website Using Cloudflare

Here’s a stat that stopped me mid-scroll: for the first time ever, bots now outnumber humans on the internet. Cloudflare dropped that bombshell alongside a policy shift that gives every website owner a real lever to pull — starting September 15, 2026, mixed-use AI crawlers get blocked by default on pages that serve ads.

Cloudflare office entrance — *Image: HaeB via Wikimedia Commons (CC BY-SA 4.0)*

If you run a blog, a portfolio, a SaaS landing page, or anything with content you actually wrote, this matters. A lot.

AI companies have been vacuuming up web content for years (training models, powering AI search results, feeding agentic chatbots), and most site owners never got a say in it. If you’ve been building AI agents yourself, you already know how hungry these systems are for fresh data. Cloudflare’s new tools finally let you decide who scrapes your site and for what purpose. Even better, Cloudflare is building a marketplace where you can actually get paid when your content creates value inside an AI product.

I spent the morning digging through the settings on my own Cloudflare dashboard. Here’s exactly how to lock it down.

What Actually Changed (And Why It Matters)

Cloudflare broke AI crawlers into three categories:

Search bots — Crawlers that index your site so it shows up in search results. Think Googlebot, Bingbot. These you want.
AI Training bots — Crawlers collecting your content to train language models. These you probably don’t want for free. And if you’ve ever run a model locally with Ollama, you know exactly how much data goes into training even a modest LLM.
Agent bots — Crawlers fetching your content to feed into AI agents that answer user questions. These sit in a gray area — they might drive traffic, they might not.

The problem, as Cloudflare pointed out, is that the biggest players blend all three into one crawler. Googlebot, for example, crawls for Search, AI Overviews, and AI Mode — all under the same user agent. You can’t say “index me for search but don’t train on me” with a single toggle. Until now.

Starting September 15, Cloudflare flips the default: for free-plan sites, new customers, and newly added sites, mixed-use crawlers are blocked from ad-monetized pages automatically. You can tweak this, but the default now favors publishers.

Step 1: Find Your AI Bot Controls

Log into your Cloudflare dashboard, pick your domain, and head to Security → Bots. If you’re on the free plan, you’ll see a simplified view. Pro and Business plans get more granular controls.

Here’s what you’re looking for:

Verified Bots — These are the good guys. Googlebot, Bingbot, and other known search crawlers show up here. Don’t block these unless you want to disappear from search results entirely.
Bot Fight Mode — This is the free tier’s blunt instrument. It blocks known malicious bots and challenges suspicious traffic with JavaScript. Turn it on. It won’t stop AI training crawlers specifically, but it cleans up a lot of garbage traffic.
AI Bot Controls (Pro/Business plans) — This is the new section where you can toggle Search, Agent, and Training bots separately.

I keep Bot Fight Mode on across every site I manage. It catches credential-stuffing attempts, comment spam bots, and vulnerability scanners without touching legitimate traffic. Zero configuration, just a toggle.

Step 2: Set Up WAF Rules for Specific AI Crawlers

The new AI bot categories are great for Pro users. But if you’re on the free plan — or if you want more surgical control — WAF custom rules are your best friend. Think of it like the approach I laid out in my npm dependency audit guide: you start with a broad sweep and then get specific.

Go to Security → WAF → Custom Rules and create a new rule. Here are the user agents you’ll want to watch:

GPTBot — OpenAI’s crawler
CCBot — Common Crawl (used by many AI training datasets)
ClaudeBot — Anthropic’s crawler
anthropic-ai — Another Anthropic variant
Bytespider — ByteDance/TikTok’s crawler
PerplexityBot — Perplexity AI
Omgili / Omgilibot — Used by several AI search engines
FacebookBot — Meta’s crawler (sometimes used for AI training)

For each, the rule looks like this:

Field: User Agent
Operator: contains
Value: GPTBot
Action: Block

You can stack multiple user agents in one rule using OR logic, but I prefer separate rules — it makes it easier to audit later and adjust if a crawler changes their policy.
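
If you switch the rule builder to its expression editor, the same rule is a single line of text. Here’s a minimal sketch that generates that line for one crawler or for several joined with OR. It assumes the standard `http.user_agent` field and `contains` operator from Cloudflare’s rules language; the `build_expression` helper and the `AI_CRAWLERS` list are my own illustration, not anything Cloudflare ships.

```python
# Hypothetical helper: builds the text of a WAF custom-rule expression.
# The crawler names come from the list above; adjust to taste.
AI_CRAWLERS = ["GPTBot", "CCBot", "ClaudeBot", "Bytespider", "PerplexityBot"]


def build_expression(user_agents):
    """Return one `contains` clause per user agent, joined with `or`."""
    clauses = [f'(http.user_agent contains "{ua}")' for ua in user_agents]
    return " or ".join(clauses)


# One crawler per rule (easier to audit later).
single_rule = build_expression(["GPTBot"])

# All crawlers stacked into one rule.
combined_rule = build_expression(AI_CRAWLERS)

print(single_rule)
print(combined_rule)
```

Paste the printed string into the expression editor and set the action to Block. Keep in mind that `contains` compares text case-sensitively, so the value has to match the casing the crawler actually sends.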

One caveat here: blocking by user agent only works against bots that identify themselves honestly. Bots can spoof user agents (and some do). But most legitimate AI companies send their real user agent and respect robots.txt, because getting sued for scraping isn’t great for business.

Step 3: Don’t Forget robots.txt

Cloudflare’s bot controls work at the network edge. They stop requests before they hit your server. But you should also configure your robots.txt — it’s the polite sign on the door, and many AI crawlers check it first.

Add this to your robots.txt:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Bytespider
Disallow: /
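
If you’d rather keep one list of crawler names and generate the file from it, something like this works. It’s a rough sketch: `build_robots_txt` and the `BLOCKED` list are names I made up for the example, and it only emits the blanket `Disallow: /` groups shown above.

```python
def build_robots_txt(blocked_agents):
    """Return robots.txt text with one disallow-everything group per agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    # Blank lines separate groups; end the file with a trailing newline.
    return "\n\n".join(groups) + "\n"


# Same crawlers as the hand-written file above.
BLOCKED = ["GPTBot", "CCBot", "ClaudeBot", "anthropic-ai", "Bytespider"]

robots_txt = build_robots_txt(BLOCKED)
print(robots_txt)
```

Adding a crawler later is then a one-word change to the list instead of a copy-paste edit to the file.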

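If you want to allow search but block training for Google specifically, Google provides Google-Extended as a separate robots.txt token. It isn’t a distinct crawler with its own user-agent string, so it only works here, not in a WAF rule: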

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

This tells Google: “Index my pages for search, but don’t use them to train Gemini.” It does not opt you out of AI Overviews, which are part of Search and run on the regular Googlebot crawl. Whether Google actually respects this is a separate conversation, but it’s the officially supported mechanism, and it costs you nothing to implement.
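
Before uploading the file, it’s worth a quick sanity check that the rules parse the way you expect. This sketch uses Python’s standard-library `urllib.robotparser`; the trimmed `ROBOTS_TXT`, the `allowed` helper, the example.com URL, and the `SomeOtherBot` name are placeholders of mine. It only confirms how a standards-following parser reads the file, not that any given crawler will obey it.

```python
from urllib.robotparser import RobotFileParser

# A trimmed copy of the rules from this section.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""


def allowed(agent, url="https://example.com/blog/post"):
    """Return True if `agent` may fetch `url` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


agents = ("GPTBot", "Googlebot", "Google-Extended", "SomeOtherBot")
results = {agent: allowed(agent) for agent in agents}
print(results)
```

GPTBot and Google-Extended should come back as blocked, while Googlebot and any crawler you didn’t list stay allowed.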

Step 4: Check the Attribution Dashboard

Once your rules are in place, wait 24-48 hours and check Analytics → Bot Analytics in your Cloudflare dashboard. You’ll see a breakdown of bot traffic by category, which crawlers hit your site the most, and how many requests were blocked versus allowed.
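
If you also have user-agent strings from your own server logs, you can do a rough version of the same tally yourself. This is a sketch under assumptions: the `tally_ai_crawlers` helper is mine, the sample strings are invented for illustration, and pulling user agents out of your real log format is left to you.

```python
from collections import Counter

# Crawler names from the WAF section above.
AI_CRAWLERS = ["GPTBot", "CCBot", "ClaudeBot", "anthropic-ai",
               "Bytespider", "PerplexityBot"]


def tally_ai_crawlers(user_agent_strings):
    """Count requests per known AI crawler, matching case-insensitively."""
    counts = Counter()
    for ua in user_agent_strings:
        lowered = ua.lower()
        for name in AI_CRAWLERS:
            if name.lower() in lowered:
                counts[name] += 1
                break  # count each request once
    return counts


# Made-up user-agent strings, only to show the shape of the input.
sample = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
]

counts = tally_ai_crawlers(sample)
print(counts.most_common())
```

Requests that reach your origin are the ones Cloudflare let through, so a nonzero count for a crawler you meant to block is a hint to recheck that rule.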

The new Attribution Business Insights dashboard (rolling out to all plans) goes further — it shows you which crawlers are fetching your most valuable pages, how often they re-fetch unchanged content, and what the potential commercial value of that traffic could be. Cloudflare says over 50% of AI crawl traffic is wasted on re-fetching pages that haven’t changed. That’s your server resources they’re burning.

The Monetization Angle: Should You Get Paid?

This is where Cloudflare’s play gets interesting. They’re building a marketplace — originally called “Pay Per Crawl,” now evolving into “Pay Per Use” — where publishers can charge AI companies when their content creates value.

Two launch partners are already on board:

Ceramic.ai pays publishers when their content appears in AI search results
You.com pays when it accesses premium publisher content

The vision is that instead of blocking every AI crawler, you selectively allow the ones that pay. Cloudflare handles the settlement in stablecoins over the x402 protocol — no payment stack of your own required.

Is this going to replace your AdSense income? Probably not. But if you’re already on Cloudflare and your content is getting scraped anyway, getting paid something is better than getting paid nothing.

What I’d Recommend for Most Sites

If you’re running a blog, portfolio, or small business site on Cloudflare’s free plan, here’s the 10-minute setup I’d do right now:

Turn on Bot Fight Mode (Security → Bots → toggle on)
Add WAF custom rules blocking GPTBot, CCBot, ClaudeBot, Bytespider, and PerplexityBot (5 rules, under a minute each)
Update your robots.txt with the disallow directives above
Add Google-Extended to your robots.txt — let Google index you but opt out of AI training
Check back in a week and review bot analytics to see what’s actually hitting your site

This won’t stop every AI crawler — nothing does. But it signals that you’re paying attention, and for legitimate AI companies that want to avoid legal headaches, that signal matters.

The internet is going through a weird transition right now. For 25 years, the deal was simple: let search engines crawl your site, and they’ll send traffic back. AI changed the math — they crawl your site and send back a summary, cutting you out entirely. We saw this tension play out when AI releases turned into political decisions practically overnight. Cloudflare’s new controls are the first serious attempt by an infrastructure company to rebalance that equation.

It won’t fix everything overnight. But it’s a lever we didn’t have last week. And when you’re a small publisher trying to keep the lights on, every lever counts.