AI Crawler Management
robots.txt for AI bots, llms.txt and llms-full.txt status and debate, Cloudflare AI bot management, and the visibility-vs-IP-protection trade-off.
The fastest way to lose AI visibility in 2026 is to copy a “block all AI bots” robots.txt template from a privacy-first blog post and ship it without thinking. The fastest way to gain AI visibility is to make a deliberate, per-bot policy distinguishing training crawlers from search/retrieval crawlers and let the right ones in. This module covers that policy, plus the live debate around llms.txt.
TL;DR
- Training crawlers and search crawlers are different bots. `GPTBot` (training) is separate from `OAI-SearchBot` (live search retrieval). `ClaudeBot` and `Anthropic-AI` (training) are separate from Claude's web tools (Brave-mediated). Block training, allow search if you want citations.
- `llms.txt` is a proposed standard, not a deployed one. As of May 2026, no major AI engine officially honors `llms.txt` for retrieval prioritization. Implement it as a low-cost hedge but don't rely on it as a control surface.
- Cloudflare's AI bot management gives the most granular control. The verified bot directory plus per-bot allow/block rules, plus AI Audit and pay-per-crawl, are the production-grade lever. Start there if you need real enforcement.
The mental model
AI crawler management is like managing visitors to a museum gift shop. The gift shop wants tourists who’ll buy postcards (search retrieval bots that send users back to your site). It doesn’t necessarily want photographers who’ll publish your inventory in their own catalog with no attribution and no kickback (training crawlers that bake your content into a model with zero traffic return).
But you can’t simply ban all cameras. Some “photographers” turn out to be journalists who’ll write a feature about the gift shop and bring you customers (search crawlers and on-demand fetchers from the same vendor). The right policy is per-camera: which lens, on whose behalf, for what use. That’s exactly the per-user-agent policy the rest of this module describes.
The llms.txt debate fits inside this metaphor: it’s a sign on the door saying “preferred photographers, please use this gallery first.” Whether photographers honor the sign is up to them; today most don’t.
Deep dive: the 2026 reality
The user-agent landscape in May 2026:
| Bot | Vendor | Purpose | Block to opt out of |
|---|---|---|---|
| `GPTBot` | OpenAI | Model training | Training only |
| `OAI-SearchBot` | OpenAI | ChatGPT Search live retrieval | Visibility (do NOT block) |
| `ChatGPT-User` | OpenAI | On-demand user fetches | User-driven reads |
| `ClaudeBot` | Anthropic | Model training | Training only |
| `Anthropic-AI` | Anthropic | Older training crawler | Training only |
| `Claude-Web` | Anthropic | On-demand fetches via Claude | User-driven reads |
| `Google-Extended` | Google | Gemini training opt-out | Training only |
| `Googlebot` | Google | Search index, AIO, AI Mode | Do NOT block |
| `PerplexityBot` | Perplexity | Index crawl + retrieval | Visibility |
| `Perplexity-User` | Perplexity | On-demand user fetches | User-driven reads |
| `CCBot` | Common Crawl | Open dataset (used by many models) | Training pipelines |
| `Bytespider` | ByteDance | TikTok / Doubao training | Training |
| `Meta-ExternalAgent` | Meta | Llama training | Training |
| `MistralAI-User` | Mistral | On-demand fetches | Mistral usage |
| `Diffbot` | Diffbot | Knowledge graph, used by some LLM stacks | Knowledge graph |
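The training/retrieval/on-demand split in the table can be encoded directly in log tooling. A minimal sketch: the bot names come from the table above, and anything unlisted falls through to "unknown" (treat those as candidates for the verified-bot check, not automatic blocks).

```python
# Classify a crawler user-agent string into the categories from the table.
# Bot-name sets mirror the table above; unlisted agents return "unknown".

TRAINING = {"GPTBot", "ClaudeBot", "Anthropic-AI", "Google-Extended",
            "CCBot", "Bytespider", "Meta-ExternalAgent"}
RETRIEVAL = {"OAI-SearchBot", "PerplexityBot", "Googlebot"}
ON_DEMAND = {"ChatGPT-User", "Claude-Web", "Perplexity-User", "MistralAI-User"}

def classify(user_agent: str) -> str:
    """Return 'training', 'retrieval', 'on-demand', or 'unknown'."""
    for token in (user_agent or "").replace(";", " ").split():
        name = token.split("/")[0]  # e.g. "GPTBot/1.1" -> "GPTBot"
        if name in TRAINING:
            return "training"
        if name in RETRIEVAL:
            return "retrieval"
        if name in ON_DEMAND:
            return "on-demand"
    return "unknown"

print(classify("Mozilla/5.0 (compatible; GPTBot/1.1)"))   # training
print(classify("OAI-SearchBot/1.0; +https://openai.com")) # retrieval
```

Run this over a day of access logs and you get the same training-vs-retrieval breakdown Cloudflare's AI Audit shows, without the dashboard.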
The visibility vs. IP-protection trade-off. Three positions, all defensible:
- Maximum visibility — allow every crawler. Maximum AI citation potential, but your content is included in training corpora with no compensation or attribution control.
- Block training, allow search — block `GPTBot`, `ClaudeBot`, `Google-Extended`, `Anthropic-AI`, `CCBot`, `Bytespider`, `Meta-ExternalAgent`. Allow `OAI-SearchBot`, `PerplexityBot`, `Perplexity-User`, `ChatGPT-User`, `Claude-Web`, `Googlebot`. The most common 2026 enterprise position.
- Block everything — only valid if you don't want any AI visibility. Common for paywalled news, IP-heavy research, regulated content.
llms.txt and llms-full.txt. Proposed by Jeremy Howard (fast.ai) in September 2024 as a Markdown manifest at /llms.txt listing your most important pages and content for LLMs. /llms-full.txt is a complete dump of that content in Markdown. Adoption status as of May 2026:
| Engine | Honors llms.txt? | Notes |
|---|---|---|
| ChatGPT Search | No official support | OpenAI has not committed to consuming it |
| Perplexity | No official support | Curated index dominates |
| Google AIO/AI Mode | No | Gary Illyes publicly stated they don’t use it |
| Claude with web | No | Brave-mediated retrieval |
| Manual prompts (“read llms.txt”) | Yes | Only when user explicitly references it |
Despite the absence of official support, shipping llms.txt is low-cost and forward-compatible. If adoption arrives in late 2026 or 2027, you’ll be ready. If not, no harm done.
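Since llms.txt is plain Markdown, shipping one costs a few lines of tooling. A minimal sketch of a generator following the shape of the fast.ai proposal (H1 title, blockquote summary, H2 sections of links); the Acme names and URLs are illustrative placeholders:

```python
# Render an llms.txt manifest from a curated page list.
# Structure follows the fast.ai llms.txt proposal: H1, "> summary",
# then H2 sections containing "- [title](url): note" lines.

def render_llms_txt(site: str, summary: str, sections: dict) -> str:
    lines = [f"# {site}", "", f"> {summary}", ""]
    for heading, pages in sections.items():
        lines.append(f"## {heading}")
        for title, url, note in pages:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines)

doc = render_llms_txt(
    "Acme",  # placeholder site
    "Acme is a SaaS company helping mid-market teams automate procurement.",
    {"Core docs": [
        ("Pricing", "https://acme.com/pricing", "Plans from $19/month"),
        ("Product overview", "https://acme.com/product", "Procurement automation"),
    ]},
)
print(doc)
```

Wire this into your build step so the manifest regenerates whenever the curated page list changes, rather than hand-editing a static file.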
Cloudflare AI Bot Management. Cloudflare ships the most usable production system for AI crawler control:
- AI Audit (Settings → Bots → AI Audit) shows which bots hit your site, request volume, and trends.
- Pay Per Crawl (launched July 2025): negotiate per-request pricing with crawlers; verified bots can pay to crawl.
- Block AI Scrapers and Crawlers one-click toggle blocks all known training-only bots while allowing verified search bots.
- Verified Bot directory distinguishes signed bot identities from spoofers.
Visualizing it
```mermaid
flowchart TD
  Bot[Incoming user-agent] --> Verify{Cloudflare verified bot?}
  Verify -->|Spoofed| Block1[Block as scraper]
  Verify -->|Verified| Class{Bot purpose?}
  Class -->|Training only| Train{Allow training?}
  Train -->|No| Block2[robots.txt Disallow]
  Train -->|Yes| Allow1[Allow + log]
  Class -->|Search/retrieval| Allow2[Allow + log]
  Class -->|On-demand user fetch| Allow3[Allow + log]
  Allow2 --> Index[Indexed for citations]
  Allow3 --> Live[Live retrieval to user]
  Index --> Cite[Citation in AI surface]
  Live --> Cite
```
Bad vs. expert
The bad approach
Copying a viral “block AI” robots.txt without auditing what you’re blocking. Or doing nothing and assuming silence is consent.
```txt
# robots.txt
User-agent: *
Disallow: /

# (Or, the equally common copy-paste from a privacy blog:)
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Googlebot
Disallow: /
```
This fails three ways: it accidentally blocks Googlebot (catastrophe), it blocks OAI-SearchBot (no ChatGPT visibility), and it blocks the on-demand user fetchers (no live citations). Many sites that shipped variants of this discovered six months later that they’d vanished from ChatGPT Search and Perplexity entirely.
The expert approach
Per-bot policy with explicit training-vs-retrieval separation, plus llms.txt as a forward hedge, plus Cloudflare-side enforcement.
```txt
# robots.txt — block training, allow retrieval
# Last updated: 2026-04-22

# === Training crawlers: blocked ===
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Anthropic-AI
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /

# === Retrieval and on-demand: allowed ===
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: MistralAI-User
Allow: /

# === Standard search: always allowed ===
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /

# === Default for unspecified bots ===
User-agent: *
Allow: /

Sitemap: https://acme.com/sitemap.xml
```
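Before deploying a per-bot policy, it's worth sanity-checking it with Python's stdlib robots.txt parser. A minimal sketch using an abbreviated version of the policy above; the assertions encode the intent (training blocked, retrieval and classic search allowed):

```python
# Verify a robots.txt policy with urllib.robotparser before shipping it.
# ROBOTS is an abbreviated stand-in for the full file above.
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Googlebot
Allow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Training crawler must be blocked; retrieval and search must be allowed.
assert not rp.can_fetch("GPTBot", "https://acme.com/pricing")
assert rp.can_fetch("OAI-SearchBot", "https://acme.com/pricing")
assert rp.can_fetch("Googlebot", "https://acme.com/pricing")
print("robots.txt policy OK")
```

Running this as a CI check on the real file catches the "accidentally blocked Googlebot" failure mode from the bad example before it ships.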
```markdown
# /llms.txt
# Acme — short context for LLMs

> Acme is a 2026 SaaS company helping mid-market teams automate procurement.

## Core docs
- [Pricing](https://acme.com/pricing): Plans range $19-$199/month
- [Product overview](https://acme.com/product): Procurement automation features
- [Customer stories](https://acme.com/customers): Verified ROI case studies

## Reference
- [Public benchmark data Q1 2026](https://acme.com/benchmark-2026)
- [API documentation](https://acme.com/api)
```
```yaml
# Cloudflare configuration (via dashboard or wrangler)
features:
  ai_audit: enabled
  block_ai_scrapers_and_crawlers: enabled  # auto-blocks unverified scrapers
  pay_per_crawl: optional
verified_bots_allow:
  - OAI-SearchBot
  - PerplexityBot
  - Googlebot
  - Bingbot
  - ClaudeBot  # only if you choose to allow training
firewall_rules:
  - if: "(cf.bot_management.score < 30)"
    then: "block"
```
This wins because it preserves AI visibility on every retrieval surface, blocks training crawlers cleanly, ships llms.txt as a forward hedge, and uses Cloudflare to enforce against spoofed user-agents (where robots.txt is purely advisory).
Do this today
- Open your `robots.txt` and audit every `User-agent` block. Confirm `Googlebot`, `Bingbot`, `OAI-SearchBot`, `PerplexityBot`, and `Claude-Web` are allowed. If any are blocked accidentally, fix it before lunch.
- Decide your training policy: block training (most enterprises), or allow it (if you want training-stage memorization). Apply it consistently to `GPTBot`, `ClaudeBot`, `Anthropic-AI`, `Google-Extended`, `CCBot`, `Bytespider`, `Meta-ExternalAgent`.
- Sign in to Cloudflare (or your CDN) and enable AI Audit. Watch a week of traffic to see which bots actually hit you. Surprises are common — Bytespider crawls aggressively even on small sites.
- Toggle Block AI Scrapers and Crawlers in Cloudflare if you want a quick training opt-out without manually maintaining robots.txt.
- Ship `/llms.txt` as a Markdown manifest pointing to your top 10 pages. Use the fast.ai llms.txt spec for format. Optionally also ship `/llms-full.txt` with a full Markdown export of those pages.
- In your CDN logs (or Cloudflare → Logs Engine), set up alerts on `OAI-SearchBot`, `PerplexityBot`, `Perplexity-User`, and `Claude-Web` crawl rates. Sudden drops correlate with citation losses 5–14 days later.
- Verify bots properly: spoofed user-agents are common. Use Cloudflare's verified bot list, or do reverse-DNS verification yourself (`OAI-SearchBot` → IP belongs to OpenAI's published ranges).
- If you operate in Europe, document your AI training opt-out for TDM (text and data mining) purposes under EU Copyright Directive Article 4. A machine-readable opt-out via robots.txt plus a contractual statement is the current best practice.
- For paid content or high-IP-value pages, consider Cloudflare Pay Per Crawl. It lets verified search crawlers negotiate per-request pricing rather than being blanket-blocked.
- Schedule a quarterly review of the user-agent table above. New bots launch every quarter; the list will continue to evolve. Track via Dark Visitors (darkvisitors.com), which catalogs new AI crawlers as they appear.
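The reverse-DNS verification step can be scripted with the stdlib. A hedged sketch of the standard two-step check (reverse-resolve the claimed crawler IP, confirm the hostname suffix, then forward-resolve to make sure it maps back); the suffix table here is illustrative, so check each vendor's published verification docs for the authoritative suffixes and IP ranges:

```python
# Two-step reverse-DNS bot verification (the technique Google documents
# for Googlebot). Suffixes below are illustrative assumptions — confirm
# them against each vendor's official verification documentation.
import socket

EXPECTED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
}

def verify_crawler(ip: str, claimed_bot: str) -> bool:
    suffixes = EXPECTED_SUFFIXES.get(claimed_bot)
    if not suffixes:
        return False  # no published suffix known for this bot
    try:
        host, _, _ = socket.gethostbyaddr(ip)            # step 1: reverse DNS
        if not host.endswith(suffixes):
            return False
        forward_ips = socket.gethostbyname_ex(host)[2]   # step 2: forward confirm
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False  # no PTR record or resolution failure -> treat as spoofed

# Example (requires network access):
# verify_crawler("66.249.66.1", "Googlebot")
```

For vendors that publish IP ranges instead of DNS suffixes (OpenAI does for `OAI-SearchBot`), swap step 1 for a CIDR membership check against the published JSON list.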