Module 073 · Advanced · 19 min read

AI Crawler Management

robots.txt for AI bots, llms.txt and llms-full.txt status and debate, Cloudflare AI bot management, and the visibility-vs-IP-protection trade-off.

By SEO Mastery Editorial

The fastest way to lose AI visibility in 2026 is to copy a “block all AI bots” robots.txt template from a privacy-first blog post and ship it without thinking. The fastest way to gain AI visibility is to set a deliberate per-bot policy that distinguishes training crawlers from search/retrieval crawlers and lets the right ones in. This module covers that policy, plus the live debate around llms.txt.

TL;DR

  • Training crawlers and search crawlers are different bots. GPTBot (training) is separate from OAI-SearchBot (live search retrieval). ClaudeBot and Anthropic-AI (training) are separate from Claude’s web tools (Brave-mediated). Block training, allow search if you want citations.
  • llms.txt is a proposed standard, not a deployed one. As of May 2026, no major AI engine officially honors llms.txt for retrieval prioritization. Implement it as a low-cost hedge but don’t rely on it as a control surface.
  • Cloudflare’s AI bot management gives the most granular control. The verified bot directory plus per-bot allow/block rules, plus AI Audit and pay-per-crawl, are the production-grade lever. Start there if you need real enforcement.

The mental model

AI crawler management is like managing visitors to a museum gift shop. The gift shop wants tourists who’ll buy postcards (search retrieval bots that send users back to your site). It doesn’t necessarily want photographers who’ll publish your inventory in their own catalog with no attribution and no kickback (training crawlers that bake your content into a model with zero traffic return).

But you can’t simply ban all cameras. Some “photographers” turn out to be journalists who’ll write a feature about the gift shop and bring you customers (search crawlers and on-demand fetchers from the same vendor). The right policy is per-camera: which lens, on whose behalf, for what use. That’s exactly the per-user-agent policy the rest of this module describes.

The llms.txt debate fits inside this metaphor: it’s a sign on the door saying “preferred photographers, please use this gallery first.” Whether photographers honor the sign is up to them; today most don’t.

Deep dive: the 2026 reality

The user-agent landscape in May 2026:

| Bot | Vendor | Purpose | Blocking opts you out of |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Model training | Training only |
| OAI-SearchBot | OpenAI | ChatGPT Search live retrieval | Visibility (do NOT block) |
| ChatGPT-User | OpenAI | On-demand user fetches | User-driven reads |
| ClaudeBot | Anthropic | Model training | Training only |
| Anthropic-AI | Anthropic | Older training crawler | Training only |
| Claude-Web | Anthropic | On-demand fetches via Claude | User-driven reads |
| Google-Extended | Google | Gemini training opt-out | Training only |
| Googlebot | Google | Google search index, AIO, AI Mode | Do NOT block |
| PerplexityBot | Perplexity | Index crawl + retrieval | Visibility |
| Perplexity-User | Perplexity | On-demand user fetches | User-driven reads |
| CCBot | Common Crawl | Open dataset (used by many models) | Training pipelines |
| Bytespider | ByteDance | TikTok / Doubao training | Training |
| Meta-ExternalAgent | Meta | Llama training | Training |
| MistralAI-User | Mistral | On-demand fetches | Mistral usage |
| Diffbot | Diffbot | Knowledge graph, used by some LLM stacks | Knowledge graph |

The visibility vs. IP-protection trade-off. Three positions, all defensible:

  1. Maximum visibility — allow every crawler. Maximum AI citation potential, but your content is included in training corpora with no compensation or attribution control.
  2. Block training, allow search — block GPTBot, ClaudeBot, Google-Extended, Anthropic-AI, CCBot, Bytespider, Meta-ExternalAgent. Allow OAI-SearchBot, Perplexity-User, ChatGPT-User, Claude-Web, Googlebot. The most common 2026 enterprise position (see the generator sketch after this list).
  3. Block everything — only valid if you don’t want any AI visibility. Common for paywalled news, IP-heavy research, regulated content.
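
Position 2 is mechanical enough to generate. Here is a minimal Python sketch: the bot names come from the table above, while the POLICY map, the render_robots helper, and acme.com are illustrative, not any standard API.

# generate_robots.py: build a "block training, allow retrieval" robots.txt
POLICY = {
    # Training crawlers: opt out.
    "GPTBot": "block", "ClaudeBot": "block", "Anthropic-AI": "block",
    "Google-Extended": "block", "CCBot": "block", "Bytespider": "block",
    "Meta-ExternalAgent": "block",
    # Retrieval and on-demand fetchers: keep for citations.
    "OAI-SearchBot": "allow", "ChatGPT-User": "allow", "PerplexityBot": "allow",
    "Perplexity-User": "allow", "Claude-Web": "allow", "MistralAI-User": "allow",
    # Classic search: never block.
    "Googlebot": "allow", "Bingbot": "allow",
}

def render_robots(policy, sitemap):
    groups = [
        f"User-agent: {bot}\n{'Disallow' if action == 'block' else 'Allow'}: /"
        for bot, action in policy.items()
    ]
    groups.append("User-agent: *\nAllow: /")  # default for unlisted bots
    return "\n\n".join(groups) + f"\n\nSitemap: {sitemap}\n"

print(render_robots(POLICY, "https://acme.com/sitemap.xml"))

Keeping the policy in one data structure means your robots.txt, your CDN rules, and your quarterly audit all read from a single source of truth.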

llms.txt and llms-full.txt. Proposed by Jeremy Howard (fast.ai) in September 2024 as a Markdown manifest at /llms.txt listing your most important pages and content for LLMs. /llms-full.txt is a complete dump of that content in Markdown. Adoption status as of May 2026:

| Engine | Honors llms.txt? | Notes |
| --- | --- | --- |
| ChatGPT Search | No official support | OpenAI has not committed to consuming it |
| Perplexity | No official support | Curated index dominates |
| Google AIO/AI Mode | No | Gary Illyes publicly stated they don’t use it |
| Claude with web | No | Brave-mediated retrieval |
| Manual prompts (“read llms.txt”) | Yes | Only when the user explicitly references it |

Despite the absence of official support, shipping llms.txt is low-cost and forward-compatible. If adoption arrives in late 2026 or 2027, you’ll be ready. If not, no harm done.
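
Since nothing consumes the file yet, the only automation worth having today is a smoke test that it exists and matches the spec’s Markdown shape. A minimal standard-library sketch; acme.com is a placeholder domain:

from urllib.request import urlopen

URL = "https://acme.com/llms.txt"  # placeholder domain; use your own

with urlopen(URL, timeout=10) as resp:
    assert resp.status == 200, f"unexpected status {resp.status}"
    body = resp.read().decode("utf-8")

# The proposed format is Markdown: an H1 title, a blockquote summary,
# then sections of [title](url) link lists.
assert body.lstrip().startswith("# "), "should open with an H1"
assert "\n> " in body, "should include a blockquote summary"
assert "](https://" in body, "should contain absolute Markdown links"
print(f"OK: {len(body)} bytes, {body.count('](')} links")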

Cloudflare AI Bot Management. Cloudflare ships the most usable production system for AI crawler control:

  • AI Audit (Settings → Bots → AI Audit) shows which bots hit your site, request volume, and trends (a DIY log-based approximation is sketched after this list).
  • Pay Per Crawl (launched July 2025): negotiate per-request pricing with crawlers; verified bots can pay to crawl.
  • Block AI Scrapers and Crawlers one-click toggle blocks all known training-only bots while allowing verified search bots.
  • Verified Bot directory distinguishes signed bot identities from spoofers.
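
If you are not behind Cloudflare, you can approximate AI Audit from your own access logs. A rough sketch that tallies hits per AI user-agent from a combined-format log; the filename and the bot list are illustrative:

import re
from collections import Counter

AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
           "Anthropic-AI", "PerplexityBot", "Perplexity-User", "Google-Extended",
           "CCBot", "Bytespider", "Meta-ExternalAgent", "MistralAI-User"]

hits = Counter()
with open("access.log") as log:  # combined log format: UA is the last quoted field
    for line in log:
        match = re.search(r'"([^"]*)"\s*$', line)
        if not match:
            continue
        ua = match.group(1).lower()
        for bot in AI_BOTS:
            if bot.lower() in ua:
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot:20} {count:>8}")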

Visualizing it

flowchart TD
  Bot[Incoming user-agent] --> Verify{Cloudflare verified bot?}
  Verify -->|Spoofed| Block1[Block as scraper]
  Verify -->|Verified| Class{Bot purpose?}
  Class -->|Training only| Train{Allow training?}
  Train -->|No| Block2[robots.txt Disallow]
  Train -->|Yes| Allow1[Allow + log]
  Class -->|Search/retrieval| Allow2[Allow + log]
  Class -->|On-demand user fetch| Allow3[Allow + log]
  Allow2 --> Index[Indexed for citations]
  Allow3 --> Live[Live retrieval to user]
  Index --> Cite[Citation in AI surface]
  Live --> Cite

Bad vs. expert

The bad approach

Copying a viral “block AI” robots.txt without auditing what you’re blocking. Or doing nothing and assuming silence is consent.

# robots.txt
User-agent: *
Disallow: /

# (Or, the equally common copy-paste from a privacy blog:)
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Googlebot
Disallow: /

This fails three ways: it accidentally blocks Googlebot (catastrophe), it blocks OAI-SearchBot (no ChatGPT visibility), and it blocks the on-demand user fetchers (no live citations). Many sites that shipped variants of this discovered six months later that they’d vanished from ChatGPT Search and Perplexity entirely.

The expert approach

Per-bot policy with explicit training-vs-retrieval separation, plus llms.txt as a forward hedge, plus Cloudflare-side enforcement.

# robots.txt — block training, allow retrieval
# Last updated: 2026-04-22

# === Training crawlers: blocked ===
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Anthropic-AI
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# === Retrieval and on-demand: allowed ===
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: MistralAI-User
Allow: /

# === Standard search: always allowed ===
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# === Default for unspecified bots ===
User-agent: *
Allow: /

Sitemap: https://acme.com/sitemap.xml
The llms.txt forward hedge ships as a separate file:

# /llms.txt
# Acme — short context for LLMs

> Acme is a 2026 SaaS company helping mid-market teams automate procurement.

## Core docs

- [Pricing](https://acme.com/pricing): Plans range $19-$199/month
- [Product overview](https://acme.com/product): Procurement automation features
- [Customer stories](https://acme.com/customers): Verified ROI case studies

## Reference

- [Public benchmark data Q1 2026](https://acme.com/benchmark-2026)
- [API documentation](https://acme.com/api)
And Cloudflare-side enforcement. The block below is an illustrative pseudo-config, not a literal wrangler schema: these settings live as toggles and rules in the Cloudflare dashboard.

# Cloudflare configuration (illustrative sketch of dashboard settings)
features:
  ai_audit: enabled
  block_ai_scrapers_and_crawlers: enabled  # auto-blocks unverified scrapers
  pay_per_crawl: optional
verified_bots_allow:
  - OAI-SearchBot
  - PerplexityBot
  - Googlebot
  - Bingbot
  - ClaudeBot  # if you choose to allow training
firewall_rules:
  - if: "(cf.bot_management.score < 30)"
    then: "block"

This wins because it preserves AI visibility on every retrieval surface, blocks training crawlers cleanly, ships llms.txt as a forward hedge, and uses Cloudflare to enforce against spoofed user-agents (where robots.txt is purely advisory).
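
A policy this long deserves a regression test, so a future edit cannot silently reintroduce the blocked-Googlebot failure from the bad example. A minimal sketch using Python’s standard-library parser; the file path and test URL are yours to substitute:

from urllib import robotparser

rp = robotparser.RobotFileParser()
with open("robots.txt") as f:  # the expert file above
    rp.parse(f.read().splitlines())

URL = "https://acme.com/pricing"  # any representative page

# Retrieval surfaces must stay reachable...
for bot in ["Googlebot", "Bingbot", "OAI-SearchBot", "ChatGPT-User",
            "PerplexityBot", "Perplexity-User", "Claude-Web"]:
    assert rp.can_fetch(bot, URL), f"{bot} blocked: visibility regression"

# ...while training crawlers stay out.
for bot in ["GPTBot", "ClaudeBot", "Anthropic-AI", "Google-Extended",
            "CCBot", "Bytespider", "Meta-ExternalAgent"]:
    assert not rp.can_fetch(bot, URL), f"{bot} allowed: training leak"

print("robots.txt policy OK")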

Do this today

  1. Open your robots.txt and audit every User-agent block. Confirm Googlebot, Bingbot, OAI-SearchBot, PerplexityBot, and Claude-Web are allowed. If any are blocked accidentally, fix it before lunch.
  2. Decide your training policy: block training (most enterprises), or allow it (if you want training-stage memorization). Apply consistently to GPTBot, ClaudeBot, Anthropic-AI, Google-Extended, CCBot, Bytespider, Meta-ExternalAgent.
  3. Sign in to Cloudflare (or your CDN) and enable AI Audit. Watch a week of traffic to see which bots actually hit you. Surprises are common — Bytespider crawls aggressively even on small sites.
  4. Toggle Block AI Scrapers and Crawlers in Cloudflare if you want a quick training opt-out without manually maintaining robots.txt.
  5. Ship /llms.txt as a Markdown manifest pointing to your top 10 pages. Use the fast.ai llms.txt spec for format. Optionally also ship /llms-full.txt with a full Markdown export of those pages.
  6. In your CDN logs (or Cloudflare → Logs Engine), set up alerts on OAI-SearchBot, PerplexityBot, Perplexity-User, Claude-Web crawl rates. Sudden drops correlate with citation losses 5–14 days later.
  7. Verify bots properly: spoofed user-agents are common. Use Cloudflare’s verified bot list, or do forward-confirmed reverse-DNS verification yourself (sketched after this list); for OAI-SearchBot, confirm the IP belongs to OpenAI’s published ranges.
  8. If you operate in Europe, document your AI training opt-out for TDM (text and data mining) purposes under EU Copyright Directive Article 4. A machine-readable opt-out via robots.txt + a contractual statement is the current best practice.
  9. For paid content or high-IP-value pages, consider Cloudflare Pay Per Crawl. It allows verified search crawlers to negotiate per-request pricing rather than blanket-blocking.
  10. Schedule a quarterly review of the user-agent table above. New bots launch every quarter; the list will continue to evolve. Track via Dark Visitors (darkvisitors.com), which catalogs new AI crawlers as they appear.
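
For step 7, forward-confirmed reverse DNS is the method Google documents for Googlebot; OpenAI and Perplexity publish IP range lists instead, so for their bots compare the address against the published ranges. A minimal stdlib sketch for the Googlebot case:

import socket

def verify_googlebot(ip):
    """Forward-confirmed reverse DNS, per Google's verification docs."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False  # hostname outside Google's domains: treat as spoofed
    try:
        # Forward-confirm: the claimed hostname must resolve back to the IP.
        return ip in {info[4][0] for info in socket.getaddrinfo(host, None)}
    except socket.gaierror:
        return False

print(verify_googlebot("66.249.66.1"))  # an address in Google's published crawl range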

