Module 073 · Advanced · 19 min read

AI Crawler Management

robots.txt for AI bots, llms.txt and llms-full.txt status and debate, Cloudflare AI bot management, and the visibility-vs-IP-protection trade-off.

By SEO Mastery Editorial

The fastest way to lose AI visibility in 2026 is to copy a “block all AI bots” robots.txt template from a privacy-first blog post and ship it without thinking. The fastest way to gain AI visibility is to set a deliberate per-bot policy that distinguishes training crawlers from search/retrieval crawlers and lets the right ones in. This module covers that policy, plus the live debate around llms.txt.

TL;DR

  • Training crawlers and search crawlers are different bots. GPTBot (training) is separate from OAI-SearchBot (live search retrieval). ClaudeBot and Anthropic-AI (training) are separate from Claude’s web tools (Brave-mediated). Block training, allow search if you want citations.
  • llms.txt is a proposed standard, not a deployed one. As of May 2026, no major AI engine officially honors llms.txt for retrieval prioritization. Implement it as a low-cost hedge but don’t rely on it as a control surface.
  • Cloudflare’s AI bot management gives the most granular control. The verified bot directory plus per-bot allow/block rules, plus AI Audit and pay-per-crawl, are the production-grade lever. Start there if you need real enforcement.

The mental model

AI crawler management is like managing visitors to a museum gift shop. The gift shop wants tourists who’ll buy postcards (search retrieval bots that send users back to your site). It doesn’t necessarily want photographers who’ll publish your inventory in their own catalog with no attribution and no kickback (training crawlers that bake your content into a model with zero traffic return).

But you can’t simply ban all cameras. Some “photographers” turn out to be journalists who’ll write a feature about the gift shop and bring you customers (search crawlers and on-demand fetchers from the same vendor). The right policy is per-camera: which lens, on whose behalf, for what use. That’s exactly the per-user-agent policy the rest of this module describes.

The llms.txt debate fits inside this metaphor: it’s a sign on the door saying “preferred photographers, please use this gallery first.” Whether photographers honor the sign is up to them; today most don’t.

Deep dive: the 2026 reality

The user-agent landscape in May 2026:

| Bot | Vendor | Purpose | Blocking opts you out of |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Model training | Training only |
| OAI-SearchBot | OpenAI | ChatGPT Search live retrieval | Visibility (do NOT block) |
| ChatGPT-User | OpenAI | On-demand user fetches | User-driven reads |
| ClaudeBot | Anthropic | Model training | Training only |
| Anthropic-AI | Anthropic | Older training crawler | Training only |
| Claude-Web | Anthropic | On-demand fetches via Claude | User-driven reads |
| Google-Extended | Google | Gemini training opt-out | Training only |
| Googlebot | Google | Google search index, AIO, AI Mode | Do NOT block |
| PerplexityBot | Perplexity | Index crawl + retrieval | Visibility |
| Perplexity-User | Perplexity | On-demand user fetches | User-driven reads |
| CCBot | Common Crawl | Open dataset (used by many models) | Training pipelines |
| Bytespider | ByteDance | TikTok / Doubao training | Training |
| Meta-ExternalAgent | Meta | Llama training | Training |
| MistralAI-User | Mistral | On-demand fetches | Mistral usage |
| Diffbot | Diffbot | Knowledge graph, used by some LLM stacks | Knowledge graph |

The visibility vs. IP-protection trade-off. Three positions, all defensible:

  1. Maximum visibility — allow every crawler. Maximum AI citation potential, but your content is included in training corpora with no compensation or attribution control.
  2. Block training, allow search — block GPTBot, ClaudeBot, Google-Extended, Anthropic-AI, CCBot, Bytespider, Meta-ExternalAgent. Allow OAI-SearchBot, Perplexity-User, ChatGPT-User, Claude-Web, Googlebot. The most common 2026 enterprise position (see the generator sketch after this list).
  3. Block everything — only valid if you don’t want any AI visibility. Common for paywalled news, IP-heavy research, regulated content.
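
Position 2 is mechanical enough to generate. Here is a minimal Python sketch: the bot names come from the table above, while the POLICY map, the render_robots helper, and acme.com are illustrative, not any standard API.

# generate_robots.py: build a "block training, allow retrieval" robots.txt
POLICY = {
    # Training crawlers: opt out.
    "GPTBot": "block", "ClaudeBot": "block", "Anthropic-AI": "block",
    "Google-Extended": "block", "CCBot": "block", "Bytespider": "block",
    "Meta-ExternalAgent": "block",
    # Retrieval and on-demand fetchers: keep for citations.
    "OAI-SearchBot": "allow", "ChatGPT-User": "allow", "PerplexityBot": "allow",
    "Perplexity-User": "allow", "Claude-Web": "allow", "MistralAI-User": "allow",
    # Classic search: never block.
    "Googlebot": "allow", "Bingbot": "allow",
}

def render_robots(policy, sitemap):
    groups = [
        f"User-agent: {bot}\n{'Disallow' if action == 'block' else 'Allow'}: /"
        for bot, action in policy.items()
    ]
    groups.append("User-agent: *\nAllow: /")  # default for unlisted bots
    return "\n\n".join(groups) + f"\n\nSitemap: {sitemap}\n"

print(render_robots(POLICY, "https://acme.com/sitemap.xml"))

Keeping the policy in one data structure means your robots.txt, your CDN rules, and your quarterly audit all read from a single source of truth.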

llms.txt and llms-full.txt. Proposed by Jeremy Howard (fast.ai) in September 2024 as a Markdown manifest at /llms.txt listing your most important pages and content for LLMs. /llms-full.txt is a complete dump of that content in Markdown. Adoption status as of May 2026:

| Engine | Honors llms.txt? | Notes |
| --- | --- | --- |
| ChatGPT Search | No official support | OpenAI has not committed to consuming it |
| Perplexity | No official support | Curated index dominates |
| Google AIO/AI Mode | No | Gary Illyes publicly stated they don’t use it |
| Claude with web | No | Brave-mediated retrieval |
| Manual prompts (“read llms.txt”) | Yes | Only when the user explicitly references it |

Despite the absence of official support, shipping llms.txt is low-cost and forward-compatible. If adoption arrives in late 2026 or 2027, you’ll be ready. If not, no harm done.
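
Since nothing consumes the file yet, the only automation worth having today is a smoke test that it exists and matches the spec’s Markdown shape. A minimal standard-library sketch; acme.com is a placeholder domain:

from urllib.request import urlopen

URL = "https://acme.com/llms.txt"  # placeholder domain; use your own

with urlopen(URL, timeout=10) as resp:
    assert resp.status == 200, f"unexpected status {resp.status}"
    body = resp.read().decode("utf-8")

# The proposed format is Markdown: an H1 title, a blockquote summary,
# then sections of [title](url) link lists.
assert body.lstrip().startswith("# "), "should open with an H1"
assert "\n> " in body, "should include a blockquote summary"
assert "](https://" in body, "should contain absolute Markdown links"
print(f"OK: {len(body)} bytes, {body.count('](')} links")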

Cloudflare AI Bot Management. Cloudflare ships the most usable production system for AI crawler control:

  • AI Audit (Settings → Bots → AI Audit) shows which bots hit your site, request volume, and trends (a DIY log-based approximation is sketched after this list).
  • Pay Per Crawl (launched July 2025): negotiate per-request pricing with crawlers; verified bots can pay to crawl.
  • Block AI Scrapers and Crawlers one-click toggle blocks all known training-only bots while allowing verified search bots.
  • Verified Bot directory distinguishes signed bot identities from spoofers.
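
If you are not behind Cloudflare, you can approximate AI Audit from your own access logs. A rough sketch that tallies hits per AI user-agent from a combined-format log; the filename and the bot list are illustrative:

import re
from collections import Counter

AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
           "Anthropic-AI", "PerplexityBot", "Perplexity-User", "Google-Extended",
           "CCBot", "Bytespider", "Meta-ExternalAgent", "MistralAI-User"]

hits = Counter()
with open("access.log") as log:  # combined log format: UA is the last quoted field
    for line in log:
        match = re.search(r'"([^"]*)"\s*$', line)
        if not match:
            continue
        ua = match.group(1).lower()
        for bot in AI_BOTS:
            if bot.lower() in ua:
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot:20} {count:>8}")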

Visualizing it

flowchart TD
  Bot[Incoming user-agent] --> Verify{Cloudflare verified bot?}
  Verify -->|Spoofed| Block1[Block as scraper]
  Verify -->|Verified| Class{Bot purpose?}
  Class -->|Training only| Train{Allow training?}
  Train -->|No| Block2[robots.txt Disallow]
  Train -->|Yes| Allow1[Allow + log]
  Class -->|Search/retrieval| Allow2[Allow + log]
  Class -->|On-demand user fetch| Allow3[Allow + log]
  Allow2 --> Index[Indexed for citations]
  Allow3 --> Live[Live retrieval to user]
  Index --> Cite[Citation in AI surface]
  Live --> Cite

Bad vs. expert

The bad approach

Copying a viral “block AI” robots.txt without auditing what you’re blocking. Or doing nothing and assuming silence is consent.

# robots.txt
User-agent: *
Disallow: /

# (Or, the equally common copy-paste from a privacy blog:)
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Googlebot
Disallow: /

This fails three ways: it accidentally blocks Googlebot (catastrophe), it blocks OAI-SearchBot (no ChatGPT visibility), and it blocks the on-demand user fetchers (no live citations). Many sites that shipped variants of this discovered six months later that they’d vanished from ChatGPT Search and Perplexity entirely.

The expert approach

Per-bot policy with explicit training-vs-retrieval separation, plus llms.txt as a forward hedge, plus Cloudflare-side enforcement.

# robots.txt — block training, allow retrieval
# Last updated: 2026-04-22

# === Training crawlers: blocked ===
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Anthropic-AI
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# === Retrieval and on-demand: allowed ===
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: MistralAI-User
Allow: /

# === Standard search: always allowed ===
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# === Default for unspecified bots ===
User-agent: *
Allow: /

Sitemap: https://acme.com/sitemap.xml
The llms.txt forward hedge ships as a separate file:

# /llms.txt
# Acme — short context for LLMs

> Acme is a 2026 SaaS company helping mid-market teams automate procurement.

## Core docs

- [Pricing](https://acme.com/pricing): Plans range $19-$199/month
- [Product overview](https://acme.com/product): Procurement automation features
- [Customer stories](https://acme.com/customers): Verified ROI case studies

## Reference

- [Public benchmark data Q1 2026](https://acme.com/benchmark-2026)
- [API documentation](https://acme.com/api)
And Cloudflare-side enforcement. The block below is an illustrative pseudo-config, not a literal wrangler schema: these settings live as toggles and rules in the Cloudflare dashboard.

# Cloudflare configuration (illustrative sketch of dashboard settings)
features:
  ai_audit: enabled
  block_ai_scrapers_and_crawlers: enabled  # auto-blocks unverified scrapers
  pay_per_crawl: optional
verified_bots_allow:
  - OAI-SearchBot
  - PerplexityBot
  - Googlebot
  - Bingbot
  - ClaudeBot  # if you choose to allow training
firewall_rules:
  - if: "(cf.bot_management.score < 30)"
    then: "block"

This wins because it preserves AI visibility on every retrieval surface, blocks training crawlers cleanly, ships llms.txt as a forward hedge, and uses Cloudflare to enforce against spoofed user-agents (where robots.txt is purely advisory).
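
A policy this long deserves a regression test, so a future edit cannot silently reintroduce the blocked-Googlebot failure from the bad example. A minimal sketch using Python’s standard-library parser; the file path and test URL are yours to substitute:

from urllib import robotparser

rp = robotparser.RobotFileParser()
with open("robots.txt") as f:  # the expert file above
    rp.parse(f.read().splitlines())

URL = "https://acme.com/pricing"  # any representative page

# Retrieval surfaces must stay reachable...
for bot in ["Googlebot", "Bingbot", "OAI-SearchBot", "ChatGPT-User",
            "PerplexityBot", "Perplexity-User", "Claude-Web"]:
    assert rp.can_fetch(bot, URL), f"{bot} blocked: visibility regression"

# ...while training crawlers stay out.
for bot in ["GPTBot", "ClaudeBot", "Anthropic-AI", "Google-Extended",
            "CCBot", "Bytespider", "Meta-ExternalAgent"]:
    assert not rp.can_fetch(bot, URL), f"{bot} allowed: training leak"

print("robots.txt policy OK")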

Do this today

  1. Open your robots.txt and audit every User-agent block. Confirm Googlebot, Bingbot, OAI-SearchBot, PerplexityBot, and Claude-Web are allowed. If any are blocked accidentally, fix it before lunch.
  2. Decide your training policy: block training (most enterprises), or allow it (if you want training-stage memorization). Apply consistently to GPTBot, ClaudeBot, Anthropic-AI, Google-Extended, CCBot, Bytespider, Meta-ExternalAgent.
  3. Sign in to Cloudflare (or your CDN) and enable AI Audit. Watch a week of traffic to see which bots actually hit you. Surprises are common — Bytespider crawls aggressively even on small sites.
  4. Toggle Block AI Scrapers and Crawlers in Cloudflare if you want a quick training opt-out without manually maintaining robots.txt.
  5. Ship /llms.txt as a Markdown manifest pointing to your top 10 pages. Use the fast.ai llms.txt spec for format. Optionally also ship /llms-full.txt with a full Markdown export of those pages.
  6. In your CDN logs (or Cloudflare → Logs Engine), set up alerts on OAI-SearchBot, PerplexityBot, Perplexity-User, Claude-Web crawl rates. Sudden drops correlate with citation losses 5–14 days later.
  7. Verify bots properly: spoofed user-agents are common. Use Cloudflare’s verified bot list, or do forward-confirmed reverse-DNS verification yourself (sketched after this list); for OAI-SearchBot, confirm the IP belongs to OpenAI’s published ranges.
  8. If you operate in Europe, document your AI training opt-out for TDM (text and data mining) purposes under EU Copyright Directive Article 4. A machine-readable opt-out via robots.txt + a contractual statement is the current best practice.
  9. For paid content or high-IP-value pages, consider Cloudflare Pay Per Crawl. It allows verified search crawlers to negotiate per-request pricing rather than blanket-blocking.
  10. Schedule a quarterly review of the user-agent table above. New bots launch every quarter; the list will continue to evolve. Track via Dark Visitors (darkvisitors.com), which catalogs new AI crawlers as they appear.
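
For step 7, forward-confirmed reverse DNS is the method Google documents for Googlebot; OpenAI and Perplexity publish IP range lists instead, so for their bots compare the address against the published ranges. A minimal stdlib sketch for the Googlebot case:

import socket

def verify_googlebot(ip):
    """Forward-confirmed reverse DNS, per Google's verification docs."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False  # hostname outside Google's domains: treat as spoofed
    try:
        # Forward-confirm: the claimed hostname must resolve back to the IP.
        return ip in {info[4][0] for info in socket.getaddrinfo(host, None)}
    except socket.gaierror:
        return False

print(verify_googlebot("66.249.66.1"))  # an address in Google's published crawl range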

