Module 029 Intermediate 15 min read

robots.txt Deep Dive

Syntax, directives, common mistakes, blocking AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) — and the difference between blocking and noindexing.

By SEO Mastery Editorial

robots.txt is a plain text file, now more than 30 years old, that sits at the root of your domain and tells well-behaved crawlers where they may and may not go. It is not a security mechanism (anyone can ignore it), but every major search crawler and most AI crawlers honor it. In 2026, it is your first line of editorial control over whether GPTBot, ClaudeBot, PerplexityBot, and Google-Extended can read your work.

TL;DR

  • Disallow blocks crawling, not indexing. A blocked URL can still appear in the index from external links — Google just cannot describe it. To remove a URL from the index, use noindex (and let the page be crawlable so Google can see it).
  • The Robots Exclusion Protocol was only formalized as a standard (RFC 9309) in 2022, codifying syntax that was previously informal. User-agent, Allow, Disallow, and Sitemap are the only directives supported everywhere. Crawl-delay is honored by Bing and Yandex; Google ignores it.
  • AI crawler control is now a CEO-level conversation. Block GPTBot/ClaudeBot/PerplexityBot/Google-Extended/Applebot-Extended and you stay out of training and AI summaries. Allow them and you might earn citations. Pick a position deliberately.

The mental model

robots.txt is the bouncer’s clipboard at the front door. Every crawler that respects the protocol checks the clipboard before entering. The bouncer is honest but not aggressive — there is no enforcement beyond the social contract. Bots that respect the clipboard (Googlebot, Bingbot, GPTBot, ClaudeBot) follow it. Bots that do not are not stopped by the file, only by other systems (WAFs, rate limits, user-agent blocks).
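
Those other systems are outside robots.txt itself, but for completeness, here is a minimal nginx sketch of a user-agent block for a bot that ignores the file. The bot list is illustrative; build yours from your own access logs.

# nginx — inside the server {} block: hard 403 for crawlers that ignore robots.txt
if ($http_user_agent ~* "Bytespider") {
  return 403;
}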

The clipboard’s grammar is simple: a list of crawler names (User-agent), a list of paths each may not enter (Disallow), exceptions to those paths (Allow), and a list of sitemap URLs (Sitemap). Specificity matters — the longest matching rule wins for Google, while Bing uses first-match order.
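
A three-line example makes the difference concrete (paths are illustrative):

User-agent: *
Disallow: /api/
Allow: /api/public/

For Google, /api/public/docs is crawlable because Allow: /api/public/ is the longer match. A parser that takes the first matching rule in file order stops at the Disallow instead, which is why rule order still matters for Bing.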

The most common misunderstanding: people use Disallow thinking it removes a URL from the index. It does not. It only prevents future crawls. To deindex, the page must remain crawlable so Google can read the noindex directive; blocking and noindexing the same URL work against each other, because Google never sees a noindex on a page it cannot fetch.
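
For reference, noindex is expressed either as a meta tag in the page's <head> or as an HTTP response header, and either way the page must stay crawlable for it to be read:

<!-- in the HTML head -->
<meta name="robots" content="noindex">

# or as an HTTP response header (shown in the nginx example further down)
X-Robots-Tag: noindex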

Deep dive: the 2026 reality

The Robots Exclusion Protocol was formalized as RFC 9309 in September 2022. The standard codified what was already common practice: case-insensitive directive names, longest-match semantics for Allow/Disallow, and the * and $ wildcards that had long been a Google extension.
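
In practice the wildcards read like this (paths are illustrative): * matches any sequence of characters and $ anchors the end of the URL.

User-agent: *
Disallow: /*.pdf$       # any URL whose path ends in .pdf
Disallow: /*?session=   # any URL containing ?session=
Disallow: /print/       # plain prefix match: /print/ and everything beneath it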

The current crawler landscape every site owner should know:

| Crawler | User-agent | Purpose | Honors robots.txt | JS execution |
|---|---|---|---|---|
| Googlebot | Googlebot | Google Search index | Yes | Yes (Chrome 124+) |
| Bingbot | bingbot | Bing index, ChatGPT Search, Copilot | Yes | Yes (limited) |
| Google-Extended | Google-Extended | Gemini training, AI Overviews input | Yes (separate token) | N/A — controls usage |
| GPTBot | GPTBot | OpenAI training | Yes | No |
| OAI-SearchBot | OAI-SearchBot | ChatGPT Search retrieval | Yes | No |
| ChatGPT-User | ChatGPT-User | User-initiated browsing in ChatGPT | Yes | Yes |
| ClaudeBot | ClaudeBot | Anthropic training and Claude with web | Yes | No |
| Claude-User | Claude-User | User-initiated Claude searches | Yes | No |
| PerplexityBot | PerplexityBot | Perplexity index | Yes | Limited |
| Perplexity-User | Perplexity-User | User-initiated Perplexity fetches | Disputed (2024 Wired story) | Yes |
| Applebot | Applebot | Spotlight, Siri | Yes | Yes |
| Applebot-Extended | Applebot-Extended | Apple Intelligence training | Yes | N/A |
| CCBot | CCBot | Common Crawl (used as training input) | Yes | No |
| Amazonbot | Amazonbot | Alexa, Amazon search | Yes | No |
| Bytespider | Bytespider | ByteDance / Doubao training | Often ignored | No |
| DuckDuckBot | DuckDuckBot | DuckDuckGo | Yes | No |

Two ongoing 2026 controversies worth naming. Perplexity’s user-initiated crawler (Perplexity-User) was caught ignoring robots.txt in a June 2024 Wired investigation; Perplexity argued user-fetches are not crawls. Cloudflare introduced AI-bot blocking at the edge in response. Google-Extended does not block training of older Google AI models retroactively, only future use.

Visualizing it

flowchart TD
  A[Bot requests page] --> B{robots.txt fetched?}
  B -->|No| C[Bot fetches /robots.txt]
  C --> B
  B -->|Yes, cached| D{Match user-agent token}
  D --> E{Path matches Disallow?}
  E -->|No| F[Crawl allowed]
  E -->|Yes| G{Path matches Allow that's longer?}
  G -->|Yes| F
  G -->|No| H[Crawl blocked]
  F --> I[Fetch page]
  I --> J{noindex meta or X-Robots-Tag?}
  J -->|Yes| K[Crawled, not indexed]
  J -->|No| L[Eligible for index]
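
The same decision can be checked programmatically. A small sketch using Python's standard-library parser; the domain and paths are placeholders, and note that urllib.robotparser implements the original prefix-matching rules rather than Google-style * and $ wildcards, so keep test paths simple or use a dedicated RFC 9309 parser for wildcard-heavy files.

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt (placeholder domain)
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

# Ask the flowchart's question: may this user-agent fetch this path?
for agent in ("Googlebot", "GPTBot", "PerplexityBot"):
    for path in ("/", "/admin/", "/blog/some-post"):
        verdict = rp.can_fetch(agent, "https://example.com" + path)
        print(f"{agent:16} {path:18} {'allowed' if verdict else 'blocked'}")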

Bad vs. expert

The bad approach

Two common bad patterns. First, the panic block — the team wants to remove a section from Google, so they Disallow it:

# example.com/robots.txt — WRONG
User-agent: *
Disallow: /private/
Disallow: /old-promo-2023/
Disallow: /admin/

Six weeks later, site:example.com inurl:old-promo-2023 still shows results, because external links to those URLs make Google list them in the index — without descriptions — even though Google cannot crawl them. The fix the team wanted required a noindex (which requires crawlability), not a block.

Second, the accidental sitewide block — a developer copies a staging robots.txt to production:

# DO NOT DEPLOY
User-agent: *
Disallow: /

This single line, accidentally pushed, stops all crawling of the site; the URLs do not drop out of the index immediately (that is the Disallow-vs-noindex distinction again), but snippets degrade, new content goes undiscovered, and rankings and traffic collapse within days. Real-world cases: Asos in 2015 (10-day outage cost ~£100M in lost organic), and the November 2023 incident where a major US news site blocked / for 18 hours.

The expert approach

A defensible 2026 production robots.txt:

# example.com/robots.txt
# Last reviewed: 2026-04-15

# All search crawlers: allow most paths, block noisy infrastructure
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /cart/
Disallow: /account/
Allow: /api/og-image/$

# Googlebot obeys only the most specific matching group, so restate the blocks here.
# Within a group, the longest matching rule wins for Google, so the Allow beats Disallow: /api/.
User-agent: Googlebot
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /cart/
Disallow: /account/
Allow: /api/og-image/

# AI training crawlers — editorial decision: blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# AI search retrieval crawlers — allowed for citations
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Sitemaps
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

Why this works: faceted URL parameters (?sort=, ?filter=) and infinite-space paths are blocked to preserve crawl budget. AI training is denied while AI search retrieval is permitted — meaning content can earn citations in ChatGPT Search and Perplexity but is not used to train future models. Sitemaps are listed at the bottom for crawler discovery.

To deindex /old-promo-2023/ properly, you would not block it in robots. You would let Googlebot crawl it and serve a noindex:

# nginx config for the deindex case
location /old-promo-2023/ {
  add_header X-Robots-Tag "noindex" always;
  try_files $uri $uri/ =404;
}

After Googlebot recrawls and processes the noindex (typically 2–4 weeks for sites on a daily crawl cadence), the URLs drop from the index. Once they are gone, you can choose to 410 them, redirect them, or block them.
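
Once they are gone, the 410 option is a one-liner (same illustrative path):

# nginx — after deindexing, retire the section permanently
location /old-promo-2023/ {
  return 410;
}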

Do this today

  1. Visit https://yourdomain.com/robots.txt and read it line by line. If the file does not exist and your server returns 404, crawlers treat that as permission to crawl everything; not catastrophic, but you lose the chance to declare sitemaps and fence off junk paths. Create a minimal one with User-agent: * and your sitemap URL.
  2. Check Google's robots.txt report (GSC > Settings > Crawling > robots.txt) to confirm the file Google actually fetched, then spot-check important URLs with the URL Inspection tool and confirm the crawl verdict matches your intent. Any Disallowed revenue URL is a five-alarm fire.
  3. Test against Bing Webmaster Tools > Configure > robots.txt Tester as well. Bing applies first-match order rather than longest-match, so a file that works for Google can misfire on Bing.
  4. List every AI crawler user-agent you have a position on. Decide explicitly: is your content for training? For retrieval/citation only? For neither? Document the answer. The decisions should be made by the editorial lead, not the developer.
  5. If blocking AI training, add explicit User-agent blocks for GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot, and Bytespider. Per OpenAI's GPTBot documentation, expect roughly 24 hours for a robots.txt change to be picked up; after that, verify in your server logs (OpenAI publishes GPTBot's IP ranges) that crawling actually stops.
  6. Audit for Crawl-delay directives. Google ignores them; Bing and Yandex honor them. If yours is set high (e.g. Crawl-delay: 30), Bing crawls less, which slows ChatGPT Search and Copilot indexation. Lower it or remove it unless you have genuine host-load reasons.
  7. Run a Screaming Frog crawl with Configuration > robots.txt > Settings set to Respect robots.txt, then re-crawl with Ignore robots.txt. Diff the two URL lists. Any URL in the second list but not the first is currently blocked — confirm intent.
  8. Set up GSC alerts for Settings > Crawl Stats > robots.txt fetch errors. A robots.txt that 5xx’s is treated as “fetch unsuccessful” and Google may stop crawling temporarily.
  9. Add a CI check in your deploy pipeline that fails if robots.txt contains the line Disallow: / for User-agent: *. One regex prevents the entire failure mode.
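
A minimal sketch of that check (step 9), written to avoid flagging intentional per-bot blocks like the GPTBot group earlier. It assumes robots.txt sits at the repo root and exits non-zero to fail the build.

# check_robots.py — fail the deploy if robots.txt blocks the whole site for all crawlers
import re
import sys

agents, in_rules = [], False
for raw in open("robots.txt", encoding="utf-8"):
    line = raw.split("#", 1)[0].strip()
    if not line:
        continue
    ua = re.match(r"(?i)user-agent:\s*(.+)", line)
    if ua:
        if in_rules:                  # a new group starts after the previous group's rules
            agents, in_rules = [], False
        agents.append(ua.group(1).strip())
        continue
    in_rules = True
    if "*" in agents and re.fullmatch(r"(?i)disallow:\s*/", line):
        sys.exit("FATAL: robots.txt contains 'Disallow: /' for 'User-agent: *'; refusing to deploy")
print("robots.txt check passed")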

