robots.txt Deep Dive
Syntax, directives, common mistakes, blocking AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) — and the difference between blocking and noindexing.
robots.txt is a 30-year-old plain text file at the root of your domain that tells well-behaved crawlers where they may and may not go. It is not a security mechanism — anyone can ignore it — but every major search and AI crawler does honor it. In 2026, it is your first line of editorial control over whether GPTBot, ClaudeBot, PerplexityBot, and Google-Extended can read your work.
TL;DR
- `Disallow` blocks crawling, not indexing. A blocked URL can still appear in the index via external links — Google just cannot describe it. To remove a URL from the index, use `noindex` (and leave the page crawlable so Google can see it).
- The REP was formalized as RFC 9309 in 2022, codifying syntax that was previously informal. `User-agent`, `Allow`, `Disallow`, and `Sitemap` are the only universally supported directives. `Crawl-delay` is honored by Bing and Yandex; Google ignores it.
- AI crawler control is now a CEO-level conversation. Block `GPTBot`/`ClaudeBot`/`PerplexityBot`/`Google-Extended`/`Applebot-Extended` and you stay out of training and AI summaries. Allow them and you might earn citations. Pick a position deliberately.
The mental model
robots.txt is the bouncer’s clipboard at the front door. Every crawler that respects the protocol checks the clipboard before entering. The bouncer is honest but not aggressive — there is no enforcement beyond the social contract. Bots that respect the clipboard (Googlebot, Bingbot, GPTBot, ClaudeBot) follow it. Bots that do not are not stopped by the file, only by other systems (WAFs, rate limits, user-agent blocks).
The clipboard’s grammar is simple: a list of crawler names (User-agent), a list of paths each may not enter (Disallow), exceptions to those paths (Allow), and a list of sitemap URLs (Sitemap). Specificity matters — the longest matching rule wins for Google, while Bing uses first-match order.
The most common misunderstanding: people use `Disallow` thinking it removes a URL from the index. It does not — it only prevents future crawls. To deindex a page, it must remain crawlable so Google can read the `noindex` directive; blocking and noindexing the same URL are mutually exclusive.
Deep dive: the 2026 reality
The Robots Exclusion Protocol was formalized as RFC 9309 in September 2022. The standard codified what was already common practice — case-insensitive directive names, longest-match semantics for Allow/Disallow, support for * and $ wildcards in Google’s implementation.
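Those longest-match semantics decide every Allow/Disallow conflict, so they are worth internalizing. Here is a minimal, illustrative sketch of the resolution rule (not a full RFC 9309 parser — no `*` or `$` wildcard support):

```python
def crawl_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Resolve Allow/Disallow per RFC 9309 longest-match semantics:
    the longest matching pattern wins; on a tie, Allow wins;
    no matching rule at all means the path is allowed."""
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if path.startswith(pattern):
            if len(pattern) > best_len:
                best_len, allowed = len(pattern), directive == "allow"
            elif len(pattern) == best_len and directive == "allow":
                allowed = True  # ties resolve in favor of Allow
    return allowed

rules = [("disallow", "/api/"), ("allow", "/api/og-image/")]
print(crawl_allowed("/api/og-image/hero.png", rules))  # True — the Allow pattern is longer
print(crawl_allowed("/api/users", rules))              # False — only the Disallow matches
```

This is why a specific `Allow` can carve an exception out of a broader `Disallow` for Google — and why Bing's first-match evaluation can give a different answer for the very same file.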
The current crawler landscape every site owner should know:
| Crawler | User-agent | Purpose | Honors robots.txt | JS execution |
|---|---|---|---|---|
| Googlebot | Googlebot | Google Search index | Yes | Yes (Chrome 124+) |
| Bingbot | bingbot | Bing index, ChatGPT Search, Copilot | Yes | Yes (limited) |
| Google-Extended | Google-Extended | Gemini training, AI Overviews input | Yes (separate token) | N/A — controls usage |
| GPTBot | GPTBot | OpenAI training | Yes | No |
| OAI-SearchBot | OAI-SearchBot | ChatGPT Search retrieval | Yes | No |
| ChatGPT-User | ChatGPT-User | User-initiated browsing in ChatGPT | Yes | Yes |
| ClaudeBot | ClaudeBot | Anthropic training and Claude with web | Yes | No |
| Claude-User | Claude-User | User-initiated Claude searches | Yes | No |
| PerplexityBot | PerplexityBot | Perplexity index | Yes | Limited |
| Perplexity-User | Perplexity-User | User-initiated Perplexity fetches | Disputed (2024 Wired story) | Yes |
| Applebot | Applebot | Spotlight, Siri | Yes | Yes |
| Applebot-Extended | Applebot-Extended | Apple Intelligence training | Yes | N/A |
| CCBot | CCBot | Common Crawl (used as training input) | Yes | No |
| Amazonbot | Amazonbot | Alexa, Amazon search | Yes | No |
| Bytespider | Bytespider | ByteDance / Doubao training | Often ignored | No |
| DuckDuckBot | DuckDuckBot | DuckDuckGo | Yes | No |
Two ongoing 2026 controversies worth naming. Perplexity’s user-initiated crawler (Perplexity-User) was caught ignoring robots.txt in a June 2024 Wired investigation; Perplexity argued user-fetches are not crawls. Cloudflare introduced AI-bot blocking at the edge in response. Google-Extended does not block training of older Google AI models retroactively, only future use.
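A quick way to ground this table in reality is to count how often the crawler tokens appear in your own access logs. A minimal sketch — the two log lines below are invented for illustration, and real user-agent strings are longer:

```python
from collections import Counter

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot",
           "Google-Extended", "CCBot", "Bytespider", "Amazonbot"]

# Made-up sample lines in combined log format (user-agent is the last quoted field)
LOG = '''\
1.2.3.4 - - [10/Apr/2026:12:00:01 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0; compatible; GPTBot/1.0; +https://openai.com/gptbot"
5.6.7.8 - - [10/Apr/2026:12:00:02 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"
'''

hits = Counter()
for line in LOG.splitlines():
    user_agent = line.rsplit('"', 2)[-2]  # text between the last pair of quotes
    for bot in AI_BOTS:
        if bot.lower() in user_agent.lower():
            hits[bot] += 1

print(hits.most_common())
```

Run the same loop over a day of real logs and you will know which of these crawlers actually visit you — and which of your robots.txt positions are purely theoretical.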
Visualizing it
```mermaid
flowchart TD
    A[Bot requests page] --> B{robots.txt fetched?}
    B -->|No| C[Bot fetches /robots.txt]
    C --> B
    B -->|Yes, cached| D{Match user-agent token}
    D --> E{Path matches Disallow?}
    E -->|No| F[Crawl allowed]
    E -->|Yes| G{Path matches Allow that's longer?}
    G -->|Yes| F
    G -->|No| H[Crawl blocked]
    F --> I[Fetch page]
    I --> J{noindex meta or X-Robots-Tag?}
    J -->|Yes| K[Crawled, not indexed]
    J -->|No| L[Eligible for index]
```
Bad vs. expert
The bad approach
Two common bad patterns. First, the panic block — the team wants to remove a section from Google, so they Disallow it:
```
# example.com/robots.txt — WRONG
User-agent: *
Disallow: /private/
Disallow: /old-promo-2023/
Disallow: /admin/
```
Six weeks later, `site:example.com inurl:old-promo-2023` still shows results: external links to those URLs keep them in the index — listed without descriptions — even though Google cannot crawl them. The fix the team actually wanted required a `noindex` (which requires crawlability), not a block.
Second, the accidental sitewide block — a developer copies a staging robots.txt to production:
```
# DO NOT DEPLOY
User-agent: *
Disallow: /
```
This single line, accidentally pushed to production, cuts off crawling sitewide: snippets disappear, rankings collapse, and pages start dropping out of the index within days. Real-world cases: Asos in 2015 (10-day outage cost ~£100M in lost organic), and the November 2023 incident where a major US news site blocked / for 18 hours.
The expert approach
A defensible 2026 production robots.txt:
```
# example.com/robots.txt
# Last reviewed: 2026-04-15

# All search crawlers: allow most paths, block noisy infrastructure
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /cart/
Disallow: /account/
Allow: /api/og-image/$

# Googlebot specific (longest-match wins for Google)
User-agent: Googlebot
Disallow: /admin/
Disallow: /api/
Allow: /api/og-image/

# AI training crawlers — editorial decision: blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# AI search retrieval crawlers — allowed for citations
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Sitemaps
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
```
Why this works: faceted URL parameters (`?sort=`, `?filter=`) and infinite-space paths are blocked to preserve crawl budget. AI training is denied while AI search retrieval is permitted — meaning content can earn citations in ChatGPT Search and Perplexity but is not used to train future models. Sitemaps are listed at the bottom for crawler discovery.
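Before deploying a file like this, you can smoke-test the policy with Python's standard-library parser. Two caveats: `urllib.robotparser` uses first-match-within-group semantics and does not expand `*`/`$` wildcards, so it is a sanity check, not a Google-accurate simulator. The file below is a trimmed stand-in for the full example above:

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

# (user-agent, URL, expected verdict) — encode your editorial intent as assertions
checks = [
    ("GPTBot",        "https://example.com/article", False),  # training crawler blocked
    ("OAI-SearchBot", "https://example.com/article", True),   # retrieval crawler allowed
    ("Googlebot",     "https://example.com/admin/",  False),  # falls back to the * group
    ("Googlebot",     "https://example.com/blog/",   True),
]
for agent, url, expected in checks:
    assert rp.can_fetch(agent, url) == expected, (agent, url)
print("all policy checks passed")
```

Keeping a table of expectations like `checks` next to the file makes policy regressions visible the moment someone edits robots.txt.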
To deindex `/old-promo-2023/` properly, you would not block it in robots.txt. You would let Googlebot crawl it and serve a `noindex`:
```nginx
# nginx config for the deindex case
location /old-promo-2023/ {
    add_header X-Robots-Tag "noindex" always;
    try_files $uri $uri/ =404;
}
```
After Googlebot recrawls and processes the noindex (typically 2–4 weeks for sites on a daily crawl cadence), the URLs drop from the index. Once they are gone, you can choose to 410 them, redirect them, or block them.
Do this today
- Visit `https://yourdomain.com/robots.txt` and read it line by line. If the file does not exist, your server returns 404 — not catastrophic, but worth fixing for crawl efficiency. Create a minimal one with `User-agent: *` and your sitemap URL.
- Test against Google's robots.txt report (GSC > Settings > Crawling > robots.txt). Check any URL on your site and confirm the verdict matches your intent. Any `Disallowed` revenue URL is a five-alarm fire.
- Test against Bing Webmaster Tools > Configure > robots.txt Tester as well. Bing applies first-match order rather than longest-match — what works for Google can misfire on Bing.
- List every AI crawler user-agent you have a position on. Decide explicitly: is your content for training? For retrieval/citation only? For neither? Document the answer. These decisions belong to the editorial lead, not the developer.
- If blocking AI training, add explicit `User-agent` blocks for `GPTBot`, `ClaudeBot`, `Google-Extended`, `Applebot-Extended`, `CCBot`, and `Bytespider`. Verify the exact token spellings against each vendor's crawler documentation before you deploy.
- Audit for `Crawl-delay` directives. Google ignores them; Bing and Yandex honor them. If yours is set high (e.g. `Crawl-delay: 30`), Bing crawls less, which slows ChatGPT Search and Copilot indexation. Lower it or remove it unless you have host-load reasons.
- Run a Screaming Frog crawl with Configuration > robots.txt > Settings set to Respect robots.txt, then re-crawl with Ignore robots.txt. Diff the two URL lists. Any URL in the second list but not the first is currently blocked — confirm that is intentional.
- Set up GSC alerts for Settings > Crawl Stats > robots.txt fetch errors. A robots.txt that returns 5xx is treated as "fetch unsuccessful," and Google may stop crawling the site temporarily.
- Add a CI check to your deploy pipeline that fails if `robots.txt` contains `Disallow: /` under `User-agent: *`. One check prevents the entire failure mode.
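That CI guard is safer as a few lines of Python than as a one-line regex, because consecutive `User-agent` lines share a single group and a naive regex misses that. A minimal sketch:

```python
import re

def blocks_entire_site(robots_txt: str) -> bool:
    """Return True if any group containing `User-agent: *`
    also contains a bare `Disallow: /` (whole-site block)."""
    agents, collecting = set(), True
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ua = re.match(r"(?i)user-agent:\s*(\S+)", line)
        if ua:
            if not collecting:  # a rule line ended the previous group
                agents, collecting = set(), True
            agents.add(ua.group(1))
            continue
        collecting = False
        if re.match(r"(?i)disallow:\s*/\s*$", line) and "*" in agents:
            return True
    return False

assert blocks_entire_site("User-agent: *\nDisallow: /")
assert not blocks_entire_site("User-agent: *\nDisallow: /admin/")
assert not blocks_entire_site("User-agent: GPTBot\nDisallow: /")  # deliberate AI block is fine
```

Wire it into the pipeline so a `True` result fails the deploy; the third assertion shows it will not false-alarm on deliberate AI-crawler blocks.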