Module 029 Intermediate 15 min read

robots.txt Deep Dive

Syntax, directives, common mistakes, blocking AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) — and the difference between blocking and noindexing.

By SEO Mastery Editorial

robots.txt is a plain text file, now more than 30 years old, that sits at the root of your domain and tells well-behaved crawlers where they may and may not go. It is not a security mechanism (anyone can ignore it), but every major search crawler and most AI crawlers honor it. In 2026, it is your first line of editorial control over whether GPTBot, ClaudeBot, PerplexityBot, and Google-Extended can read your work.

TL;DR

  • Disallow blocks crawling, not indexing. A blocked URL can still appear in the index from external links — Google just cannot describe it. To remove a URL from the index, use noindex (and let the page be crawlable so Google can see it).
  • The Robots Exclusion Protocol was only formalized as a standard (RFC 9309) in 2022, codifying syntax that was previously informal. User-agent, Allow, Disallow, and Sitemap are the only directives supported everywhere. Crawl-delay is honored by Bing and Yandex; Google ignores it.
  • AI crawler control is now a CEO-level conversation. Block GPTBot/ClaudeBot/PerplexityBot/Google-Extended/Applebot-Extended and you stay out of training and AI summaries. Allow them and you might earn citations. Pick a position deliberately.

The mental model

robots.txt is the bouncer’s clipboard at the front door. Every crawler that respects the protocol checks the clipboard before entering. The bouncer is honest but not aggressive — there is no enforcement beyond the social contract. Bots that respect the clipboard (Googlebot, Bingbot, GPTBot, ClaudeBot) follow it. Bots that do not are not stopped by the file, only by other systems (WAFs, rate limits, user-agent blocks).
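
Those other systems are outside robots.txt itself, but for completeness, here is a minimal nginx sketch of a user-agent block for a bot that ignores the file. The bot list is illustrative; build yours from your own access logs.

# nginx — inside the server {} block: hard 403 for crawlers that ignore robots.txt
if ($http_user_agent ~* "Bytespider") {
  return 403;
}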

The clipboard’s grammar is simple: a list of crawler names (User-agent), a list of paths each may not enter (Disallow), exceptions to those paths (Allow), and a list of sitemap URLs (Sitemap). Specificity matters — the longest matching rule wins for Google, while Bing uses first-match order.
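
A three-line example makes the difference concrete (paths are illustrative):

User-agent: *
Disallow: /api/
Allow: /api/public/

For Google, /api/public/docs is crawlable because Allow: /api/public/ is the longer match. A parser that takes the first matching rule in file order stops at the Disallow instead, which is why rule order still matters for Bing.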

The most common misunderstanding: people use Disallow thinking it removes a URL from the index. It does not. It only prevents future crawls. To deindex, the page must remain crawlable so Google can read the noindex directive; blocking and noindexing the same URL work against each other, because Google never sees a noindex on a page it cannot fetch.
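
For reference, noindex is expressed either as a meta tag in the page's <head> or as an HTTP response header, and either way the page must stay crawlable for it to be read:

<!-- in the HTML head -->
<meta name="robots" content="noindex">

# or as an HTTP response header (shown in the nginx example further down)
X-Robots-Tag: noindex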

Deep dive: the 2026 reality

The Robots Exclusion Protocol was formalized as RFC 9309 in September 2022. The standard codified what was already common practice: case-insensitive directive names, longest-match semantics for Allow/Disallow, and the * and $ wildcards that had long been a Google extension.
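
In practice the wildcards read like this (paths are illustrative): * matches any sequence of characters and $ anchors the end of the URL.

User-agent: *
Disallow: /*.pdf$       # any URL whose path ends in .pdf
Disallow: /*?session=   # any URL containing ?session=
Disallow: /print/       # plain prefix match: /print/ and everything beneath it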

The current crawler landscape every site owner should know:

| Crawler | User-agent | Purpose | Honors robots.txt | JS execution |
|---|---|---|---|---|
| Googlebot | Googlebot | Google Search index | Yes | Yes (Chrome 124+) |
| Bingbot | bingbot | Bing index, ChatGPT Search, Copilot | Yes | Yes (limited) |
| Google-Extended | Google-Extended | Gemini training, AI Overviews input | Yes (separate token) | N/A — controls usage |
| GPTBot | GPTBot | OpenAI training | Yes | No |
| OAI-SearchBot | OAI-SearchBot | ChatGPT Search retrieval | Yes | No |
| ChatGPT-User | ChatGPT-User | User-initiated browsing in ChatGPT | Yes | Yes |
| ClaudeBot | ClaudeBot | Anthropic training and Claude with web | Yes | No |
| Claude-User | Claude-User | User-initiated Claude searches | Yes | No |
| PerplexityBot | PerplexityBot | Perplexity index | Yes | Limited |
| Perplexity-User | Perplexity-User | User-initiated Perplexity fetches | Disputed (2024 Wired story) | Yes |
| Applebot | Applebot | Spotlight, Siri | Yes | Yes |
| Applebot-Extended | Applebot-Extended | Apple Intelligence training | Yes | N/A |
| CCBot | CCBot | Common Crawl (used as training input) | Yes | No |
| Amazonbot | Amazonbot | Alexa, Amazon search | Yes | No |
| Bytespider | Bytespider | ByteDance / Doubao training | Often ignored | No |
| DuckDuckBot | DuckDuckBot | DuckDuckGo | Yes | No |

Two ongoing 2026 controversies worth naming. Perplexity’s user-initiated crawler (Perplexity-User) was caught ignoring robots.txt in a June 2024 Wired investigation; Perplexity argued user-fetches are not crawls. Cloudflare introduced AI-bot blocking at the edge in response. Google-Extended does not block training of older Google AI models retroactively, only future use.

Visualizing it

flowchart TD
  A[Bot requests page] --> B{robots.txt fetched?}
  B -->|No| C[Bot fetches /robots.txt]
  C --> B
  B -->|Yes, cached| D{Match user-agent token}
  D --> E{Path matches Disallow?}
  E -->|No| F[Crawl allowed]
  E -->|Yes| G{Path matches Allow that's longer?}
  G -->|Yes| F
  G -->|No| H[Crawl blocked]
  F --> I[Fetch page]
  I --> J{noindex meta or X-Robots-Tag?}
  J -->|Yes| K[Crawled, not indexed]
  J -->|No| L[Eligible for index]
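
The same decision can be checked programmatically. A small sketch using Python's standard-library parser; the domain and paths are placeholders, and note that urllib.robotparser implements the original prefix-matching rules rather than Google-style * and $ wildcards, so keep test paths simple or use a dedicated RFC 9309 parser for wildcard-heavy files.

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt (placeholder domain)
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

# Ask the flowchart's question: may this user-agent fetch this path?
for agent in ("Googlebot", "GPTBot", "PerplexityBot"):
    for path in ("/", "/admin/", "/blog/some-post"):
        verdict = rp.can_fetch(agent, "https://example.com" + path)
        print(f"{agent:16} {path:18} {'allowed' if verdict else 'blocked'}")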

Bad vs. expert

The bad approach

Two common bad patterns. First, the panic block — the team wants to remove a section from Google, so they Disallow it:

# example.com/robots.txt — WRONG
User-agent: *
Disallow: /private/
Disallow: /old-promo-2023/
Disallow: /admin/

Six weeks later, site:example.com inurl:old-promo-2023 still shows results, because external links to those URLs make Google list them in the index — without descriptions — even though Google cannot crawl them. The fix the team wanted required a noindex (which requires crawlability), not a block.

Second, the accidental sitewide block — a developer copies a staging robots.txt to production:

# DO NOT DEPLOY
User-agent: *
Disallow: /

This single line, accidentally pushed, stops all crawling of the site; the URLs do not drop out of the index immediately (that is the Disallow-vs-noindex distinction again), but snippets degrade, new content goes undiscovered, and rankings and traffic collapse within days. Real-world cases: Asos in 2015 (10-day outage cost ~£100M in lost organic), and the November 2023 incident where a major US news site blocked / for 18 hours.

The expert approach

A defensible 2026 production robots.txt:

# example.com/robots.txt
# Last reviewed: 2026-04-15

# All search crawlers: allow most paths, block noisy infrastructure
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /cart/
Disallow: /account/
Allow: /api/og-image/$

# Googlebot obeys only the most specific matching group, so restate the blocks here.
# Within a group, the longest matching rule wins for Google, so the Allow beats Disallow: /api/.
User-agent: Googlebot
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /cart/
Disallow: /account/
Allow: /api/og-image/

# AI training crawlers — editorial decision: blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# AI search retrieval crawlers — allowed for citations
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Sitemaps
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

Why this works: faceted URL parameters (?sort=, ?filter=) and infinite-space paths are blocked to preserve crawl budget. AI training is denied while AI search retrieval is permitted — meaning content can earn citations in ChatGPT Search and Perplexity but is not used to train future models. Sitemaps are listed at the bottom for crawler discovery.

To deindex /old-promo-2023/ properly, you would not block it in robots. You would let Googlebot crawl it and serve a noindex:

# nginx config for the deindex case
location /old-promo-2023/ {
  add_header X-Robots-Tag "noindex" always;
  try_files $uri $uri/ =404;
}

After Googlebot recrawls and processes the noindex (typically 2–4 weeks for sites on a daily crawl cadence), the URLs drop from the index. Once they are gone, you can choose to 410 them, redirect them, or block them.
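
Once they are gone, the 410 option is a one-liner (same illustrative path):

# nginx — after deindexing, retire the section permanently
location /old-promo-2023/ {
  return 410;
}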

Do this today

  1. Visit https://yourdomain.com/robots.txt and read it line by line. If the file does not exist and your server returns 404, crawlers treat that as permission to crawl everything; not catastrophic, but you lose the chance to declare sitemaps and fence off junk paths. Create a minimal one with User-agent: * and your sitemap URL.
  2. Check Google's robots.txt report (GSC > Settings > Crawling > robots.txt) to confirm the file Google actually fetched, then spot-check important URLs with the URL Inspection tool and confirm the crawl verdict matches your intent. Any Disallowed revenue URL is a five-alarm fire.
  3. Test against Bing Webmaster Tools > Configure > robots.txt Tester as well. Bing applies first-match order rather than longest-match, so a file that works for Google can misfire on Bing.
  4. List every AI crawler user-agent you have a position on. Decide explicitly: is your content for training? For retrieval/citation only? For neither? Document the answer. The decisions should be made by the editorial lead, not the developer.
  5. If blocking AI training, add explicit User-agent blocks for GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot, and Bytespider. Per OpenAI's GPTBot documentation, expect roughly 24 hours for a robots.txt change to be picked up; after that, verify in your server logs (OpenAI publishes GPTBot's IP ranges) that crawling actually stops.
  6. Audit for Crawl-delay directives. Google ignores them; Bing and Yandex honor them. If yours is set high (e.g. Crawl-delay: 30), Bing crawls less, which slows ChatGPT Search and Copilot indexation. Lower it or remove it unless you have genuine host-load reasons.
  7. Run a Screaming Frog crawl with Configuration > robots.txt > Settings set to Respect robots.txt, then re-crawl with Ignore robots.txt. Diff the two URL lists. Any URL in the second list but not the first is currently blocked — confirm intent.
  8. Set up GSC alerts for Settings > Crawl Stats > robots.txt fetch errors. A robots.txt that 5xx’s is treated as “fetch unsuccessful” and Google may stop crawling temporarily.
  9. Add a CI check in your deploy pipeline that fails if robots.txt contains the line Disallow: / for User-agent: *. One regex prevents the entire failure mode.
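
A minimal sketch of that check (step 9), written to avoid flagging intentional per-bot blocks like the GPTBot group earlier. It assumes robots.txt sits at the repo root and exits non-zero to fail the build.

# check_robots.py — fail the deploy if robots.txt blocks the whole site for all crawlers
import re
import sys

agents, in_rules = [], False
for raw in open("robots.txt", encoding="utf-8"):
    line = raw.split("#", 1)[0].strip()
    if not line:
        continue
    ua = re.match(r"(?i)user-agent:\s*(.+)", line)
    if ua:
        if in_rules:                  # a new group starts after the previous group's rules
            agents, in_rules = [], False
        agents.append(ua.group(1).strip())
        continue
    in_rules = True
    if "*" in agents and re.fullmatch(r"(?i)disallow:\s*/", line):
        sys.exit("FATAL: robots.txt contains 'Disallow: /' for 'User-agent: *'; refusing to deploy")
print("robots.txt check passed")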

