Module 032 Intermediate 13 min read

Meta Robots & X-Robots-Tag

noindex, nofollow, noarchive, nosnippet, max-snippet, max-image-preview. Meta robots tag vs HTTP header — when each is the right choice.

By SEO Mastery Editorial

The <meta name="robots"> tag and the X-Robots-Tag HTTP header are how you tell crawlers what to do after they fetch a page — index it or not, follow links or not, show a snippet or not, cache or not. They are directives, not hints. When set correctly, Google obeys.

TL;DR

  • noindex is the right tool to remove a URL from the index. Disallow in robots.txt blocks crawling but can leave the bare URL listed in the index. Use noindex and keep the page crawlable so Google can read the directive.
  • X-Robots-Tag HTTP header lets you control non-HTML resources. PDFs, images, CSV downloads, video files cannot carry meta tags — but they can carry HTTP headers. Same syntax, different transport.
  • max-snippet, max-image-preview, and max-video-preview matter more in 2026 because they constrain how Google AI Overviews and Bing Copilot can summarize your content. Setting max-snippet:-1 (or omitting the directive) is implicit consent to be summarized.

The mental model

Meta robots and X-Robots-Tag are the page’s instructions to the visiting librarian, delivered after the librarian has already read the book. The book exists; the visit happened. The instructions say: “do not catalog this”, or “catalog it but do not show a preview”, or “do not list any links from this page”. Any well-behaved crawler honors them — Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, Applebot.

The <meta name="robots"> tag goes in the <head> of an HTML document. The X-Robots-Tag is the same syntax served as an HTTP response header — invisible in the rendered page but present in every response. They are functionally equivalent for HTML; only the header works for non-HTML files.
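
For instance, the same directive in both transports:

<!-- Delivered in the HTML <head> -->
<meta name="robots" content="noindex, nofollow">

# Or delivered as an HTTP response header (works for any file type)
X-Robots-Tag: noindex, nofollow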

Deep dive: the 2026 reality

The full directive vocabulary supported by Google in 2026:

  • index / noindex – Allow or block index inclusion
  • follow / nofollow – Pass or block link equity from this page’s outbound links
  • noarchive – Do not show the cached link
  • nosnippet – Do not show a text snippet or video preview
  • max-snippet:N – Limit text snippet to N characters; -1 = no limit
  • max-image-preview:none|standard|large – Limit image preview size
  • max-video-preview:N – Limit video preview to N seconds; -1 = no limit
  • notranslate – Do not offer Google Translate
  • noimageindex – Do not index images on this page
  • unavailable_after:DATE – Drop from index after a specific date (RFC 850 or ISO 8601)
  • indexifembedded – Index when embedded in a parent page (used with noindex on the embedded page)
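
For example, indexifembedded only has an effect alongside noindex; a typical sketch is to serve both on an embeddable resource (say, a widget page meant to appear only inside an iframe), most naturally as a header:

# HTTP response header on the embeddable resource
X-Robots-Tag: noindex, indexifembedded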

Per-crawler targeting works by replacing robots with the specific bot name. Google’s documented tokens are googlebot, googlebot-news, and googlebot-image; its documentation uses otherbot as a placeholder for any other crawler name. Bing accepts bingbot. AI crawlers do not yet have widely-honored per-crawler meta directives — control them via robots.txt.

Google-Extended is unusual: it is a robots.txt token only, not a meta robots token. To opt out of Gemini training and AI Overviews input on a per-page basis, you cannot use <meta name="google-extended" content="noindex"> — that is not a recognized directive. Use robots.txt for sitewide control or live with the binary.
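
The sitewide control therefore lives in robots.txt; a minimal sketch:

# robots.txt: opt the whole site out of Google-Extended
User-agent: Google-Extended
Disallow: /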

The 2026 reality on AI snippets: Google’s AI Overviews respect nosnippet and max-snippet. Setting max-snippet:0 removes your page from AI Overview citations. Setting max-snippet:-1 (or omitting the directive) is implicit consent to be summarized. PerplexityBot and OAI-SearchBot do not currently honor max-snippet — they read your full page anyway because they fetch on user query rather than pre-cache.

Visualizing it

flowchart TD
  A[Crawler fetches URL] --> B{HTTP response headers}
  B --> C{X-Robots-Tag present?}
  C -->|Yes, noindex| Z[Drop from index]
  C -->|Yes, other| D[Apply directives]
  C -->|No| E[Parse HTML head]
  E --> F{meta name=robots?}
  F -->|Yes, noindex| Z
  F -->|Yes, other| D
  F -->|No| G[Default: index, follow, max-snippet:-1]
  D --> H[Index with constraints]
  G --> H

Bad vs. expert

The bad approach

Three failure patterns. First, putting noindex in robots.txt (a nonstandard, now-removed Google extension):

# robots.txt — DOES NOT WORK
User-agent: *
Noindex: /private/

Google removed support for Noindex: in robots.txt on September 1, 2019. It still appears in legacy configs and silently does nothing. The team thinks they have deindexed /private/; they have not.

Second, blocking /private/ in robots.txt and adding noindex to the page:

# robots.txt
User-agent: *
Disallow: /private/
<!-- on /private/something -->
<meta name="robots" content="noindex">

This is contradictory: Google cannot crawl the page (blocked) and therefore cannot read the noindex directive. The URL stays in the index — Google will display the URL with the message “A description for this result is not available because of this site’s robots.txt” — for as long as external links point to it.

Third, using noindex on paginated category pages (page 2, page 3, etc.):

<!-- on /blog?page=2 -->
<meta name="robots" content="noindex,follow">

Google’s John Mueller confirmed in 2017 (and again in 2024) that noindex,follow long-term degrades to noindex,nofollow. Google reasonably concludes a permanently noindexed page is a low-value source of link signal. Use self-canonicals for pagination, not noindex.
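
A sketch of that pagination pattern (the URL is a placeholder):

<!-- on /blog?page=2: indexable, with a self-referencing canonical -->
<link rel="canonical" href="https://example.com/blog?page=2">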

The expert approach

For a page you want deindexed, serve noindex via meta tag (HTML pages) or X-Robots-Tag header (non-HTML or universal):

<!-- HTML page deindexing -->
<head>
  <meta name="robots" content="noindex,nofollow">
</head>

For PDFs, downloads, or CSV files, set the header server-side. Nginx:

location ~* \.(pdf|csv|xls|xlsx)$ {
  add_header X-Robots-Tag "noindex, nosnippet" always;
  try_files $uri =404;
}

# Or for a specific path
location /internal/ {
  add_header X-Robots-Tag "noindex, nofollow" always;
}
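
If your stack runs Apache instead of Nginx, a roughly equivalent sketch (assuming mod_headers is enabled) looks like this:

# Apache: httpd.conf or .htaccess, requires mod_headers
<FilesMatch "\.(pdf|csv|xls|xlsx)$">
  Header set X-Robots-Tag "noindex, nosnippet"
</FilesMatch>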

For granular control, set max-snippet, max-image-preview, and max-video-preview:

<!-- Allow text snippet up to 160 chars, large image previews -->
<meta name="robots" content="max-snippet:160, max-image-preview:large, max-video-preview:-1">

Per-crawler differentiation — block Google News from indexing while allowing Google Search:

<meta name="googlebot-news" content="noindex">
<meta name="googlebot" content="index, follow">

The unavailable_after directive for time-sensitive content (limited promotions, expiring events):

<!-- Drop this URL from the index after the date passes -->
<meta name="robots" content="unavailable_after: 2026-12-31T23:59:59Z">

For AI surface control on individual pages, combine max-snippet:0 with allowing crawl:

<!-- Page is indexable but cannot be summarized in AI Overviews -->
<meta name="robots" content="index, follow, max-snippet:0, noarchive">

To deindex a category sitewide, the X-Robots-Tag at the response level is cleaner than touching every template:

location /staff-only/ {
  add_header X-Robots-Tag "noindex, nofollow" always;
  proxy_pass http://upstream;
}

Verify the header is actually present:

curl -I https://example.com/staff-only/dashboard \
  | grep -i x-robots-tag
# Expected: X-Robots-Tag: noindex, nofollow

Do this today

  1. Audit all current noindex directives. In Screaming Frog SEO Spider, filter Indexability > Non-Indexable and review every URL. Confirm each one should be noindexed; mistakes here are common.
  2. Search your codebase for name="robots" and X-Robots-Tag. Catalog every place a directive is set. Decentralized robots logic is the #1 cause of accidental sitewide deindexation.
  3. For each URL marked noindex, verify it is not also blocked in robots.txt. Use GSC’s robots.txt report (under Settings > Crawling) to confirm. If both are set, lift the robots block first so the noindex can be processed.
  4. Inspect HTTP headers for non-HTML downloads. curl -I your top 10 PDFs, image assets, and CSV files. If they should not be indexed, add X-Robots-Tag: noindex at the server level.
  5. In GSC > URL Inspection, run the live test on a noindexed URL. Confirm Indexing allowed? says No: ‘noindex’ detected in ‘robots’ meta tag (or the header equivalent). If it says Yes, your directive is not being served.
  6. Set max-image-preview:large on every public content page. This unlocks larger image previews in Google Discover and AI Overviews — typically a 10–20% CTR lift on Discover-eligible content.
  7. Audit per-crawler directives. Search for googlebot-news, google-extended, and any custom user-agent meta tags. Document the editorial rationale for each.
  8. Add a CI test that fetches your homepage and key templates, parses headers + meta robots, and asserts index, follow (or your intended values). Catch regressions before they ship; a minimal shell sketch follows this list.
  9. For URLs you want fully removed from the index quickly, use GSC > Removals > New Request > Temporarily remove URL after serving noindex. The temporary removal hides the URL for ~6 months while Google’s recrawl picks up the permanent directive.
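
A minimal shell sketch of the CI check from step 8; the URLs and the fail-on-noindex rule are placeholders to adapt to your own templates:

#!/bin/sh
# CI guard: fail the build if a key URL starts serving an unexpected noindex,
# either in the X-Robots-Tag header or in the meta robots tag.
for url in https://example.com/ https://example.com/blog/; do
  headers=$(curl -sS -D - -o /dev/null "$url")
  body=$(curl -sS "$url")
  if printf '%s' "$headers" | grep -qi '^x-robots-tag:.*noindex'; then
    echo "FAIL: $url serves an X-Robots-Tag noindex header"; exit 1
  fi
  if printf '%s' "$body" | grep -qi '<meta[^>]*name="robots"[^>]*noindex'; then
    echo "FAIL: $url carries a meta robots noindex"; exit 1
  fi
  echo "OK: $url"
done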

