Module 028 Advanced 17 min read

Crawling & Indexing

How Google discovers pages, the crawl-vs-index distinction, URL Inspection, Coverage report mastery, and forcing indexation (with realistic limits).

By SEO Mastery Editorial

Crawling and indexing are two different jobs done by two different systems, and conflating them is the root cause of most “why isn’t my page ranking” tickets. Googlebot crawls. The Caffeine indexing pipeline (now part of the unified search infrastructure) decides whether to keep what was crawled. A page can be crawled a thousand times and still never indexed.

TL;DR

  • Crawling fetches; indexing decides to keep. “Discovered – currently not indexed” and “Crawled – currently not indexed” mean Google chose to skip your page on quality, duplication, or budget grounds — not that it failed to find it.
  • You cannot force Google to index. You can make indexation more likely. Strong internal links, server-rendered HTML, fresh lastmod in the sitemap, and proven topical authority move the needle. Submitting a sitemap 30 times does not.
  • IndexNow is real and matters for Bing, Yandex, Naver, and ChatGPT Search. Google does not consume IndexNow (officially), but Bing-driven AI surfaces increasingly do.

The mental model

Crawling and indexing are like a librarian’s intake desk and acquisition committee. The intake desk (Googlebot) accepts every book that arrives — opens it, scans the spine, takes notes. The acquisition committee (the indexing pipeline) reads the notes and decides whether the library actually wants to keep this book on a shelf where patrons can find it.

Books that fail acquisition have no shelf. The librarian remembers receiving the book (“Discovered” or “Crawled”), but it is not in the catalog. Patrons asking for it get nothing.

In 2026, the acquisition committee is far stricter than it was in 2018. The Helpful Content signal (now a sitewide quality input) means a book in a low-quality library faces an uphill battle even if the book itself is fine. Many sites with 100,000 URLs have only 30,000–50,000 indexed because the rest fall below an internal quality threshold.

Deep dive: the 2026 reality

Google’s index is not infinite, and the team has been explicit about this since John Mueller’s 2022 confirmation. Most large sites operate at an index-to-crawled ratio of 35–70%. The remainder live in a state Google classifies as one of:

| GSC reason | What happened | What to do |
| --- | --- | --- |
| Discovered – currently not indexed | Google found the URL but did not crawl it | Improve internal linking, reduce host load, demonstrate value |
| Crawled – currently not indexed | Google crawled but rejected the content | Improve content quality, deduplicate, add unique value |
| Duplicate, Google chose different canonical | Crawled, deemed duplicate of another URL | Audit canonicals, consolidate or differentiate |
| Excluded by ‘noindex’ tag | You told Google not to index | Confirm it is intentional |
| Page with redirect | URL is a redirect; the target gets indexed instead | Confirm the redirect is correct |
| Soft 404 | 200 OK but the content reads like a 404 | Return a real 404 or beef up the content |
| Blocked by robots.txt | Googlebot cannot crawl the URL | Confirm it is intentional |
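The reason breakdown above maps directly onto the CSV you can export from GSC > Indexing > Pages, so triage can be scripted. A minimal sketch; the `URL` and `Reason` column names are assumptions based on a typical export, so check your actual CSV headers:

```python
import csv
from collections import Counter
from io import StringIO

def tally_coverage_reasons(csv_text: str, reason_col: str = "Reason") -> Counter:
    """Count URL rows per non-indexed reason in a GSC Pages export."""
    reader = csv.DictReader(StringIO(csv_text))
    return Counter(row[reason_col] for row in reader)

# Fabricated two-row export for illustration; real exports carry more columns.
sample = """URL,Reason
https://example.com/a,Crawled - currently not indexed
https://example.com/b,Discovered - currently not indexed
"""
counts = tally_coverage_reasons(sample)
```

Sorting the resulting counts descending tells you which fix (linking, quality, canonicals) will move the most URLs.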

The URL Inspection tool in GSC is the ground-truth instrument. It shows the last crawl date, the declared canonical vs. the Google-selected canonical, the rendered HTML, and any blocked resources. The live test re-fetches the page with a fresh Googlebot request.
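For checking more than a handful of URLs, the same data is available through the Search Console URL Inspection API rather than the UI. A sketch of the request body (OAuth authentication is omitted; the endpoint is the official `urlInspection/index:inspect` method, and `siteUrl` must match your verified property exactly):

```python
import json

INSPECT_ENDPOINT = (
    "https://searchconsole.googleapis.com/v1/urlInspection/index:inspect"
)

def build_inspection_request(url: str, property_url: str) -> str:
    """JSON body for the URL Inspection API.

    The response's indexStatusResult carries the last crawl time,
    user-declared vs. Google-selected canonical, and robots state.
    """
    return json.dumps({"inspectionUrl": url, "siteUrl": property_url})

body = build_inspection_request(
    "https://example.com/technical-seo/canonicals/",
    "https://example.com/",  # must be the verified GSC property string
)
```

POST this body with an OAuth bearer token for a user who has access to the property; the API is quota-limited, so batch your top revenue URLs rather than the whole site.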

IndexNow (indexnow.org) is a protocol, launched in 2021, in which you POST changed URLs to bing.com/indexnow (or a partner endpoint). Bing, Yandex, Naver, Seznam, Yep, and DuckDuckGo consume the feed. OAI-SearchBot and PerplexityBot do not consume IndexNow directly, but ChatGPT Search and Copilot are downstream of Bing’s index, so a fast IndexNow ping reaches them within hours. Google still relies on its own crawl, with Gary Illyes confirming in late 2024 that Google has no IndexNow consumption plans.

The AI crawler indexing question is separate. GPTBot crawls for OpenAI model training. OAI-SearchBot specifically powers ChatGPT Search retrieval. Google-Extended controls whether your content can be used for Gemini model training and grounding; per Google’s documentation it does not affect Google Search itself, including AI Overviews. Applebot-Extended controls Apple Intelligence training. Allowing or blocking these is an editorial decision, not a technical one — see Module 29 for syntax.

Visualizing it

sequenceDiagram
  participant Site
  participant Googlebot
  participant Indexer
  participant Index
  participant SERP
  Site->>Googlebot: URL discovered (sitemap, link, IndexNow*)
  Googlebot->>Site: Fetch HTML
  Site-->>Googlebot: 200 OK + HTML
  Googlebot->>Indexer: Hand off content
  Indexer->>Indexer: Render JS, parse, dedupe, score
  alt Quality and uniqueness pass
    Indexer->>Index: Store
    Index->>SERP: Eligible to rank
  else Quality fail or dup
    Indexer->>Index: Reject
    Index--xSERP: Crawled, not indexed
  end

Bad vs. expert

The bad approach

The classic panic move: when a page is not indexed, the team submits the URL to GSC’s URL Inspection > Request Indexing button repeatedly, then submits the sitemap five times, then files a Reddit post.

# What junior SEOs do at 2am
1. /technical-seo/canonicals/ -> Request Indexing -> wait
2. (3 hours later) -> Request Indexing -> still nothing
3. Edit sitemap.xml lastmod to 'now' -> resubmit
4. Tweet at @googlesearchc

This addresses none of the actual reasons. If Google’s classifier rejected the page on quality, requesting indexing 30 times will not change the verdict. The Request Indexing button has a daily quota (around 10–12 URLs per property in 2026) and is meant for genuinely new or substantially changed pages, not as a workaround for systemic issues.

The expert approach

The expert diagnoses why first, then fixes the cause. The diagnostic flow:

# 1. URL Inspection live test in GSC
# Confirm: indexability, canonical, render, robots
# 2. View source vs. rendered DOM diff
curl -s -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
  https://example.com/technical-seo/canonicals/ \
  | grep -E '(canonical|robots|<title>|<h1>)'

# 3. Check IndexNow propagation for Bing/ChatGPT Search
curl -X POST "https://api.indexnow.org/IndexNow" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d '{
    "host": "example.com",
    "key": "9a1b3c5d7e9f1a3b5c7d9e1f3a5b7c9d",
    "keyLocation": "https://example.com/9a1b3c5d7e9f1a3b5c7d9e1f3a5b7c9d.txt",
    "urlList": [
      "https://example.com/technical-seo/canonicals/",
      "https://example.com/technical-seo/robots-txt/"
    ]
  }'

The fix list, in priority order:

  1. Fix any noindex directive that should not be there (check <meta name="robots"> and the X-Robots-Tag HTTP header).
  2. Fix the canonical if the Google-selected canonical differs from your declared one.
  3. Add at least 3 high-quality internal links from topically related, already-indexed pages.
  4. Verify the page renders in headless Chrome (the URL Inspection live test exposes this). If JS gates the content, switch to SSR or pre-render.
  5. Update lastmod in the sitemap honestly — Google ignores fabricated dates and Bing has started flagging them as spam signals since 2024.
  6. Improve the content if it is genuinely thin — combine with related content, add unique data, remove boilerplate.

For high-velocity content (news, ecom inventory, programmatic), wire up IndexNow on publish/update. The API key is a simple text file at the domain root; the POST happens in your CMS hook.
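A publish hook along these lines covers that: the payload fields follow the IndexNow spec, the endpoint mirrors the curl example above, and the host and key values are placeholders you would swap for your own.

```python
import json
import urllib.request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/IndexNow"
HOST = "example.com"                      # your domain, no scheme
KEY = "9a1b3c5d7e9f1a3b5c7d9e1f3a5b7c9d"  # contents of /<key>.txt at the root

def build_indexnow_payload(urls: list[str]) -> dict:
    """Payload per the IndexNow spec: host, key, keyLocation, urlList."""
    return {
        "host": HOST,
        "key": KEY,
        "keyLocation": f"https://{HOST}/{KEY}.txt",
        "urlList": urls,
    }

def ping_indexnow(urls: list[str]) -> None:
    """Fire-and-forget POST; call from your CMS publish/update hook."""
    req = urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(build_indexnow_payload(urls)).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    urllib.request.urlopen(req, timeout=10)  # 200/202 means accepted
```

Queue the ping asynchronously so a slow IndexNow endpoint never blocks your publish path.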

Do this today

  1. Open GSC > Indexing > Pages. Note your Indexed count, your Not indexed count, and click each “Why pages aren’t indexed” reason. Export each reason’s URL sample to CSV.
  2. For your top 20 revenue URLs (cross-reference GA4 pagePath with conversion events), run URL Inspection one by one. For each, confirm: Page is indexed, User-declared canonical = Google-selected canonical, and Last crawl within 30 days.
  3. For any URL marked Crawled – currently not indexed, click Test Live URL. Compare the HTML tab (rendered) to your source view. If content is missing in the rendered HTML, your problem is JavaScript rendering — see Module 35.
  4. Generate a fresh XML sitemap with honest <lastmod> dates (the actual database updatedAt, not Date.now()). Submit to GSC > Sitemaps. Re-submit only when the sitemap content materially changes.
  5. Set up IndexNow. Generate a 32-char key, place it as https://example.com/<key>.txt containing only the key string, and add a POST hook in your CMS that fires on publish/update. Test with curl against api.indexnow.org/IndexNow.
  6. Identify pages with zero internal links using Screaming Frog > Internal > Inlinks sorted ascending. Add at least one contextual link from a topically related parent page for each orphan you intend to keep.
  7. In Bing Webmaster Tools > Sitemaps, submit the same sitemap. Bing’s crawler is more aggressive and faster to index than Googlebot for new content; ChatGPT Search retrieval pipes through Bing.
  8. Schedule a monthly indexation health check: track the Indexed/Total ratio over time. A declining ratio means Google is rejecting more of your URLs each month — quality intervention is needed.
  9. For any URL still Discovered – currently not indexed after 60 days, audit content depth. If the page brings nothing unique to the topic, either consolidate it via 301 into a stronger page or remove it.
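Step 4 above — lastmod from the real database timestamp, never from Date.now() — can be sketched like this (the record shape with `loc` and `updated_at` fields is an assumption; adapt it to your schema):

```python
from datetime import datetime, timezone
from xml.etree.ElementTree import Element, SubElement, tostring

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(records: list[dict]) -> bytes:
    """Records are {'loc': url, 'updated_at': datetime} rows from your DB.

    lastmod is derived from the stored updated_at, so it only changes
    when the content actually changed.
    """
    urlset = Element("urlset", xmlns=SITEMAP_NS)
    for rec in records:
        url = SubElement(urlset, "url")
        SubElement(url, "loc").text = rec["loc"]
        SubElement(url, "lastmod").text = rec["updated_at"].date().isoformat()
    return tostring(urlset, encoding="utf-8", xml_declaration=True)

xml = build_sitemap([
    {"loc": "https://example.com/technical-seo/canonicals/",
     "updated_at": datetime(2026, 1, 15, tzinfo=timezone.utc)},
])
```

Regenerate on a schedule or on publish, and resubmit to GSC only when the output actually differs from the last generated file.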

