Crawling & Indexing
How Google discovers pages, the crawl-vs-index distinction, URL Inspection, Coverage report mastery, and forcing indexation (with realistic limits).
Crawling and indexing are two different jobs done by two different systems, and conflating them is the root cause of most “why isn’t my page ranking” tickets. Googlebot crawls. The Caffeine indexing pipeline (now part of the unified search infrastructure) decides whether to keep what was crawled. A page can be crawled a thousand times and still never indexed.
TL;DR
- Crawling fetches; indexing decides to keep. “Discovered – currently not indexed” and “Crawled – currently not indexed” mean Google chose to skip your page on quality, duplication, or budget grounds — not that it failed to find it.
- You cannot force Google to index; you can only make indexation more likely. Strong internal links, server-rendered HTML, a fresh `lastmod` in the sitemap, and proven topical authority move the needle. Submitting a sitemap 30 times does not.
- IndexNow is real and matters for Bing, Yandex, Naver, and ChatGPT Search. Google does not consume IndexNow (officially), but Bing-driven AI surfaces increasingly do.
The mental model
Crawling and indexing are like a librarian’s intake desk and acquisition committee. The intake desk (Googlebot) accepts every book that arrives — opens it, scans the spine, takes notes. The acquisition committee (the indexing pipeline) reads the notes and decides whether the library actually wants to keep this book on a shelf where patrons can find it.
Books that fail acquisition have no shelf. The librarian remembers receiving the book (“Discovered” or “Crawled”), but it is not in the catalog. Patrons asking for it get nothing.
In 2026, the acquisition committee is far stricter than it was in 2018. The Helpful Content signal (now a sitewide quality input) means a book in a low-quality library faces an uphill battle even if the book itself is fine. Many sites with 100,000 URLs have only 30,000–50,000 indexed because the rest fall below an internal quality threshold.
Deep dive: the 2026 reality
Google’s index is not infinite, and the team has been explicit about this since John Mueller’s 2022 confirmation. Most large sites operate at an index-to-crawled ratio of 35–70%. The remainder live in a state Google classifies as one of:
| GSC reason | What happened | What to do |
|---|---|---|
| Discovered – currently not indexed | Google found the URL but did not crawl it | Improve internal linking, reduce host load, demonstrate value |
| Crawled – currently not indexed | Google crawled but rejected the content | Improve content quality, deduplicate, add unique value |
| Duplicate, Google chose different canonical | Crawled, deemed duplicate of another URL | Audit canonicals, consolidate or differentiate |
| Excluded by ‘noindex’ tag | You told Google not to index | Confirm intentional |
| Page with redirect | URL is a redirect; index target instead | Confirm correct |
| Soft 404 | 200 OK but content reads like a 404 | Return real 404 or beef up content |
| Blocked by robots.txt | Cannot crawl | Confirm intentional |
The URL Inspection tool in GSC is the ground-truth instrument. It reports the last crawl date, the declared canonical vs. the Google-selected canonical, the rendered HTML, and any blocked resources. The live test re-fetches the URL with a fresh Googlebot request.
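Clicking through the UI does not scale past a handful of URLs; the same checks can be scripted against the Search Console URL Inspection API. A minimal sketch, assuming a verified property and an OAuth2 bearer token with the Search Console scope (token acquisition not shown); the field names mirror the `urlInspection.index.inspect` response as documented at the time of writing:

```python
# Sketch: querying GSC's URL Inspection API instead of clicking through
# the UI. Assumes `token` is a valid OAuth2 bearer token with the
# https://www.googleapis.com/auth/webmasters scope (not acquired here).
import json
import urllib.request

INSPECT_ENDPOINT = "https://searchconsole.googleapis.com/v1/urlInspection/index:inspect"

def build_inspection_request(site_url: str, page_url: str) -> dict:
    """Request body for urlInspection.index.inspect."""
    return {"siteUrl": site_url, "inspectionUrl": page_url}

def inspect(token: str, site_url: str, page_url: str) -> dict:
    body = json.dumps(build_inspection_request(site_url, page_url)).encode()
    req = urllib.request.Request(
        INSPECT_ENDPOINT,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # indexStatusResult carries the verdict, last crawl, and both canonicals
    status = result["inspectionResult"]["indexStatusResult"]
    return {
        "verdict": status.get("verdict"),
        "last_crawl": status.get("lastCrawlTime"),
        "declared_canonical": status.get("userCanonical"),
        "google_canonical": status.get("googleCanonical"),
    }
```

Loop this over your priority URLs and diff `declared_canonical` against `google_canonical`; any mismatch is a canonical audit candidate.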
IndexNow (indexnow.org) is a 2021-launched protocol where you POST a URL to bing.com/indexnow (or partner endpoints) when content changes. Bing, Yandex, Naver, Seznam, Yep, and DuckDuckGo consume the feed. OAI-SearchBot and PerplexityBot do not consume IndexNow directly, but ChatGPT Search and Copilot are downstream of Bing’s index, so a fast IndexNow ping reaches them within hours. Google still relies on its own crawl, with Gary Illyes confirming in late 2024 that Google has no IndexNow consumption plans.
The AI crawler indexing question is separate. GPTBot indexes for OpenAI training and ChatGPT. OAI-SearchBot specifically powers ChatGPT Search retrieval. Google-Extended controls whether Google can use your content for Gemini training and AI Overviews generation. Applebot-Extended controls Apple Intelligence training. Allowing or blocking these is an editorial decision, not a technical one — see Module 29 for syntax.
Visualizing it
```mermaid
sequenceDiagram
    participant Site
    participant Googlebot
    participant Indexer
    participant Index
    participant SERP
    Site->>Googlebot: URL discovered (sitemap, link, IndexNow*)
    Googlebot->>Site: Fetch HTML
    Site-->>Googlebot: 200 OK + HTML
    Googlebot->>Indexer: Hand off content
    Indexer->>Indexer: Render JS, parse, dedupe, score
    alt Quality and uniqueness pass
        Indexer->>Index: Store
        Index->>SERP: Eligible to rank
    else Quality fail or dup
        Indexer->>Index: Reject
        Index--xSERP: Crawled, not indexed
    end
```
Bad vs. expert
The bad approach
The classic panic move: when a page is not indexed, the team submits the URL to GSC’s URL Inspection > Request Indexing button repeatedly, then submits the sitemap five times, then files a Reddit post.
```text
# What junior SEOs do at 2am
1. /technical-seo/canonicals/ -> Request Indexing -> wait
2. (3 hours later) -> Request Indexing -> still nothing
3. Edit sitemap.xml lastmod to 'now' -> resubmit
4. Tweet at @googlesearchc
```
This addresses none of the actual reasons. If Google’s classifier rejected the page on quality, requesting indexing 30 times will not change the verdict. The Request Indexing button has a daily quota (around 10–12 URLs per property in 2026) and is meant for genuinely new or substantially changed pages, not as a workaround for systemic issues.
The expert approach
The expert diagnoses why first, then fixes the cause. The diagnostic flow:
```shell
# 1. URL Inspection live test in GSC
#    Confirm: indexability, canonical, render, robots

# 2. View source vs. rendered DOM diff
curl -s -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
  https://example.com/technical-seo/canonicals/ \
  | grep -E '(canonical|robots|<title>|<h1>)'

# 3. Check IndexNow propagation for Bing/ChatGPT Search
curl -X POST "https://api.indexnow.org/indexnow" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d '{
    "host": "example.com",
    "key": "9a1b3c5d7e9f1a3b5c7d9e1f3a5b7c9d",
    "keyLocation": "https://example.com/9a1b3c5d7e9f1a3b5c7d9e1f3a5b7c9d.txt",
    "urlList": [
      "https://example.com/technical-seo/canonicals/",
      "https://example.com/technical-seo/robots-txt/"
    ]
  }'
```
The fix list, in priority order:
- Fix any `noindex` directive that should not be there (check `<meta name="robots">` and the `X-Robots-Tag` HTTP header).
- Fix the canonical if the Google-selected canonical differs from your declared one.
- Add at least 3 high-quality internal links from topically related, already-indexed pages.
- Verify the page renders in headless Chrome (the URL Inspection live test exposes this). If JS gates the content, switch to SSR or pre-render.
- Update `lastmod` in the sitemap honestly. Google ignores fabricated dates, and Bing has started treating them as a spam signal since 2024.
- Improve the content if it is genuinely thin: combine it with related content, add unique data, remove boilerplate.
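The honest-`lastmod` rule is easiest to enforce at sitemap-generation time, by deriving the date from the content record rather than the clock. A minimal sketch, assuming your CMS rows expose a real `updated_at` timestamp; the `pages` data and field names here are illustrative:

```python
# Sketch: emit <lastmod> from the real content timestamp, never "now".
# `pages` stands in for rows pulled from your CMS; names are illustrative.
from datetime import datetime, timezone
from xml.etree.ElementTree import Element, SubElement, tostring

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages: list[dict]) -> bytes:
    urlset = Element("urlset", xmlns=NS)
    for page in pages:
        url = SubElement(urlset, "url")
        SubElement(url, "loc").text = page["loc"]
        # Honest lastmod: the row's updated_at, as a W3C date
        SubElement(url, "lastmod").text = page["updated_at"].strftime("%Y-%m-%d")
    return tostring(urlset, encoding="utf-8", xml_declaration=True)

pages = [
    {"loc": "https://example.com/technical-seo/canonicals/",
     "updated_at": datetime(2026, 1, 12, tzinfo=timezone.utc)},
]
print(build_sitemap(pages).decode())
```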
For high-velocity content (news, ecom inventory, programmatic), wire up IndexNow on publish/update. The API key is a simple text file at the domain root; the POST happens in your CMS hook.
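Such a hook can be a few lines. A minimal sketch, assuming the key file is already served at the domain root; the host, key value, and `on_publish` function name are illustrative, while the endpoint and payload shape follow the protocol at indexnow.org:

```python
# Sketch of a CMS publish hook that pings IndexNow on publish/update.
# HOST, KEY, and on_publish are illustrative; the payload follows indexnow.org.
import json
import urllib.request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"
HOST = "example.com"
KEY = "9a1b3c5d7e9f1a3b5c7d9e1f3a5b7c9d"  # also served at /<KEY>.txt

def indexnow_payload(urls: list[str]) -> dict:
    return {
        "host": HOST,
        "key": KEY,
        "keyLocation": f"https://{HOST}/{KEY}.txt",
        "urlList": urls,
    }

def on_publish(urls: list[str]) -> int:
    """Fire-and-forget ping from the CMS hook; returns the HTTP status."""
    body = json.dumps(indexnow_payload(urls)).encode()
    req = urllib.request.Request(
        INDEXNOW_ENDPOINT, data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status  # 200/202 means the batch was accepted
```

Batch URLs per save rather than pinging one at a time; the protocol accepts up to 10,000 URLs per POST.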
Do this today
- Open GSC > Indexing > Pages. Note your Indexed count, your Not indexed count, and click each “Why pages aren’t indexed” reason. Export each reason’s URL sample to CSV.
- For your top 20 revenue URLs (cross-reference GA4 `pagePath` with conversion events), run URL Inspection one by one. For each, confirm: "Page is indexed", User-declared canonical = Google-selected canonical, and a last crawl within 30 days.
- For any URL marked Crawled – currently not indexed, click Test Live URL. Compare the HTML tab (rendered) to your source view. If content is missing in the rendered HTML, your problem is JavaScript rendering; see Module 35.
- Generate a fresh XML sitemap with honest `<lastmod>` dates (the actual database `updatedAt`, not `Date.now()`). Submit to GSC > Sitemaps. Re-submit only when the sitemap content materially changes.
- Set up IndexNow. Generate a 32-character key, place it at `https://example.com/<key>.txt` containing only the key string, and add a POST hook in your CMS that fires on publish/update. Test with `curl` against `api.indexnow.org/indexnow`.
- Identify pages with zero internal links using Screaming Frog > Internal > Inlinks, sorted ascending. Add at least one contextual link from a topically related parent page for each orphan you intend to keep.
- In Bing Webmaster Tools > Sitemaps, submit the same sitemap. Bing’s crawler is more aggressive and faster to index than Googlebot for new content; ChatGPT Search retrieval pipes through Bing.
- Schedule a monthly indexation health check: track the Indexed/Total ratio over time. A declining ratio means Google is rejecting more of your URLs each month — quality intervention is needed.
- For any URL still Discovered – currently not indexed after 60 days, audit content depth. If the page brings nothing unique to the topic, either consolidate it via 301 into a stronger page or remove it.
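The monthly health check reduces to a small script. A minimal sketch, assuming you record the Indexed and Not indexed counts from the GSC Pages report each month; the numbers below are invented:

```python
# Sketch: monthly indexation health check from logged GSC Pages-report counts.
# Each entry is (month, indexed, not_indexed); the data here is illustrative.
history = [
    ("2026-01", 42_000, 58_000),
    ("2026-02", 40_500, 60_100),
    ("2026-03", 38_900, 62_300),
]

def ratios(rows):
    """Indexed / Total ratio per month."""
    return [(m, indexed / (indexed + not_indexed)) for m, indexed, not_indexed in rows]

def declining(rows, months=2):
    """Flag when the indexed ratio has fallen for `months` consecutive checks."""
    r = [ratio for _, ratio in ratios(rows)]
    return len(r) > months and all(r[i] < r[i - 1] for i in range(len(r) - months, len(r)))

if declining(history):
    print("Indexed ratio declining - schedule a content quality audit")
```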