Module 003 Intermediate 14 min read

How Search Engines Work

The five-stage search pipeline: Discovery, Crawling, Rendering, Indexing, Ranking. Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, RankBrain, BERT, MUM, Gemini, index tiers.

By SEO Mastery Editorial

Every SEO decision you will ever make sits on top of one pipeline: Discovery, Crawling, Rendering, Indexing, Ranking. If you do not understand which stage a problem lives in, you will fix the wrong thing. A page that is not indexed cannot rank, and a page stuck in the render queue never reaches the priority index tier; treating a rendering problem as a ranking problem can waste a quarter of work.

TL;DR

  • Search is a five-stage pipeline. Discovery feeds Crawling feeds Rendering feeds Indexing feeds Ranking. Every “why isn’t this ranking” question maps to a specific stage and you debug by walking the pipeline in order.
  • The crawler population exploded. Googlebot is no longer alone. GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended, Bytespider, meta-externalagent, and CCBot all hit your origin in 2026 — each with different access patterns, intent, and impact on your bandwidth bill.
  • Ranking is multi-system, not one algorithm. Google’s ranking is the orchestrated output of RankBrain (2015), BERT (2019), MUM (2021), SpamBrain, the Helpful Content system (now in core), Twiddlers (re-rankers), and Gemini-powered AI Overview generation — all reading the same index but applying different signals.

The mental model

A search engine is like a city library that gets a million new books every hour. Discovery is the postal worker who notices a book exists. Crawling is the courier who picks it up. Rendering is the desk where the cover is opened and the contents are actually read. Indexing is the cataloger who decides which shelf the book belongs on, what subject tags it gets, and whether it is worth shelving at all. Ranking is the librarian who, when a patron asks a question, walks to the right shelf and chooses which book to recommend first.

Every stage has limits. The postal worker can only notice books they get notified about (sitemaps, links, IndexNow). The courier has a daily route capacity (crawl budget). The reading desk has finite chairs (rendering tier capacity, especially for JavaScript-heavy pages). The cataloger has shelf limits (the priority index versus the “discovered, not indexed” purgatory). The librarian’s recommendations change based on the patron, the time of day, and whether a new librarian-in-training (a Twiddler, an AI Overview generator) wants to chime in.

The pipeline metaphor matters because debugging the wrong stage is the most expensive mistake in SEO. “I rewrote the page three times and it still doesn’t rank” usually turns out to be “the page was never indexed because robots.txt blocked it” or “the page was never rendered because your React app crashes for Googlebot’s Web Rendering Service (WRS).”

Deep dive: the 2026 reality

Stage 1 — Discovery. Google learns about a URL through five channels: external backlinks (the original signal), XML sitemaps, internal links from already-known URLs, IndexNow pings (Bing-led, Google does not officially honor it), and the GSC URL Inspection > “Request Indexing” button (rate-limited to ~10/day). For new sites, the strongest discovery signal is a backlink from an already-crawled site — a sitemap alone for a brand-new domain typically takes 4-14 days to produce a first crawl.
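
IndexNow itself is just one HTTP request, so if your CMS does not ping on publish you can do it yourself. A minimal sketch; the domain, key, and key-file location are placeholders you would replace with your own:

// Minimal IndexNow ping — notifies Bing and other IndexNow-enabled engines of new or updated URLs.
// The key must also be served as a plain-text file at keyLocation so the engine can verify ownership.
const INDEXNOW_ENDPOINT = 'https://api.indexnow.org/indexnow';

async function pingIndexNow(urls: string[]) {
  const res = await fetch(INDEXNOW_ENDPOINT, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json; charset=utf-8' },
    body: JSON.stringify({
      host: 'your-domain.com',                                        // placeholder
      key: 'your-indexnow-key',                                       // placeholder
      keyLocation: 'https://your-domain.com/your-indexnow-key.txt',   // placeholder
      urlList: urls,
    }),
  });
  // 200/202 means the submission was accepted; 4xx usually means a key or host mismatch.
  console.log(`IndexNow responded ${res.status}`);
}

pingIndexNow(['https://your-domain.com/blog/new-post']);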

Stage 2 — Crawling. Googlebot fetches the URL. Crawl budget is governed by crawl rate limit (how many parallel connections your server can handle without 5xx errors) and crawl demand (how much Google wants the URL based on freshness, popularity, and depth). For sites under ~10K URLs crawl budget is rarely the bottleneck. Above ~100K URLs it is the dominant constraint. The other crawlers in 2026:

Crawler | Operator | Purpose | UA string contains
--- | --- | --- | ---
Googlebot | Google | Search index | Googlebot
Bingbot | Microsoft | Bing index | bingbot
Google-Extended | Google | Gemini training, AI Overviews extraction | (none; robots.txt product token only)
GPTBot | OpenAI | Training data | GPTBot
OAI-SearchBot | OpenAI | ChatGPT Search live retrieval | OAI-SearchBot
ClaudeBot | Anthropic | Claude training and web search | ClaudeBot
PerplexityBot | Perplexity | Perplexity index | PerplexityBot
Bytespider | ByteDance | TikTok / Doubao | Bytespider
meta-externalagent | Meta | Llama, Meta AI | meta-externalagent
CCBot | Common Crawl | Public dataset (used by many LLMs) | CCBot
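
To see which of these crawlers are actually hitting your origin, and how often, a raw access-log tally is enough. A rough sketch, assuming a combined-format log where the user-agent appears somewhere on each line (Google-Extended is left out because it is a robots.txt token, not a user-agent):

// Tally hits per crawler by matching the UA substrings from the table above.
import { readFileSync } from 'node:fs';

const CRAWLER_TOKENS = [
  'Googlebot', 'bingbot', 'GPTBot', 'OAI-SearchBot',
  'ClaudeBot', 'PerplexityBot', 'Bytespider', 'meta-externalagent', 'CCBot',
];

const counts: Record<string, number> = Object.fromEntries(CRAWLER_TOKENS.map(t => [t, 0]));

// 'access.log' is a placeholder — point this at your real web server log.
for (const line of readFileSync('access.log', 'utf8').split('\n')) {
  for (const token of CRAWLER_TOKENS) {
    if (line.includes(token)) { counts[token] += 1; break; }
  }
}

console.table(counts);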

Stage 3 — Rendering. Googlebot’s renderer is an evergreen headless Chromium kept in step with stable Chrome (Chrome 124+ as of April 2026). JavaScript is rendered, but in a second wave, typically minutes to days behind the initial crawl. AI crawlers are inconsistent: OAI-SearchBot and PerplexityBot render JS (as of late 2025), but GPTBot and ClaudeBot have historically not executed client-side JavaScript reliably. If your content depends on JS to render, you are betting on a smaller crawler population.
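
A quick way to estimate what a non-rendering crawler receives is to fetch the raw HTML yourself, without executing any JavaScript, and check that the primary content is already in it. A rough sketch; the URL and the phrases to look for are placeholders:

// Fetch the raw HTML (no JavaScript execution) and check whether key content is present.
// This approximates what a non-rendering crawler such as GPTBot sees — it is not the crawler itself.
async function checkRawHtml(url: string, mustContain: string[]) {
  const res = await fetch(url, { headers: { 'User-Agent': 'raw-html-check/1.0' } });
  const html = await res.text();
  for (const phrase of mustContain) {
    console.log(`${html.includes(phrase) ? 'OK  ' : 'MISS'} ${phrase}`);
  }
}

// Placeholders: use your own URL, your H1 text, and a sentence from the body copy.
checkRawHtml('https://your-domain.com/products/example', [
  'Example Product Name',
  'A sentence that only appears in the product description',
]);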

Stage 4 — Indexing. Google does not have one index — it has tiers. The leaked Content Warehouse documents (May 2024) confirmed at least three: a fast-access primary tier (flash storage), a slower secondary tier, and long-tail storage. Pages with low click-through, low link-equity, or low Helpful Content scores get demoted to the slower tiers and become essentially invisible for competitive queries. In GSC, “Crawled — currently not indexed” means the URL was fetched but not catalogued; “Discovered — currently not indexed” means Google knows the URL exists but has not crawled it yet.
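
If you want to check coverage states in bulk rather than one URL at a time in GSC, the Search Console URL Inspection API returns the same information. A sketch, assuming you already have an OAuth access token with the Search Console scope (obtaining the token is omitted):

// Check a URL's coverage state via the Search Console URL Inspection API.
// accessToken is a placeholder — obtain it via OAuth for a property you have verified.
async function inspectUrl(inspectionUrl: string, siteUrl: string, accessToken: string) {
  const res = await fetch(
    'https://searchconsole.googleapis.com/v1/urlInspection/index:inspect',
    {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${accessToken}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ inspectionUrl, siteUrl }),
    }
  );
  const data = await res.json();
  // coverageState reads e.g. "Submitted and indexed" or "Crawled - currently not indexed".
  console.log(data.inspectionResult?.indexStatusResult?.coverageState);
}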

Stage 5 — Ranking. A query triggers a multi-system response:

  • Query understanding — the query is parsed by NLP systems including RankBrain (2015, query-intent embeddings), BERT (2019, bidirectional context), MUM (2021, multimodal multi-task, 1000x more powerful than BERT per Google), and Neural Matching.
  • Initial retrieval — candidate documents are pulled from the index using inverted indexes plus vector similarity.
  • Ranking — candidates are scored by hundreds of signals: PageRank-derived link signals, on-page relevance, freshness, location, language, Core Web Vitals (LCP, INP, CLS), and the Helpful Content classifier.
  • Re-ranking and Twiddlers — final adjustments by specialized re-rankers, including Navboost (click-based), freshness boosts, and demotion Twiddlers for spam, exact-match domains, and low-quality content (a toy sketch of this two-pass scoring follows the list).
  • AI generation — for queries that trigger an AI Overview or AI Mode, a Gemini-class model summarizes the top results with inline citations.
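
None of these systems are public, but the overall shape (a base relevance score with small boosts and demotions layered on top) can be shown with a toy example. The signals and weights below are invented purely for illustration; they are not Google's actual code:

// Toy illustration of base scoring followed by re-ranker ("Twiddler-style") adjustments.
// All signals and weights are made up for illustration.
interface Candidate {
  url: string;
  relevance: number;   // on-page relevance, 0..1
  linkSignal: number;  // PageRank-style link score, 0..1
  freshDays: number;   // days since last significant update
  spamRisk: boolean;
}

function baseScore(c: Candidate): number {
  return 0.6 * c.relevance + 0.4 * c.linkSignal;
}

function rerank(candidates: Candidate[]): Candidate[] {
  return candidates
    .map(c => {
      let score = baseScore(c);
      if (c.freshDays < 30) score *= 1.1; // freshness boost
      if (c.spamRisk) score *= 0.5;       // spam demotion
      return { c, score };
    })
    .sort((a, b) => b.score - a.score)
    .map(x => x.c);
}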

The 2026 wrinkle is that AI Overviews ride on top of the same ranking system but pick differently. The cited sources in an AI Overview are not always the top 10 — Google’s grounding model favors sources with clean entity markup, schema.org coverage, and answer-first paragraph structure even if their classical rank is page 2.

Visualizing it

flowchart TD
  D["Discovery (sitemap, links, IndexNow)"] --> C["Crawling (Googlebot, GPTBot, ClaudeBot)"]
  C --> R["Rendering (headless Chromium, second wave)"]
  R --> I["Indexing (priority tier, secondary, long-tail)"]
  I --> Q["Query received"]
  Q --> NLP["Query understanding (RankBrain, BERT, MUM)"]
  NLP --> RT["Retrieval (inverted index + vectors)"]
  RT --> RK["Ranking (signals + Helpful Content)"]
  RK --> TW["Twiddlers (Navboost, freshness, demotion)"]
  TW --> AI["AI Overview generation (Gemini)"]
  TW --> SERP["Classic SERP"]
  AI --> SERP

Bad vs. expert

The bad approach

// Single-page React app, no SSR, no prerendering
import { useEffect, useState } from 'react';

export default function ProductPage({ id }) {
  const [product, setProduct] = useState(null);
  useEffect(() => {
    fetch(`/api/products/${id}`).then(r => r.json()).then(setProduct);
  }, [id]);
  if (!product) return <div>Loading...</div>;
  return <article>{product.title}</article>;
}

This fails at the rendering stage. Googlebot will queue it for the second-wave renderer, but GPTBot and ClaudeBot will see only <div>Loading...</div> and have nothing to index. The page may rank in Google after the second wave, but it will never appear in a ChatGPT Search citation. Multiply this across 50,000 product URLs and you have a phantom catalog.

The expert approach

// Next.js 15 App Router with static generation + ISR
// app/products/[id]/page.tsx
export const revalidate = 3600; // 1 hour

export async function generateStaticParams() {
  const products = await getAllProducts();
  return products.map(p => ({ id: p.slug }));
}

export async function generateMetadata({ params }) {
  const { id } = await params; // params is a Promise in Next.js 15
  const product = await getProduct(id);
  return {
    title: `${product.name} | Brand`,
    description: product.summary,
    alternates: { canonical: `https://brand.com/products/${product.slug}` }
  };
}

export default async function ProductPage({ params }) {
  const { id } = await params; // params is a Promise in Next.js 15
  const product = await getProduct(id);
  return (
    <article>
      <h1>{product.name}</h1>
      <p>{product.summary}</p>
      <script type="application/ld+json"
        dangerouslySetInnerHTML={{ __html: JSON.stringify({
          '@context': 'https://schema.org',
          '@type': 'Product',
          name: product.name,
          description: product.summary,
          offers: { '@type': 'Offer', price: product.price, priceCurrency: 'USD' }
        })}} />
    </article>
  );
}

This works because the HTML response contains the indexable content for every crawler, including the ones that do not run JS. Static generation with ISR gives you fresh data without per-request server cost. Schema.org markup feeds AI Overviews. Canonical tags resolve duplicate URL paths.

Do this today

  1. Open Google Search Console > Pages. Look at “Why pages aren’t indexed” — note the top three reasons. Each reason maps to a specific pipeline stage.
  2. Run Screaming Frog (free for ≤500 URLs) on your site with “Rendering: JavaScript” enabled. Compare the rendered word count vs. raw HTML word count. A delta > 30% means you have a rendering risk for non-JS crawlers.
  3. Open your robots.txt at https://your-domain.com/robots.txt. Verify each AI crawler is either explicitly allowed or falls under a permissive User-agent: * group. The crawlers to check: Googlebot, Bingbot, Google-Extended, GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot (a small checker sketch follows this list).
  4. Use GSC’s URL Inspection tool on your most important URL > “Test live URL” > “View tested page” > “HTML.” Confirm the rendered HTML contains your H1 and primary content.
  5. In GSC > Settings > Crawl stats, check average response time. Above 800ms means crawl budget is being throttled by your server speed. Above 2000ms is a 5xx-risk zone.
  6. Submit an XML sitemap at /sitemap.xml via GSC > Sitemaps and via Bing Webmaster Tools > Sitemaps. Confirm both report the sitemap as successfully fetched.
  7. Run Bing Webmaster Tools > Site Explorer > IndexNow tab and verify your CMS or framework is sending IndexNow pings on publish. Astro, Next.js, and WordPress have official adapters.
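
Step 3 can be scripted. A rough sketch that fetches robots.txt and reports, for each crawler, whether a dedicated rule group exists; it checks group presence only and does not implement full robots.txt path-matching:

// Fetch robots.txt and report which crawler user-agents have a dedicated rule group.
// Presence check only — Allow/Disallow precedence is not evaluated here.
const BOTS = [
  'Googlebot', 'Bingbot', 'Google-Extended', 'GPTBot',
  'OAI-SearchBot', 'ClaudeBot', 'PerplexityBot',
];

async function auditRobots(origin: string) {
  const res = await fetch(new URL('/robots.txt', origin));
  const lines = (await res.text()).split('\n').map(l => l.trim().toLowerCase());
  for (const bot of BOTS) {
    const hasGroup = lines.some(
      l => l.startsWith('user-agent:') && l.slice('user-agent:'.length).trim() === bot.toLowerCase()
    );
    console.log(
      `${bot}: ${hasGroup
        ? 'has a dedicated rule group; review its Allow/Disallow lines'
        : 'falls back to the User-agent: * group'}`
    );
  }
}

auditRobots('https://your-domain.com'); // placeholder domain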
