Module 030 Intermediate 12 min read

XML Sitemaps

Structure, sitemap indexes for large sites, image/video/news/hreflang variants, lastmod hygiene, and submission to Google and Bing.

By SEO Mastery Editorial

An XML sitemap is the URL list you hand the librarian: “these are the books I want in the catalog.” It does not guarantee indexation, but for any site over a few hundred pages it materially improves discovery, helps Google distinguish freshly updated pages from stale ones, and gives crawlers machine-readable <lastmod>, hreflang, image, video, and news context in one place.

TL;DR

  • <lastmod> is the most important element you under-use. Google and Bing have both said (2023–2024) that they rely on <lastmod> only when it is consistently accurate; dates that do not match real content changes teach them to ignore the field for your whole site. Set it from your database updatedAt field, not Date.now() at build.
  • Use a sitemap index for any site over ~10,000 URLs. The 50,000-URL / 50 MB uncompressed limits per sitemap file are hard. Split by content type or section so you can monitor indexation per slice.
  • Submit to both Google and Bing. Bing’s index feeds Copilot and is one of the sources behind ChatGPT Search. Add the sitemap URL as a Sitemap: directive in robots.txt so AI crawlers like PerplexityBot and OAI-SearchBot can discover it.

The mental model

A sitemap is a curator’s wall label, not a backstage pass. It tells the museum which artifacts you consider noteworthy and when they were last updated, but the museum still decides which to display. Submitting 100,000 URLs and seeing only 30,000 of them in Google’s index is normal: the curator is being selective, and that is healthy.

The sitemap is also your freshness oracle. When a URL’s <lastmod> advances, Googlebot can prioritize recrawling that URL. When <lastmod> is unchanged since the last visit, the bot can skip it and save crawl budget for actual changes. Fabricated dates poison this oracle.

Deep dive: the 2026 reality

Sitemap formats per the sitemaps.org 0.9 standard plus Google’s extensions:

| Variant | Use | Required for |
| --- | --- | --- |
| Standard sitemap | URL list with <loc>, <lastmod> | Every site |
| Sitemap index | Sitemap of sitemaps | >10K URLs or multi-site |
| Image sitemap | Image URLs per page | Visual-heavy sites |
| Video sitemap | Video metadata per page | Video publishers |
| News sitemap | Articles published in last 48h | Google News inclusion |
| hreflang annotations | Language/region variants | Multi-locale sites |

Hard limits: each sitemap file is capped at 50,000 URLs or 50 MB uncompressed. A sitemap index can reference up to 50,000 child sitemaps, giving you a theoretical ceiling of 2.5 billion URLs per index. In practice, you split by section for readability — sitemap-products.xml, sitemap-blog.xml, sitemap-locations.xml.
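The URL-count limit above reduces to a chunking step. A minimal sketch, assuming you already have a flat array of absolute URLs; `chunkUrls`, `productUrls`, and the file-name pattern are hypothetical names for illustration, and only the 50,000-URL cap is enforced (the 50 MB cap would need a separate byte-length check on the rendered XML):

```typescript
// Per-file URL cap from the sitemaps.org protocol.
const MAX_URLS_PER_SITEMAP = 50000;

// Split a flat URL list into chunks that each fit in one sitemap file.
function chunkUrls(urls: string[], size: number = MAX_URLS_PER_SITEMAP): string[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    chunks.push(urls.slice(i, i + size));
  }
  return chunks;
}

// Example: 120,000 product URLs need three files (50K + 50K + 20K).
const productUrls: string[] = Array.from(
  { length: 120000 },
  (_unused, i) => `https://example.com/products/${i}`,
);
const productChunks = chunkUrls(productUrls);
const fileNames = productChunks.map((_chunk, i) => `sitemap-products-${i + 1}.xml`);
```

Each chunk becomes one child file, and the file names are what the sitemap index lists.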

Google’s documentation states that it ignores the <priority> and <changefreq> elements. Bing and Yandex may still parse them, but <lastmod> carries far more weight. Stop including them.

<lastmod> hygiene is now the actual technical lever. The November 2023 Search Off The Record episode and Gary Illyes’s 2024 SMX talk made it explicit: Google ignores <lastmod> it cannot trust, and trust is built by consistency. If your <lastmod> says 2026-05-07 but the page’s content has not changed in two years, Google learns to ignore your dates entirely.
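One way to keep <lastmod> consistent is to advance it only when the page’s main content actually changes. The sketch below is an assumption, not something the sources above prescribe: it stores a content hash next to the date and keeps the old date when the hash is unchanged. `LastmodRecord`, `nextLastmod`, and the FNV-1a hash are illustrative choices; a production build would more likely use SHA-256 and a real datastore.

```typescript
// Stored per URL between builds (hypothetical shape).
interface LastmodRecord {
  contentHash: number;
  lastmod: string; // ISO 8601 timestamp
}

// FNV-1a 32-bit hash: enough for change detection in a sketch.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Keep the previous lastmod unless the main content changed.
function nextLastmod(
  previous: LastmodRecord | undefined,
  mainContent: string,
  now: Date,
): LastmodRecord {
  const contentHash = fnv1a(mainContent);
  if (previous !== undefined && previous.contentHash === contentHash) {
    return previous; // rebuild without a content change: date stays put
  }
  return { contentHash, lastmod: now.toISOString() };
}

const firstBuild = nextLastmod(undefined, "Widget specs v1", new Date("2026-04-01T10:00:00Z"));
const rebuild = nextLastmod(firstBuild, "Widget specs v1", new Date("2026-05-07T08:00:00Z"));
const edited = nextLastmod(rebuild, "Widget specs v2", new Date("2026-05-07T09:00:00Z"));
```

Hash only the content a reader would notice changing. Hashing the full HTML would advance the date on every footer or template tweak, which is exactly the noise you are trying to remove.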

hreflang annotations belong either in the sitemap or in <link rel="alternate" hreflang> tags or in HTTP headers — pick one and be consistent. Mixing produces ambiguity. The sitemap pattern is preferred for sites with > 100 locale-pair combinations, because one XML edit beats updating every page.
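In the sitemap pattern, every locale variant gets its own <url> entry, and each entry lists the full set of alternates, itself included. A minimal sketch of generating those entries from one list of variants; `LocaleVariant` and `renderHreflangEntries` are hypothetical names, and the enclosing <urlset> is assumed to declare xmlns:xhtml="http://www.w3.org/1999/xhtml":

```typescript
interface LocaleVariant {
  hreflang: string; // e.g. "en-us"
  href: string;     // absolute URL of that variant
}

function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// One <url> per variant; every <url> carries the same reciprocal link set.
function renderHreflangEntries(variants: LocaleVariant[], xDefaultHref: string): string {
  const allLinks: LocaleVariant[] = [...variants, { hreflang: "x-default", href: xDefaultHref }];
  const links = allLinks
    .map(
      (v) =>
        `    <xhtml:link rel="alternate" hreflang="${escapeXml(v.hreflang)}" href="${escapeXml(v.href)}"/>`,
    )
    .join("\n");
  return variants
    .map((v) => `  <url>\n    <loc>${escapeXml(v.href)}</loc>\n${links}\n  </url>`)
    .join("\n");
}

const widgetEntries = renderHreflangEntries(
  [
    { hreflang: "en-us", href: "https://example.com/en-us/products/widget" },
    { hreflang: "en-gb", href: "https://example.com/en-gb/products/widget" },
    { hreflang: "de-de", href: "https://example.com/de-de/products/widget" },
  ],
  "https://example.com/en-us/products/widget",
);
```

Because every entry is built from the same list, the annotations come out reciprocal, which hand-maintained page tags often fail to be.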

Visualizing it

flowchart TD
  R["robots.txt: Sitemap: https://example.com/sitemap-index.xml"] --> I[sitemap-index.xml]
  I --> A[sitemap-products.xml<br/>40K URLs]
  I --> B[sitemap-blog.xml<br/>3K URLs]
  I --> C[sitemap-locations.xml<br/>12K URLs]
  I --> D[sitemap-news.xml<br/>last 48h]
  I --> E[sitemap-images.xml<br/>20K image entries]
  A --> G[Googlebot]
  A --> N[Bingbot]
  B --> G
  B --> N
  D --> G2[Google News]
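
The first edge in the diagram is a single robots.txt line. A minimal sketch of emitting it at build time, assuming an allow-all crawl policy; `renderRobotsTxt` is a hypothetical helper, and the Sitemap: line applies regardless of which User-agent group it sits near:

```typescript
// Build a robots.txt body with one Sitemap: line per sitemap or sitemap index.
function renderRobotsTxt(sitemapUrls: string[]): string {
  const lines: string[] = ["User-agent: *", "Allow: /", ""];
  for (const url of sitemapUrls) {
    // Sitemap: directives take absolute URLs, so reject relative paths early.
    if (!/^https?:\/\//.test(url)) {
      throw new Error(`Sitemap URL must be absolute: ${url}`);
    }
    lines.push(`Sitemap: ${url}`);
  }
  return lines.join("\n") + "\n";
}

const robotsTxt = renderRobotsTxt(["https://example.com/sitemap-index.xml"]);
```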

Bad vs. expert

The bad approach

The bad sitemap is generated once at build, has no <lastmod>, lists URLs the site noindexes, and lives at a path nobody declared:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/post-1</loc>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/admin/login</loc>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>https://example.com/blog/post-1?utm_source=newsletter</loc>
    <priority>0.5</priority>
  </url>
</urlset>

Three failures: <lastmod> is missing, so Google has no freshness signal; /admin/login is in the sitemap but presumably noindexed or blocked, which sends a contradictory signal; and the UTM-tagged variant duplicates the canonical URL. The <changefreq> and <priority> values add nothing. GSC’s Page indexing report surfaces submitted URLs that are noindexed, blocked, or non-canonical, and crawl efficiency drops.
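
Two of these failures can be caught by a filter that runs before the sitemap is rendered. A minimal sketch, assuming each candidate carries a `noindex` flag from your CMS and that stripping tracking parameters yields the canonical URL on your site; `Candidate`, `sitemapEligible`, and the parameter list are illustrative, not exhaustive:

```typescript
interface Candidate {
  url: string;
  noindex: boolean; // assumed to come from the CMS or page metadata
}

// Common tracking parameters; extend for your own analytics setup.
function isTrackingParam(name: string): boolean {
  return name.startsWith("utm_") || name === "gclid" || name === "fbclid";
}

// Drop noindexed and blocked URLs, strip tracking parameters, and de-duplicate.
function sitemapEligible(candidates: Candidate[], blockedPrefixes: string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const c of candidates) {
    if (c.noindex) continue;
    const u = new URL(c.url);
    if (blockedPrefixes.some((p) => u.pathname.startsWith(p))) continue;
    const toDelete: string[] = [];
    u.searchParams.forEach((_value, name) => {
      if (isTrackingParam(name)) toDelete.push(name);
    });
    for (const name of toDelete) u.searchParams.delete(name);
    u.hash = "";
    const clean = u.toString();
    if (!seen.has(clean)) {
      seen.add(clean);
      result.push(clean);
    }
  }
  return result;
}

// The three URLs from the bad sitemap collapse to a single canonical entry.
const cleaned = sitemapEligible(
  [
    { url: "https://example.com/blog/post-1", noindex: false },
    { url: "https://example.com/admin/login", noindex: false },
    { url: "https://example.com/blog/post-1?utm_source=newsletter", noindex: false },
  ],
  ["/admin/"],
);
```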

The expert approach

A sitemap built on real data, served dynamically, with honest <lastmod>. Astro endpoint pattern:

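// src/pages/sitemap-blog.xml.ts
import type { APIRoute } from 'astro';
import { getCollection } from 'astro:content';

// Escape XML special characters so a slug containing "&" cannot break the document.
const escapeXml = (s: string) =>
  s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;').replace(/'/g, '&apos;');

export const GET: APIRoute = async ({ site }) => {
  // `site` is undefined unless the `site` option is set in the Astro config.
  if (!site) throw new Error('Set `site` in the Astro config to build absolute sitemap URLs.');

  // Published posts only; drafts never belong in a sitemap.
  const posts = await getCollection('blog', ({ data }) => !data.draft);

  const urls = posts.map((p) => ({
    loc: new URL(`/blog/${p.slug}/`, site).toString(),
    // updatedAt is assumed to be a Date from the collection schema, not the build time.
    lastmod: p.data.updatedAt.toISOString(),
  }));

  const xml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
${urls.map((u) => `  <url>
    <loc>${escapeXml(u.loc)}</loc>
    <lastmod>${u.lastmod}</lastmod>
  </url>`).join('\n')}
</urlset>`;

  return new Response(xml, {
    headers: { 'Content-Type': 'application/xml; charset=utf-8' },
  });
};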

Sitemap index for a 200K-URL site:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products-1.xml</loc>
    <lastmod>2026-05-07T08:14:21Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products-2.xml</loc>
    <lastmod>2026-05-07T08:14:21Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2026-05-06T19:02:58Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-news.xml</loc>
    <lastmod>2026-05-07T07:30:00Z</lastmod>
  </sitemap>
</sitemapindex>
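
An index like this is easier to keep honest when it is generated rather than hand-edited. A minimal sketch, assuming each child’s <lastmod> can be approximated by the newest <lastmod> among the URLs it contains (the protocol defines it as the child file’s modification time); `ChildSitemap` and `renderSitemapIndex` are hypothetical names:

```typescript
interface ChildSitemap {
  loc: string;           // absolute URL of the child sitemap file
  urlLastmods: string[]; // ISO 8601 lastmod values of the URLs inside it
}

function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Newest parseable timestamp in the list, or undefined if there is none.
function latestLastmod(isoDates: string[]): string | undefined {
  let maxMs = Number.NEGATIVE_INFINITY;
  for (const d of isoDates) {
    const ms = Date.parse(d);
    if (ms > maxMs) maxMs = ms; // NaN never compares greater, so bad dates are skipped
  }
  return Number.isFinite(maxMs) ? new Date(maxMs).toISOString() : undefined;
}

function renderSitemapIndex(children: ChildSitemap[]): string {
  const entries = children.map((c) => {
    const lastmod = latestLastmod(c.urlLastmods);
    // Omit <lastmod> entirely rather than invent one.
    const lastmodLine = lastmod !== undefined ? `\n    <lastmod>${lastmod}</lastmod>` : "";
    return `  <sitemap>\n    <loc>${escapeXml(c.loc)}</loc>${lastmodLine}\n  </sitemap>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</sitemapindex>",
  ].join("\n");
}

const indexXml = renderSitemapIndex([
  {
    loc: "https://example.com/sitemap-products-1.xml",
    urlLastmods: ["2026-05-01T10:00:00Z", "2026-05-07T08:14:21Z"],
  },
  { loc: "https://example.com/sitemap-blog.xml", urlLastmods: ["2026-05-06T19:02:58Z"] },
]);
```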

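hreflang in the sitemap (avoids per-page tag maintenance). The enclosing <urlset> must declare xmlns:xhtml="http://www.w3.org/1999/xhtml", and each locale variant needs its own <url> entry carrying the same set of links; one entry is shown: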

<url>
  <loc>https://example.com/en-us/products/widget</loc>
  <lastmod>2026-04-22T12:00:00Z</lastmod>
  <xhtml:link rel="alternate" hreflang="en-us" href="https://example.com/en-us/products/widget"/>
  <xhtml:link rel="alternate" hreflang="en-gb" href="https://example.com/en-gb/products/widget"/>
  <xhtml:link rel="alternate" hreflang="de-de" href="https://example.com/de-de/products/widget"/>
  <xhtml:link rel="alternate" hreflang="x-default" href="https://example.com/en-us/products/widget"/>
</url>

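News sitemap entry (only articles published in the last 48 hours; the enclosing <urlset> must declare xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"):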

<url>
  <loc>https://example.com/news/2026/05/07/markets-rally</loc>
  <news:news>
    <news:publication>
      <news:name>Example News</news:name>
      <news:language>en</news:language>
    </news:publication>
    <news:publication_date>2026-05-07T07:30:00Z</news:publication_date>
    <news:title>Markets Rally on Soft Inflation Data</news:title>
  </news:news>
</url>
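
The 48-hour window is easiest to enforce in the generator itself. A minimal sketch, assuming articles carry a `publishedAt` Date; `Article` and `recentArticles` are hypothetical names, and future-dated articles are excluded as a conservative choice:

```typescript
interface Article {
  url: string;
  title: string;
  publishedAt: Date;
}

const NEWS_WINDOW_MS = 48 * 60 * 60 * 1000;

// Keep only articles published within the 48 hours before `now`.
function recentArticles(articles: Article[], now: Date): Article[] {
  return articles.filter((a) => {
    const ageMs = now.getTime() - a.publishedAt.getTime();
    return ageMs >= 0 && ageMs <= NEWS_WINDOW_MS;
  });
}

const newsNow = new Date("2026-05-07T12:00:00Z");
const newsEntries = recentArticles(
  [
    {
      url: "https://example.com/news/2026/05/07/markets-rally",
      title: "Markets Rally on Soft Inflation Data",
      publishedAt: new Date("2026-05-07T07:30:00Z"), // 4.5 hours old: kept
    },
    {
      url: "https://example.com/news/2026/05/05/older-story",
      title: "Older Story",
      publishedAt: new Date("2026-05-05T11:00:00Z"), // 49 hours old: dropped
    },
  ],
  newsNow,
);
```

Articles that age out of the window leave the news sitemap but should stay in the regular sitemap.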

Image sitemap inline within the standard sitemap (the enclosing <urlset> must declare xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"; Google no longer reads <image:caption> or <image:title>, so <image:loc> is the only tag that matters):

<url>
  <loc>https://example.com/blog/canonicals</loc>
  <lastmod>2026-04-12T14:33:00Z</lastmod>
  <image:image>
    <image:loc>https://example.com/img/canonical-flow.png</image:loc>
  </image:image>
</url>

Do this today

  1. Verify your sitemap loads at the expected URL. Open https://yourdomain.com/sitemap.xml (or /sitemap-index.xml) and confirm a 200 OK with Content-Type: application/xml (text/xml also works).
  2. Validate XML structure at xml-sitemaps.com/validate-xml-sitemap.html or via xmllint --schema sitemap.xsd sitemap.xml. Catch malformed entries before submission.
  3. Audit <lastmod> honesty. Pick 10 URLs at random from your sitemap. For each, compare <lastmod> to the actual database updatedAt and to the page’s last visible content change. Discrepancies > 30 days mean your sitemap is lying to Google.
  4. In Google Search Console > Indexing > Sitemaps, submit the sitemap (or sitemap index) URL. Watch the Status column. Discovered pages is the number of URLs Google parsed from the file; indexation is reported separately under Pages.
  5. In Bing Webmaster Tools > Sitemaps, add the same sitemap. Bing’s index feeds Copilot and is one of the sources behind ChatGPT Search, so this doubles as your AI search submission.
  6. Add a Sitemap: directive at the bottom of your robots.txt for every sitemap or sitemap index. This is how PerplexityBot, OAI-SearchBot, and other AI crawlers discover your URL list.
  7. For sites > 10,000 URLs, split into a sitemap index by content type so you can track indexation per slice. In GSC, the Page indexing report can be filtered by submitted sitemap, so a regression in sitemap-products.xml is far easier to diagnose than a regression in one giant file.
  8. If you publish news, add a separate sitemap-news.xml with only the last 48 hours of articles per Google News spec. Submit it in Search Console like any other sitemap.
  9. For multi-locale sites, choose one hreflang mechanism (sitemap, page tags, or HTTP headers) and remove the others. GSC’s International Targeting report was retired in 2022, so use a crawler or hreflang validator to confirm that every locale URL lists its full set of alternates and that the annotations are reciprocal.
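  10. Schedule a weekly job that regenerates the sitemap and submits changed URLs to Bing through its batch URL submission API (https://www.bing.com/webmaster/api.svc/json/SubmitUrlBatch). This is an authenticated POST that needs your Bing Webmaster API key, not a bare ping. Google retired its sitemap ping endpoint in 2023; an accurate <lastmod> does the work.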
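
The <lastmod> audit in step 3 is simple enough to script. A minimal sketch, assuming you have already sampled the rows and paired each sitemap <lastmod> with the database updatedAt; `AuditRow` and `auditLastmod` are hypothetical names, and the 30-day threshold comes from the step above:

```typescript
interface AuditRow {
  url: string;
  sitemapLastmod: string; // value from the sitemap, ISO 8601
  updatedAt: string;      // value from the database, ISO 8601
}

interface AuditFinding {
  url: string;
  driftDays: number;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Return the rows whose sitemap date and database date disagree by more than maxDriftDays.
function auditLastmod(rows: AuditRow[], maxDriftDays: number = 30): AuditFinding[] {
  return rows
    .map((r) => ({
      url: r.url,
      driftDays: Math.abs(Date.parse(r.sitemapLastmod) - Date.parse(r.updatedAt)) / MS_PER_DAY,
    }))
    .filter((f) => f.driftDays > maxDriftDays);
}

const findings = auditLastmod([
  {
    url: "https://example.com/blog/canonicals",
    sitemapLastmod: "2026-04-12T14:33:00Z",
    updatedAt: "2026-04-12T14:33:00Z", // matches: not reported
  },
  {
    url: "https://example.com/blog/post-1",
    sitemapLastmod: "2026-05-07T08:14:21Z",
    updatedAt: "2024-05-01T09:00:00Z", // about two years of drift: reported
  },
]);
```

This checks only the database side. Comparing against the last visible content change still needs a manual look or a stored content hash.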
