XML Sitemaps
Structure, sitemap indexes for large sites, image/video/news/hreflang variants, lastmod hygiene, and submission to Google and Bing.
An XML sitemap is the URL list you hand the librarian: “these are the books I want in the catalog.” It does not guarantee indexation, but for any site over a few hundred pages it materially improves discovery, helps Google distinguish freshly updated pages from stale ones, and gives crawlers one machine-readable place to find <lastmod>, hreflang, image, video, and news context.
TL;DR
- <lastmod> is the most important element you under-use. Google and Bing both confirmed in 2023–2024 that they trust honest, accurate <lastmod> dates and treat fabricated ones as a spam signal. Set it from your database updatedAt field, not Date.now() at build.
- Use a sitemap index for any site over ~10,000 URLs. The 50,000-URL / 50 MB uncompressed limits per sitemap file are hard. Split by content type or section so you can monitor indexation per slice.
- Submit to Google and Bing both. Bing’s sitemap is what populates ChatGPT Search and Copilot. Add the sitemap URL as a Sitemap: directive in robots.txt so AI crawlers like PerplexityBot and OAI-SearchBot can discover it.
The mental model
A sitemap is a curator’s wall label, not a backstage pass. It tells the museum which artifacts you consider noteworthy and when they were last updated, but the museum still decides which to display. A sitemap with 100,000 URLs and an index that holds only 30,000 is normal — the curator is being selective, and that is healthy.
The sitemap is also your freshness oracle. When a URL’s <lastmod> advances, Googlebot prioritizes recrawling that URL. When <lastmod> is the same as last visit, the bot can skip — saving crawl budget for actual changes. Fabricated dates poison this oracle.
Deep dive: the 2026 reality
Sitemap formats per the sitemaps.org 0.9 standard plus Google’s extensions:
| Variant | Use | Required for |
|---|---|---|
| Standard sitemap | URL list with <loc>, <lastmod> | Every site |
| Sitemap index | Sitemap of sitemaps | >10K URLs or multi-site |
| Image sitemap | Image URLs per page | Visual-heavy sites |
| Video sitemap | Video metadata per page | Video publishers |
| News sitemap | Articles published in last 48h | Google News inclusion |
| hreflang annotations | Language/region variants | Multi-locale sites |
Hard limits: each sitemap file is capped at 50,000 URLs or 50 MB uncompressed. A sitemap index can reference up to 50,000 child sitemaps, giving you a theoretical ceiling of 2.5 billion URLs per index. In practice, you split by section for readability — sitemap-products.xml, sitemap-blog.xml, sitemap-locations.xml.
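The split described above can be sketched in a few lines. This is a minimal illustration, assuming the URLs for one section are already in memory; the helper names `splitSection` and `renderIndex` are hypothetical, and the sketch checks only the 50,000-URL cap, not the 50 MB size cap.

```typescript
// Per-file URL cap from the sitemaps.org protocol.
const MAX_URLS_PER_SITEMAP = 50_000;

interface SitemapFile {
  name: string; // e.g. "sitemap-products-1.xml" (naming scheme is an assumption)
  urls: string[];
}

// Split one section's URLs into files that each stay within the per-file cap.
function splitSection(
  section: string,
  urls: string[],
  limit: number = MAX_URLS_PER_SITEMAP,
): SitemapFile[] {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const files: SitemapFile[] = [];
  for (let i = 0; i < urls.length; i += limit) {
    const part = files.length + 1;
    files.push({
      name: `sitemap-${section}-${part}.xml`,
      urls: urls.slice(i, i + limit),
    });
  }
  return files;
}

// Render a sitemap index that points at every child file.
function renderIndex(baseUrl: string, files: SitemapFile[]): string {
  const entries = files
    .map((f) => `  <sitemap>\n    <loc>${new URL(f.name, baseUrl).toString()}</loc>\n  </sitemap>`)
    .join("\n");
  return (
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    `${entries}\n</sitemapindex>`
  );
}
```

Calling `splitSection("products", urls)` on 120,001 URLs yields three files, which `renderIndex("https://example.com", files)` then lists in one index. A production version would also add an honest `<lastmod>` per child file.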
Google's sitemap documentation states that it ignores the <priority> and <changefreq> elements. Bing and Yandex still parse them but weight <lastmod> far more heavily. Stop including them.
<lastmod> hygiene is now the actual technical lever. The November 2023 Search Off The Record episode and Gary Illyes’s 2024 SMX talk made it explicit: Google ignores <lastmod> it cannot trust, and trust is built by consistency. If your <lastmod> says 2026-05-07 but the page’s content has not changed in two years, Google learns to ignore your dates entirely.
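One way to check consistency is to compare what the sitemap claims against what your content store records. The sketch below assumes you can load sitemap entries and a map of URL to stored modification time; `findDishonestLastmod` is a hypothetical helper, and the 30-day tolerance is an illustrative default, not a threshold published by Google.

```typescript
interface SitemapEntry {
  loc: string;
  lastmod: string; // W3C datetime string as it appears in the sitemap
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Return entries whose claimed lastmod drifts from the stored modification
// time by more than maxDriftDays, plus entries whose date cannot be parsed.
function findDishonestLastmod(
  entries: SitemapEntry[],
  updatedAtByUrl: Map<string, Date>,
  maxDriftDays: number = 30,
): SitemapEntry[] {
  return entries.filter((entry) => {
    const actual = updatedAtByUrl.get(entry.loc);
    if (!actual) return false; // nothing to compare against, so skip
    const claimed = Date.parse(entry.lastmod);
    if (Number.isNaN(claimed)) return true; // unparseable dates are flagged
    return Math.abs(claimed - actual.getTime()) > maxDriftDays * DAY_MS;
  });
}
```

A page last edited two years ago but stamped with today's date would be returned by this check; a page stamped within a few days of its real edit would not.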
hreflang annotations belong either in the sitemap or in <link rel="alternate" hreflang> tags or in HTTP headers — pick one and be consistent. Mixing produces ambiguity. The sitemap pattern is preferred for sites with > 100 locale-pair combinations, because one XML edit beats updating every page.
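If you choose the sitemap pattern, the annotations can be generated from one list of locale variants per page. The following is a rough sketch, assuming each page's variants are known up front; `renderHreflangUrls` and `escapeXml` are hypothetical helpers. It emits one `<url>` entry per distinct page URL, each listing every variant including itself, which is how Google's documentation describes sitemap-based hreflang.

```typescript
interface LocaleVariant {
  hreflang: string; // e.g. "en-us", "de-de", or "x-default"
  href: string;
}

// Escape the five XML special characters so URLs with "&" stay well-formed.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build <url> entries in which every variant lists all variants, itself included.
function renderHreflangUrls(variants: LocaleVariant[]): string {
  const links = variants
    .map(
      (v) =>
        `    <xhtml:link rel="alternate" hreflang="${escapeXml(v.hreflang)}" href="${escapeXml(v.href)}"/>`,
    )
    .join("\n");
  // x-default usually shares an href with a real locale, so emit each page once.
  const pages = Array.from(new Set(variants.map((v) => v.href)));
  return pages
    .map((href) => `  <url>\n    <loc>${escapeXml(href)}</loc>\n${links}\n  </url>`)
    .join("\n");
}
```

The output belongs inside a `<urlset>` that declares `xmlns:xhtml="http://www.w3.org/1999/xhtml"`.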
Visualizing it
flowchart TD
R["robots.txt: Sitemap: https://example.com/sitemap-index.xml"] --> I[sitemap-index.xml]
I --> A[sitemap-products.xml<br/>40K URLs]
I --> B[sitemap-blog.xml<br/>3K URLs]
I --> C[sitemap-locations.xml<br/>12K URLs]
I --> D[sitemap-news.xml<br/>last 48h]
I --> E[sitemap-images.xml<br/>20K image entries]
A --> G[Googlebot]
A --> N[Bingbot]
B --> G
B --> N
D --> G2[Google News]
Bad vs. expert
The bad approach
The bad sitemap is generated once at build, has no <lastmod>, lists URLs the site noindexes, and lives at a path nobody declared:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/post-1</loc>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/admin/login</loc>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>https://example.com/blog/post-1?utm_source=newsletter</loc>
    <priority>0.5</priority>
  </url>
</urlset>
Three failures: <lastmod> is missing so Google has no freshness signal; /admin/login is in the sitemap but presumably noindex/blocked, sending a contradictory signal; the UTM-tagged variant is a duplicate of the canonical. GSC's Page indexing report flags entries like these (for example, “Submitted URL marked ‘noindex’”), and crawl efficiency drops.
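A simple guard against the last two failures is to filter candidate pages before they reach the sitemap. This sketch assumes your CMS or crawler can report each page's noindex state and declared canonical; the `PageRecord` shape and `selectSitemapUrls` name are hypothetical.

```typescript
interface PageRecord {
  url: string;
  noindex: boolean; // page carries a noindex directive
  canonical: string; // canonical URL the page declares
}

// Keep only pages that are indexable and self-canonical, without duplicates.
function selectSitemapUrls(pages: PageRecord[]): string[] {
  const selected = new Set<string>();
  for (const page of pages) {
    if (page.noindex) continue; // listing it would contradict the noindex
    const self = new URL(page.url).href;
    if (new URL(page.canonical).href !== self) continue; // duplicate of another URL
    selected.add(self);
  }
  return Array.from(selected);
}
```

Run against the three URLs in the bad sitemap, with /admin/login marked noindex and the UTM variant canonicalized to the clean post URL, only https://example.com/blog/post-1 survives.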
The expert approach
A sitemap built on real data, served dynamically, with honest <lastmod>. Astro endpoint pattern:
// src/pages/sitemap-blog.xml.ts
import type { APIRoute } from 'astro';
import { getCollection } from 'astro:content';

// Serves the blog sitemap on request; <lastmod> comes from each post's updatedAt field.
export const GET: APIRoute = async ({ site }) => {
  // `site` is undefined unless it is set in astro.config, and new URL() would then throw.
  if (!site) {
    return new Response('Set `site` in astro.config to build absolute URLs', { status: 500 });
  }
  // Exclude drafts so the sitemap lists only published posts.
  const posts = await getCollection('blog', ({ data }) => !data.draft);
  const urls = posts.map((p) => ({
    loc: new URL(`/blog/${p.slug}/`, site).toString(),
    lastmod: p.data.updatedAt.toISOString(),
  }));
  const xml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
${urls.map((u) => `  <url>
    <loc>${u.loc}</loc>
    <lastmod>${u.lastmod}</lastmod>
  </url>`).join('\n')}
</urlset>`;
  return new Response(xml, {
    headers: { 'Content-Type': 'application/xml; charset=utf-8' },
  });
};
Sitemap index for a 200K-URL site:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products-1.xml</loc>
    <lastmod>2026-05-07T08:14:21Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products-2.xml</loc>
    <lastmod>2026-05-07T08:14:21Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2026-05-06T19:02:58Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-news.xml</loc>
    <lastmod>2026-05-07T07:30:00Z</lastmod>
  </sitemap>
</sitemapindex>
hreflang in the sitemap (avoids per-page tag maintenance):
<url>
  <loc>https://example.com/en-us/products/widget</loc>
  <lastmod>2026-04-22T12:00:00Z</lastmod>
  <xhtml:link rel="alternate" hreflang="en-us" href="https://example.com/en-us/products/widget"/>
  <xhtml:link rel="alternate" hreflang="en-gb" href="https://example.com/en-gb/products/widget"/>
  <xhtml:link rel="alternate" hreflang="de-de" href="https://example.com/de-de/products/widget"/>
  <xhtml:link rel="alternate" hreflang="x-default" href="https://example.com/en-us/products/widget"/>
</url>
News sitemap entry (only articles published in last 48 hours):
<url>
  <loc>https://example.com/news/2026/05/07/markets-rally</loc>
  <news:news>
    <news:publication>
      <news:name>Example News</news:name>
      <news:language>en</news:language>
    </news:publication>
    <news:publication_date>2026-05-07T07:30:00Z</news:publication_date>
    <news:title>Markets Rally on Soft Inflation Data</news:title>
  </news:news>
</url>
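Keeping the news sitemap inside the 48-hour window is easiest to enforce at generation time. A minimal sketch, assuming articles carry a `publishedAt` date; the `Article` shape and `recentArticles` name are hypothetical.

```typescript
interface Article {
  loc: string;
  title: string;
  publishedAt: Date;
}

const NEWS_WINDOW_MS = 48 * 60 * 60 * 1000;

// Keep only articles published within the last 48 hours, newest first.
// Articles dated in the future are excluded as well.
function recentArticles(articles: Article[], now: Date = new Date()): Article[] {
  const latest = now.getTime();
  const cutoff = latest - NEWS_WINDOW_MS;
  return articles
    .filter((a) => {
      const t = a.publishedAt.getTime();
      return t >= cutoff && t <= latest;
    })
    .sort((a, b) => b.publishedAt.getTime() - a.publishedAt.getTime());
}
```

Passing `now` explicitly keeps the function deterministic in tests; the sitemap generator would call it with the default.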
Image sitemap inline within the standard sitemap (Google deprecated <image:caption>, <image:title>, <image:geo_location>, and <image:license> in 2022, so <image:loc> is the only child tag you need):
<url>
  <loc>https://example.com/blog/canonicals</loc>
  <lastmod>2026-04-12T14:33:00Z</lastmod>
  <image:image>
    <image:loc>https://example.com/img/canonical-flow.png</image:loc>
  </image:image>
</url>
Do this today
- Verify your sitemap loads at the expected URL. Open https://yourdomain.com/sitemap.xml (or /sitemap-index.xml) and confirm a 200 OK with Content-Type: application/xml.
- Validate XML structure at xml-sitemaps.com/validate-xml-sitemap.html or via xmllint --schema sitemap.xsd sitemap.xml. Catch malformed entries before submission.
- Audit <lastmod> honesty. Pick 10 URLs at random from your sitemap. For each, compare <lastmod> to the actual database updatedAt and to the page’s last visible content change. Discrepancies > 30 days mean your sitemap is lying to Google.
- In Google Search Console > Indexing > Sitemaps, submit the sitemap (or sitemap index) URL. Watch the Status column. Read count = URLs Google parsed; Indexed is reported separately under Pages.
- In Bing Webmaster Tools > Sitemaps, add the same sitemap. Bing’s index feeds ChatGPT Search and Copilot, so this is your AI search submission.
- Add a Sitemap: directive at the bottom of your robots.txt for every sitemap or sitemap index. This is how PerplexityBot, OAI-SearchBot, and other AI crawlers discover your URL list.
- For sites > 10,000 URLs, split into a sitemap index by content type so you can track indexation per slice. In GSC, each child sitemap shows its own indexation rate, and a regression in sitemap-products.xml is far easier to diagnose than a regression in one giant file.
- If you publish news, add a separate sitemap-news.xml with only the last 48 hours of articles per Google News spec. Submit it in Google Search Console alongside your other sitemaps.
- For multi-locale sites, choose one hreflang mechanism (sitemap, page tags, or HTTP headers) and remove the others. GSC's International Targeting report has been retired, so use a crawler to confirm that every hreflang annotation has a return link and that the annotation count matches your URL count.
- Schedule a weekly job that regenerates the sitemap and pings https://www.bing.com/webmaster/api.svc/json/SubmitUrlBatch (Bing’s batch URL submission API). Google does not need ping-style submission anymore; <lastmod> does the work.
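The robots.txt step in the checklist above is easy to automate as part of the same weekly job. The sketch below is one possible approach, assuming you hold the current robots.txt contents as a string; `withSitemapDirectives` is a hypothetical helper that appends any missing Sitemap: lines and leaves existing ones alone.

```typescript
// Append a "Sitemap:" line for each URL that robots.txt does not already declare.
function withSitemapDirectives(robotsTxt: string, sitemapUrls: string[]): string {
  // Collect URLs already declared (the directive name is case-insensitive).
  const existing = new Set(
    robotsTxt
      .split(/\r?\n/)
      .map((line) => /^\s*sitemap:\s*(\S+)/i.exec(line)?.[1])
      .filter((url): url is string => url !== undefined),
  );
  const missing = Array.from(new Set(sitemapUrls)).filter((url) => !existing.has(url));
  if (missing.length === 0) return robotsTxt; // nothing to add, so stay idempotent

  const body = robotsTxt.replace(/\s+$/, "");
  const lines = missing.map((url) => `Sitemap: ${url}`).join("\n");
  return `${body}${body ? "\n\n" : ""}${lines}\n`;
}
```

Running it twice with the same inputs returns the same file, so it is safe to call on every regeneration.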