Keyword Validation & Clustering
Validating volume, intent classification at scale, manual and SERP-overlap clustering, topic vs keyword clusters, and the master spreadsheet that drives the editorial roadmap.
A list of 2,000 keywords is not a strategy. It is a junk drawer. Validation strips out the noise — the volume estimates that lie, the intent classifications that miss the SERP reality. Clustering compresses the list into the actual page count you need to ship. Done right, you turn 2,000 keywords into 80 clusters into 80 pages, each targeting a real demand surface, each defensible from the others.
TL;DR
- Volume claims are unreliable until you triangulate. A single tool’s number is a guess. Cross-check against GSC, autosuggest presence, and SERP feature density before you trust it.
- Cluster by SERP overlap, not by string similarity. Two keywords belong on the same page if Google ranks the same URLs for both. String-based clustering (“running shoes” vs “running shoe” same page; “running shoes” vs “marathon shoes” different page) is a 2018 method.
- The master spreadsheet runs everything downstream. Keyword → cluster → target URL → intent → priority → assigned writer → publish date. One sheet, versioned, source-of-truth.
The mental model
Keyword clustering is like editing a chaotic recipe collection into a cookbook. The collection has 2,000 recipes. Some are duplicates with different titles (“chocolate chip cookies” and “classic chocolate chip”). Some are family-resemblance variants (“chewy” vs “crispy” chocolate chip cookies — different chapter). Some look unrelated but actually belong together (a brownie recipe and a cookie recipe share a chapter on quick desserts). The cookbook author has to decide: same chapter, separate chapter, or cut?
The decision rule is: what does the reader want when they arrive? If a reader wants chewy chocolate chip cookies, the crispy recipe is a different page. If a reader wants any chocolate chip cookie, both belong together. The reader’s intent is the deciding factor, not the keyword strings.
Google has the same logic. The SERP is Google’s answer to “what do users want when they search this?” Two keywords with overlapping top-10 URLs are queries Google believes deserve the same answer page. That overlap is the cluster signal — not Levenshtein distance, not stemming, not synonym dictionaries.
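The overlap signal is trivial to compute once you have both top-10 lists. A minimal sketch, using made-up URLs and the cookbook example above (the domains and a 3-URL threshold are illustrative, not real ranking data):

```python
# Sketch: the cluster signal is shared top-10 URLs, not string similarity.
# The two SERPs below are invented illustrations, not real ranking data.
def serp_overlap(urls_a: list[str], urls_b: list[str]) -> int:
    """Count URLs that appear in both top-10 lists."""
    return len(set(urls_a) & set(urls_b))

serp_chewy = ["a.com/chewy", "b.com/cookies", "c.com/best-cookies"]
serp_classic = ["b.com/cookies", "c.com/best-cookies", "d.com/recipe"]

shared = serp_overlap(serp_chewy, serp_classic)
print(shared)  # 2 shared URLs -> below a 3-URL merge threshold, keep separate pages
```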
Deep dive: the 2026 reality
Validation has three steps.
Step 1 — Volume validation. Pull the same keyword from Ahrefs, Semrush, and GKP (or any three sources). Compare. If the estimates disagree by more than 5x, at least one is an estimation artifact. Rules of thumb:
- All three within 2x of each other → trust the average
- One source 5x+ higher than the others → that source is overestimating
- All three under 100 → effectively long-tail; treat as binary (worth pursuing or not)
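The rules of thumb above can be sketched as a single function. The thresholds (2x, 5x, under-100) come straight from the text; the function name and return shape are our own:

```python
# Sketch of the three-source volume triangulation rules; thresholds (2x, 5x,
# 100) come from the text, the function itself is an illustrative assumption.
def triangulate_volume(ahrefs: int, semrush: int, gkp: int) -> tuple[str, float]:
    """Return (verdict, working_volume) from three tool estimates."""
    vols = [ahrefs, semrush, gkp]
    lo, hi = min(vols), max(vols)
    if all(v < 100 for v in vols):
        # Effectively long-tail: treat as a binary pursue/skip decision
        return ("long-tail: treat as binary", lo)
    if hi >= 5 * max(lo, 1):
        # One source 5x+ above the rest: treat its number as an overestimate
        return ("outlier: drop the high source", sorted(vols)[1])
    if hi <= 2 * max(lo, 1):
        return ("agreement: trust the average", sum(vols) / 3)
    return ("mixed: investigate manually", sum(vols) / 3)

print(triangulate_volume(1200, 1000, 900))  # -> agreement verdict
print(triangulate_volume(9000, 1100, 900))  # -> outlier verdict
```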
Step 2 — Intent validation. Open the SERP. Read the top 5. Confirm the page format you plan to build matches the dominant format. Use the 30-second intent test from Module 8.
Step 3 — Click-through validation. Use Ahrefs’ Clicks-per-Search column. If it is below 0.4, the query is a zero-click query — most users get the answer in the SERP and never click. Build for it only if the brand mention itself is valuable (typical for top-of-funnel awareness).
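Applied to a keyword table, the CPS cut plus the brand-mention exception looks like this. Column names mirror the master-spreadsheet schema later in this module; the rows and the `brand_value` flag are invented for illustration:

```python
# Sketch: applying the CPS >= 0.4 cut from step 3 to a keyword table.
# The sample rows and the brand_value column are invented assumptions.
import pandas as pd

df = pd.DataFrame({
    "keyword": ["how tall is everest", "best trail shoes", "define serp"],
    "clicks_per_search": [0.12, 0.81, 0.25],
    "brand_value": [False, False, True],  # zero-click but worth it for awareness
})

# Keep clicky queries, plus zero-click queries justified by brand mentions
keep = df[(df["clicks_per_search"] >= 0.4) | df["brand_value"]]
print(keep["keyword"].tolist())  # ['best trail shoes', 'define serp']
```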
Clustering has two methods.
Method A — Manual clustering for under ~200 keywords. Open the spreadsheet, sort by parent topic (Ahrefs supplies this in Keywords Explorer), eyeball groups, and assign cluster IDs.
Method B — SERP-overlap clustering for 200+. The algorithm:
- Get the top 10 URLs for each keyword (Ahrefs API, Semrush API, or scraping with DataForSEO).
- For each pair of keywords, compute the count of URLs in common.
- If two keywords share ≥3 URLs in the top 10, they belong on the same page.
- Form clusters via union-find / connected components.
- Pick a head keyword per cluster (highest-volume member that satisfies fit + intent).
Tools that automate this: Keyword Insights (keywordinsights.ai), Surfer SEO Cluster, and CTR-Driven by SISTRIX. Cheaper DIY option: write your own against the DataForSEO SERP API ($0.0006 per query).
Topic clusters vs keyword clusters.
| Concept | Definition | Output |
|---|---|---|
| Keyword cluster | A group of queries one page can target | One page per cluster |
| Topic cluster | A pillar page + supporting pages on related sub-topics | One pillar + 5-15 supporting pages |
Most “topic cluster” content marketing advice from 2018-2022 conflated the two. In practice you build keyword clusters first, then group multiple keyword clusters into topic clusters for internal-linking architecture.
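The "keyword clusters first, then topic clusters" order can be sketched as a simple group-by. We assume each cluster row already carries a parent-topic label (Ahrefs-style); the cluster IDs and topics below are invented:

```python
# Sketch: keyword clusters first, then topic clusters on top of them.
# Assumes a parent_topic column is already on each cluster row; data invented.
import pandas as pd

clusters = pd.DataFrame({
    "cluster_id": ["c1", "c2", "c3", "c4"],
    "head_keyword": ["best crm", "crm pricing", "crm migration", "email tools"],
    "parent_topic": ["crm", "crm", "crm", "email"],
})

# One pillar per parent topic; its keyword clusters become the supporting pages
topic_clusters = clusters.groupby("parent_topic")["cluster_id"].apply(list)
print(topic_clusters.to_dict())  # {'crm': ['c1', 'c2', 'c3'], 'email': ['c4']}
```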
The master spreadsheet structure that runs the editorial process:
| Column | Purpose |
|---|---|
| `keyword` | Raw query string |
| `keyword_norm` | Lowercased, trimmed, accent-stripped |
| `cluster_id` | UUID or slug shared with cluster siblings |
| `is_head` | Boolean — is this the cluster’s primary keyword |
| `target_url` | Slug or full URL the page will live at |
| `volume_ahrefs` | Ahrefs estimate |
| `volume_semrush` | Semrush estimate |
| `volume_gkp` | GKP estimate |
| `traffic_potential` | Ahrefs Traffic Potential |
| `clicks_per_search` | Ahrefs CPS |
| `kd` | Keyword difficulty |
| `intent` | Informational / Commercial / Transactional / Navigational |
| `micro_intent` | Definitional / Comparison / Troubleshooting / etc. |
| `serp_features` | AI Overview / PAA / Local pack / Shopping |
| `funnel_stage` | TOFU / MOFU / BOFU |
| `priority` | P0 / P1 / P2 / P3 |
| `status` | Idea / Drafted / Reviewed / Published / Updated |
| `owner` | Writer assigned |
| `publish_date` | Target date |
| `published_url` | Once shipped |
| `notes` | Anything tactical |
This sheet is the single source of truth. Everything downstream — content briefs, internal linking, monthly review, analytics tags — references it.
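Two sanity checks worth running on the sheet before handoff: the `keyword_norm` transform from the schema (lowercased, trimmed, accent-stripped), and a guard that every cluster has exactly one head keyword. A minimal sketch with invented rows; the check logic is our suggestion, not part of the module's spec:

```python
# Sketch: sanity checks on the master sheet. Column names follow the table
# above; the sample rows and the checks themselves are illustrative.
import unicodedata
import pandas as pd

def normalize(kw: str) -> str:
    """keyword_norm: lowercased, trimmed, accent-stripped."""
    stripped = unicodedata.normalize("NFKD", kw).encode("ascii", "ignore").decode()
    return stripped.lower().strip()

df = pd.DataFrame({
    "keyword": ["  Crème Brûlée recipe ", "creme brulee recipe"],
    "cluster_id": ["c1", "c1"],
    "is_head": [True, False],
})
df["keyword_norm"] = df["keyword"].map(normalize)

# Check 1: every cluster has exactly one head keyword
heads_per_cluster = df.groupby("cluster_id")["is_head"].sum()
assert (heads_per_cluster == 1).all(), "cluster with zero or multiple heads"

# Check 2: normalization surfaces hidden duplicates across rows
print(df["keyword_norm"].tolist())  # both rows collapse to the same string
```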
Visualizing it
```mermaid
flowchart TD
    Raw["Raw keyword pool (2000 rows)"] --> V1["Volume validation (3-tool cross-check)"]
    V1 --> V2["Intent validation (30-sec SERP test)"]
    V2 --> V3["Click validation (CPS >= 0.4)"]
    V3 --> Valid["Validated keyword pool (~1200 rows)"]
    Valid --> SERP["Pull top 10 URLs per keyword"]
    SERP --> Overlap["Compute pairwise SERP overlap"]
    Overlap --> Clust["Union-find clustering (>=3 URLs in common)"]
    Clust --> Master["Master spreadsheet (~80 clusters)"]
    Master --> Topic["Group clusters into topic pillars"]
    Topic --> Plan["Editorial roadmap"]
```
Bad vs. expert
The bad approach
Process:
1. Open keyword list of 2000 rows.
2. Group manually by string similarity.
3. "Running shoes" and "running shoe" → same page.
4. "Best running shoes" and "best running shoes for women" → same page.
5. Hand off to writers.
Result:
- The "best running shoes" SERP is 9 general listicles, each a single comprehensive page.
- The "best running shoes for women" SERP is 9 different listicles, all women-specific, all on separate pages.
- Google ranks the women-specific listicles for the women queries; the one consolidated page ranks for neither.
This fails because string similarity does not match Google’s intent classification. Two keywords that look the same can have completely different SERPs. Putting them on one page means the page is too broad to win either query.
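The mismatch is easy to demonstrate. Below, string similarity (via Python's `difflib`) is high while SERP overlap is zero; the URL lists are invented to mirror the running-shoes example above:

```python
# Sketch contrasting the two signals: SequenceMatcher measures string
# similarity; the SERP sets are invented to illustrate the mismatch.
from difflib import SequenceMatcher

a, b = "best running shoes", "best running shoes for women"
string_sim = SequenceMatcher(None, a, b).ratio()

serp_a = {"runnerworld.com/best", "wirecutter.com/shoes", "nike.com/guide"}
serp_b = {"runnerworld.com/women", "self.com/womens-shoes", "rei.com/womens"}
url_overlap = len(serp_a & serp_b)

print(f"string similarity: {string_sim:.2f}, shared URLs: {url_overlap}")
# High string similarity, zero shared URLs: these are different pages.
```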
The expert approach
```python
# serp_overlap_clustering.py — cluster by URL overlap, not string similarity
import pandas as pd
from itertools import combinations
import networkx as nx

# Input: dataframe with columns keyword, top_urls (list of 10 URLs)
df = pd.read_csv("validated_keywords_with_serp.csv")
df["top_urls"] = df["top_urls"].str.split("|")  # pipe-delimited list

THRESHOLD = 3  # URLs in common required to merge

G = nx.Graph()
for kw in df["keyword"]:
    G.add_node(kw)

# Build edges where SERP overlap >= threshold
for (i, row_a), (j, row_b) in combinations(df.iterrows(), 2):
    overlap = len(set(row_a["top_urls"]) & set(row_b["top_urls"]))
    if overlap >= THRESHOLD:
        G.add_edge(row_a["keyword"], row_b["keyword"], weight=overlap)

# Connected components are clusters (singletons stay their own cluster)
clusters = list(nx.connected_components(G))

# Assign cluster IDs to each keyword
kw_to_cluster = {}
for idx, cluster in enumerate(clusters):
    for kw in cluster:
        kw_to_cluster[kw] = f"cluster_{idx:04d}"
df["cluster_id"] = df["keyword"].map(kw_to_cluster)

# Pick head keyword per cluster (highest traffic potential)
head = df.sort_values("traffic_potential", ascending=False).drop_duplicates("cluster_id")
df["is_head"] = df["keyword"].isin(head["keyword"])

df.to_csv("clustered_keywords.csv", index=False)
print(f"Clustered {len(df)} keywords into {df['cluster_id'].nunique()} clusters.")
```
This works because the cluster decision is rooted in Google’s own behavior. Two keywords that share URLs in the top 10 are queries Google has decided to answer with the same kind of page. One page, one cluster, one ranking opportunity per cluster — no internal cannibalization, no fractured authority.
Do this today
- Open your shortlist from Module 9 / 10. Verify each row has volume from at least two tools (Ahrefs, Semrush, GKP). Add columns for any missing source.
- Flag any keyword where volume estimates differ by more than 5x. Investigate manually — usually one tool is wrong, or it’s a multi-intent query worth splitting.
- Run a 30-second SERP test on the top 50 keywords. Verify intent and SERP feature columns match what you’d build.
- Open the Ahrefs Clicks-per-Search column. Cut any keyword with CPS < 0.4 unless brand-mention value justifies it.
- For 200+ keywords, use Keyword Insights at keywordinsights.ai ($58 for 1K keywords) or Surfer SEO Cluster to cluster by SERP overlap. For under 200, do it manually by sorting by parent topic and inspecting top URLs.
- Build the master spreadsheet with the columns from the table above. Use Google Sheets or Airtable for collaborative editing. Lock the column structure with a header row.
- For each cluster, name the target URL slug (`/best-crm-small-business/`, not `/post-id-9012/`). Mark the head keyword. Save the file as `keyword-master-q2-2026.xlsx`. This sheet is the input to Module 12 (Prioritization).