Markers from Google — GSC verdicts as named external benchmarks, aligned against internal health signals
DARE.CO.UK · PARKED SKETCH · 2026-05-31
Mirrored from ~/.claude/.../memory/project_markers_from_google_parked.md. This is a design sketch parked for future build — read for context, not as a current deliverable.
2026-05-18 sketch. Dan’s framing: every GSC per-URL indexing verdict (excluded-by-noindex, crawled-not-indexed, discovered-not-indexed, 404, redirect, etc.) is a NAMED MARKER. The benchmark exercise is aligning our internal health signals (body_image_coverage, content_breadth, header_audit, etc.) against those markers per-URL — agreement = high confidence; disagreement = signal worth investigating. Resume on day-7 GSC re-check OR Dan’s interest in weekly automated benchmark runs.
Trigger context (2026-05-18): GSC’s Page Indexing report showed 1,221 not-indexed pages across 10 reasons. A 30-second local scan indicated that the 409 “Excluded by noindex” URLs are almost certainly stale WordPress-era inventory (we emit noindex on 1 file: 404.html). Dan’s response framed the broader exercise: “All of your findings and assessments, in this, can be added to markers-from-google and we can align our health against it… it’s a good benchmark exercise.”
The framing — Google’s verdicts as markers:
Every URL on dare.co.uk has, at any moment, a verdict from Google: indexed, excluded-by-noindex, crawled-not-indexed, 404, redirect, etc. Treating these verdicts as markers (named external benchmark inputs) lets us:
- Align internal health signals against external judgment. We have body_image_coverage, content_breadth, header_audit, jsonld_presence, seo_title_audit. Google has its own quality verdict. When they agree (Google rejects + we know it’s thin) → high-confidence quality issue. When they disagree (Google rejects + we think it’s healthy) → either Google’s wrong or our signal is missing the dimension Google’s catching.
- Track which signals predict which markers. Over time, calibrate: does our content_breadth “stub” classification correlate with Google’s “crawled-not-indexed”? Does body_image_coverage’s “no body image” correlate with rejection? Build a predictive model.
- Benchmark health against an external authority. Google’s verdict is one of the few cross-vendor health benchmarks available. Treating it as a named input (not just a one-off observation) makes it composable.
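A minimal sketch of the agree/disagree logic and the signal-vs-marker tally described above. Everything here is illustrative: the function names, the "signal:value" flag strings, the sample URLs, and the choice of which markers count as a Google quality rejection are assumptions, not settled design.

```python
from collections import Counter

# Assumption: only these markers are treated as Google rejecting a page on
# quality grounds. noindex / 404 / redirect are deliberately left out here.
REJECTING_MARKERS = {"crawled-not-indexed", "discovered-not-indexed"}


def align(marker, internal_flags):
    """Compare one Google marker with our internal flags for the same URL.

    internal_flags lists the internal signals that reported a problem;
    an empty list means we consider the page healthy.
    """
    google_rejects = marker in REJECTING_MARKERS
    we_flag = bool(internal_flags)
    if google_rejects and we_flag:
        return "agree-unhealthy"          # high-confidence quality issue
    if google_rejects:
        return "disagree-google-rejects"  # Google sees something we miss
    if we_flag:
        return "disagree-we-flag"         # we flag it, Google accepts it
    return "agree-healthy"


def cross_tab(rows):
    """Count (internal flag, Google marker) pairs across all rows."""
    counts = Counter()
    for row in rows:
        for flag in row["flags"]:
            counts[(flag, row["marker"])] += 1
    return counts


# Hypothetical sample data, not real dare.co.uk results.
sample_rows = [
    {"url": "/a/", "marker": "crawled-not-indexed", "flags": ["content_breadth:stub"]},
    {"url": "/b/", "marker": "crawled-not-indexed", "flags": []},
    {"url": "/c/", "marker": "indexed", "flags": []},
]

for row in sample_rows:
    print(row["url"], align(row["marker"], row["flags"]))
```

The cross_tab counts are the raw material for the calibration question (which signals predict which markers); a real run would need enough URLs per cell before reading anything into them.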
Architecture sketch — dare_markers_from_google.py
Phase 1: ingest (manual CSV first, API later)
GSC offers two ingest paths:
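- CSV export: a manual click in the GSC UI. Three sources of friction: a human in the loop, a 1000-row limit per export, and each export covers only the currently displayed reason / view.
- GSC API (Search Console API v1): programmatic. Requires OAuth + service-account-level access. Once configured, automatable. Need a gcp-search-console-readonly service account + 1Password entry. Caveat to check before building V2: the API does not expose the Page Indexing report in bulk; per-URL verdicts come from the URL Inspection endpoint, one URL per request, under a daily per-property quota.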
V1: CSV. Save GSC’s exported Pages.csv as ~/Downloads/dare_gsc_page_indexing_<date>.csv; the script reads and parses it. Each row = {url, last_crawled, coverage_state} per GSC’s export shape. V2: API.
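A rough sketch of the V1 ingest step, under stated assumptions: the column headers "URL" and "Last crawled" are guesses at GSC's export shape and should be checked against a real file, and because an export covers only one reason at a time, coverage_state is passed in by the caller rather than read from the CSV. The function names are hypothetical.

```python
import csv


def parse_gsc_export(lines, coverage_state):
    """Turn the lines of one GSC export into {url, last_crawled, coverage_state} rows.

    lines: any iterable of CSV text lines (an open file works).
    coverage_state: the reason this export was taken for, e.g. "crawled-not-indexed".
    """
    rows = []
    for record in csv.DictReader(lines):
        # Assumed header names; adjust once a real export has been inspected.
        url = (record.get("URL") or "").strip()
        if not url:
            continue  # skip blank or malformed rows
        last_crawled = (record.get("Last crawled") or "").strip() or None
        rows.append({
            "url": url,
            "last_crawled": last_crawled,
            "coverage_state": coverage_state,
        })
    return rows


def load_gsc_export(path, coverage_state):
    """Open an export file and parse it; utf-8-sig tolerates a leading BOM."""
    with open(path, newline="", encoding="utf-8-sig") as fh:
        return parse_gsc_export(fh, coverage_state)
```

Usage would be one call per exported reason, for example load_gsc_export(path_to_csv, "excluded-by-noindex"), with the results concatenated before normalization.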
Phase 2: normalize
Bring each row into our common page-result shape: