Markers from Google — GSC verdicts as named external benchmarks, aligned against internal health signals
DARE.CO.UK · PARKED SKETCH · 2026-05-18
Mirrored from ~/.claude/.../memory/project_markers_from_google_parked.md. This is a design sketch parked for future build — read for context, not as a current deliverable.
2026-05-18 sketch. Dan’s framing: every GSC per-URL indexing verdict (excluded-by-noindex, crawled-not-indexed, discovered-not-indexed, 404, redirect, etc.) is a NAMED MARKER. The benchmark exercise is aligning our internal health signals (body_image_coverage, content_breadth, header_audit, etc.) against those markers per-URL — agreement = high confidence; disagreement = signal worth investigating. Resume on day-7 GSC re-check OR Dan’s interest in weekly automated benchmark runs.
Trigger context (2026-05-18): GSC’s Page Indexing report showed 1,221 not-indexed pages across 10 reasons. A 30-second local scan confirmed the 409 “Excluded by noindex” is almost certainly stale WordPress-era inventory (we emit noindex on 1 file: 404.html). Dan’s response framed the broader exercise: “All of your findings and assessments, in this, can be added to markers-from-google and we can align our health against it… it’s a good benchmark exercise.”
The framing — Google’s verdicts as markers:
Every URL on dare.co.uk has, at any moment, a verdict from Google: indexed, excluded-by-noindex, crawled-not-indexed, 404, redirect, etc. Treating these verdicts as markers (named external benchmark inputs) lets us:
- Align internal health signals against external judgment. We have body_image_coverage, content_breadth, header_audit, jsonld_presence, seo_title_audit. Google has its own quality verdict. When they agree (Google rejects + we know it’s thin) → high-confidence quality issue. When they disagree (Google rejects + we think it’s healthy) → either Google’s wrong or our signal is missing the dimension Google’s catching.
- Track which signals predict which markers. Over time, calibrate: does our content_breadth “stub” classification correlate with Google’s “crawled-not-indexed”? Does body_image_coverage’s “no body image” correlate with rejection? Build a predictive model.
- Benchmark health against an external authority. Google’s verdict is one of the few cross-vendor health benchmarks available. Treating it as a named input (not just a one-off observation) makes it composable.
Architecture sketch — dare_markers_from_google.py
Phase 1: ingest (manual CSV first, API later)
GSC offers two ingest paths:
- CSV export — manual click in GSC UI. Friction: human in the loop. Friction: 1000-row limit per export. Friction: only exports the currently-displayed reason / view.
- GSC API (Search Console API v1) — programmatic. Requires OAuth + service-account-level access. Once configured, automatable. Need a gcp-search-console-readonly service account + 1Password entry.
V1: CSV. Drop GSC’s exported Pages.csv into ~/Downloads/dare_gsc_page_indexing_<date>.csv; script reads, parses. Each row = {url, last_crawled, coverage_state} per GSC’s export shape. V2: API.
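A minimal sketch of the V1 read step, assuming the export carries `URL` and `Last crawled` columns (the real header names need checking against an actual export) and that the coverage state is supplied per file, since each export covers only the displayed reason. `read_gsc_rows` is a hypothetical helper name.

```python
import csv
import io
from typing import Iterator

# Assumed column headers; GSC's real export may name these differently.
URL_COL = "URL"
LAST_CRAWLED_COL = "Last crawled"


def read_gsc_rows(handle, coverage_state: str) -> Iterator[dict]:
    """Yield one {url, last_crawled, coverage_state} dict per CSV row."""
    for row in csv.DictReader(handle):
        url = (row.get(URL_COL) or "").strip()
        if not url:
            continue  # skip blank or malformed lines
        yield {
            "url": url,
            "last_crawled": (row.get(LAST_CRAWLED_COL) or "").strip() or None,
            "coverage_state": coverage_state,
        }


# In-memory stand-in for an exported file, so the sketch runs without GSC.
sample = io.StringIO(
    "URL,Last crawled\n"
    "https://dare.co.uk/a/,2026-05-10\n"
    "https://dare.co.uk/b/,\n"
)
rows = list(read_gsc_rows(sample, "Crawled - currently not indexed"))
```

Passing the coverage state in as an argument means one call per exported reason; the caller concatenates the results.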
Phase 2: normalize
Bring each row into our common page-result shape.
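The shape itself is not reproduced in this sketch. As an illustration only, normalization could look like the following, where `PageResult`, its fields, and the `MARKERS` mapping are all assumptions rather than the toolkit's actual definitions.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical mapping from GSC reason labels to short marker names.
MARKERS = {
    "excluded by 'noindex' tag": "excluded-by-noindex",
    "crawled - currently not indexed": "crawled-not-indexed",
    "discovered - currently not indexed": "discovered-not-indexed",
    "not found (404)": "404",
    "page with redirect": "redirect",
}


@dataclass(frozen=True)
class PageResult:
    # Assumed fields; the real common page-result shape may differ.
    url: str
    path: str
    marker: str
    last_crawled: Optional[str]


def normalize(row: dict) -> PageResult:
    # Fold curly quotes and case so label variants map to one marker.
    label = row["coverage_state"].replace("\u2018", "'").replace("\u2019", "'")
    marker = MARKERS.get(label.strip().lower(), "other")
    path = urlsplit(row["url"]).path or "/"
    return PageResult(row["url"], path, marker, row.get("last_crawled"))


page = normalize({
    "url": "https://dare.co.uk/news/x/",
    "coverage_state": "Excluded by \u2018noindex\u2019 tag",
    "last_crawled": None,
})
```

Unmapped labels fall through to `"other"` so a new GSC reason shows up in the output instead of failing the run.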
Phase 3: cross-reference with internal signals
For each URL, look up our internal signals:
- body_image_coverage → has-body-image / no-body-image
- content_breadth → bucket: stub / brief / medium / long-form
- header_audit → canonical / drift / no-header
- jsonld_presence → has-jsonld / no-jsonld
- seo_title_audit → categories of drift if any
- 404_audit → known-broken / OK
- Page age (mtime / publish_date)
Output per URL: {url, marker, internal_signals: {...}, alignment: "agree" | "disagree" | "neutral"}.
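A sketch of the alignment rule, using content_breadth as the only internal signal. Which markers count as "rejected" and which buckets count as "thin" are assumptions here; whether "brief" belongs in the thin set is one of the open calibration questions.

```python
# Assumed groupings; both sets are calibration choices, not settled definitions.
REJECTED = {"crawled-not-indexed", "discovered-not-indexed"}
THIN = {"stub", "brief"}
HEALTHY = {"medium", "long-form"}


def classify_alignment(marker: str, content_breadth: str) -> str:
    """Return "agree", "disagree", or "neutral" for one URL."""
    thin = content_breadth in THIN
    healthy = content_breadth in HEALTHY
    if marker in REJECTED:
        if thin:
            return "agree"      # we think it is thin, Google rejected it
        if healthy:
            return "disagree"   # we think it is healthy, Google rejected it
    elif marker == "indexed" and thin:
        return "disagree"       # we think it is thin, Google indexed it anyway
    return "neutral"


def cross_reference(url: str, marker: str, signals: dict) -> dict:
    """Build the per-URL output record described above."""
    return {
        "url": url,
        "marker": marker,
        "internal_signals": signals,
        "alignment": classify_alignment(marker, signals.get("content_breadth", "")),
    }


record = cross_reference(
    "https://dare.co.uk/a/", "crawled-not-indexed", {"content_breadth": "stub"}
)
```

Markers such as redirect or 404 land in "neutral" under this rule because they are not quality verdicts.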
Phase 4: aggregate + render
Three views:
A. By marker × internal signal
| Google marker | content stub | content brief | content medium | content long-form |
|------------------------|-------------:|--------------:|---------------:|------------------:|
| Indexed | 12 | 45 | 87 | 156 |
| Crawled-not-indexed | 186 | 167 | 73 | 22 |
| Discovered-not-indexed | 132 | 98 | 36 | 10 |
| Excluded-by-noindex | 4 | 2 | 1 | 0 |
Reading down the columns tells the story: Google rejects most of our stubs and briefs, while long-form is mostly indexed. → priority cohort = stub + brief that Google has rejected.
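The table's claim can be checked as arithmetic. Taking the stub and long-form columns above and counting crawled-not-indexed plus discovered-not-indexed as "rejected" (an assumption), the rejected share is 318 of 334 stubs (about 95%) against 32 of 188 long-form pages (about 17%). `rejected_share` is a hypothetical helper.

```python
from collections import Counter

# Assumed definition of "rejected" for this view.
REJECTED = ("crawled-not-indexed", "discovered-not-indexed")


def rejected_share(table: Counter, bucket: str) -> float:
    """Fraction of pages in one content bucket that carry a rejected marker."""
    total = sum(n for (_, b), n in table.items() if b == bucket)
    rejected = sum(
        n for (m, b), n in table.items() if b == bucket and m in REJECTED
    )
    return rejected / total if total else 0.0


# Counts copied from the stub and long-form columns of view A.
table = Counter({
    ("indexed", "stub"): 12,
    ("indexed", "long-form"): 156,
    ("crawled-not-indexed", "stub"): 186,
    ("crawled-not-indexed", "long-form"): 22,
    ("discovered-not-indexed", "stub"): 132,
    ("discovered-not-indexed", "long-form"): 10,
    ("excluded-by-noindex", "stub"): 4,
    ("excluded-by-noindex", "long-form"): 0,
})

stub_share = rejected_share(table, "stub")
long_form_share = rejected_share(table, "long-form")
```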
B. Alignment summary
- AGREE — internal "thin" + Google "rejected": N pages — high-confidence rewrite/delete candidates.
- DISAGREE — internal "healthy" + Google "rejected": N pages — investigate why Google's not indexing despite our signals being green.
- DISAGREE — internal "thin" + Google "indexed": N pages — Google's giving us a pass; consider whether they could be more discoverable with more depth.
- NEUTRAL — everything else.
C. Per-URL drilldown table for the disagreement cohorts — those are the highest-leverage investigation queue.
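Views B and C can share one pass over the per-URL records. This is a sketch with hypothetical cohort names (`healthy-but-rejected`, `thin-but-indexed`); it assumes each record already carries `marker` and `alignment`.

```python
from collections import defaultdict


def cohort(record: dict) -> str:
    """Split "disagree" into its two directions; pass other values through."""
    if record["alignment"] != "disagree":
        return record["alignment"]
    if record["marker"] == "indexed":
        return "thin-but-indexed"
    return "healthy-but-rejected"


def build_views(records):
    """Return (summary counts per cohort, sorted disagreement URL queue)."""
    by_cohort = defaultdict(list)
    for record in records:
        by_cohort[cohort(record)].append(record["url"])
    summary = {name: len(urls) for name, urls in by_cohort.items()}
    queue = sorted(
        by_cohort["healthy-but-rejected"] + by_cohort["thin-but-indexed"]
    )
    return summary, queue


records = [
    {"url": "/a/", "marker": "crawled-not-indexed", "alignment": "agree"},
    {"url": "/b/", "marker": "crawled-not-indexed", "alignment": "disagree"},
    {"url": "/c/", "marker": "indexed", "alignment": "disagree"},
    {"url": "/d/", "marker": "indexed", "alignment": "neutral"},
]
summary, queue = build_views(records)
```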
Phase 5: time-series (when API integration lands)
Daily/weekly pull → store snapshots → render trend chart per marker. Did “crawled-not-indexed” drop after a content quality pass? Did “indexed” grow after a JSON-LD rollout? The smoothed-area chart pattern (feedback_smoothed_area_chart_over_time.md) is the canonical visual.
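One possible snapshot store, sketched under the assumption that each pull is reduced to per-marker counts and written as one JSON file named by ISO date. The file layout and both function names are illustrative, not an existing convention.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def save_snapshot(directory: Path, date: str, markers_by_url: dict) -> Path:
    """Write per-marker counts for one pull to <directory>/<date>.json."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{date}.json"
    counts = Counter(markers_by_url.values())
    path.write_text(json.dumps(counts, sort_keys=True))
    return path


def marker_trend(directory: Path, marker: str):
    """Return [(date, count)] for one marker, oldest snapshot first."""
    points = []
    # ISO dates sort chronologically when sorted as strings.
    for path in sorted(directory.glob("*.json")):
        counts = json.loads(path.read_text())
        points.append((path.stem, counts.get(marker, 0)))
    return points


# Demo in a temporary directory with made-up URLs.
with tempfile.TemporaryDirectory() as tmp:
    snap_dir = Path(tmp)
    save_snapshot(snap_dir, "2026-05-18", {
        "/a/": "crawled-not-indexed",
        "/b/": "crawled-not-indexed",
        "/c/": "indexed",
    })
    save_snapshot(snap_dir, "2026-05-25", {
        "/a/": "crawled-not-indexed",
        "/b/": "indexed",
        "/c/": "indexed",
    })
    trend = marker_trend(snap_dir, "crawled-not-indexed")
```

Storing counts rather than full per-URL rows keeps snapshots small; the per-URL rows would be needed as well if the chart should explain which pages moved.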
Today’s baseline (2026-05-18)
For reference when the day-7 re-check happens:
| Reason | Source | Count |
|---|---|---|
| Excluded by ‘noindex’ tag | Website | 409 |
| Page with redirect | Website | 37 |
| Not found (404) | Website | 25 |
| Alternate page with proper canonical tag | Website | 12 |
| Blocked due to other 4xx issue | Website | 8 |
| Server error (5xx) | Website | 1 |
| Blocked due to access forbidden (403) | Website | 1 |
| Crawled - currently not indexed | Google systems | 448 |
| Discovered - currently not indexed | Google systems | 276 |
| Duplicate, Google chose different canonical than user | Google systems | 4 |
| Total not-indexed | | 1,221 |
Local scan confirmed: 1 file emits noindex (404.html). The remaining 408+ noindex hits in GSC are almost certainly stale WordPress-era inventory.
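For the re-check, the local scan can be reproduced with something like the sketch below. It is not the scan that was actually run. It only catches a robots or googlebot meta tag whose `name` attribute precedes `content`, and it cannot see `X-Robots-Tag` response headers.

```python
import re
import tempfile
from pathlib import Path

# Matches <meta name="robots" ... content="...noindex..."> (name before content).
NOINDEX_RE = re.compile(
    r'<meta\s[^>]*name=["\'](?:robots|googlebot)["\'][^>]*'
    r'content=["\'][^"\']*noindex',
    re.IGNORECASE,
)


def files_emitting_noindex(root: Path):
    """Return relative paths of .html files under root with a noindex meta tag."""
    hits = []
    for path in sorted(root.rglob("*.html")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if NOINDEX_RE.search(text):
            hits.append(path.relative_to(root).as_posix())
    return hits


# Demo against a throwaway directory standing in for the built site.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "404.html").write_text(
        '<head><meta name="robots" content="noindex, follow"></head>'
    )
    (root / "index.html").write_text(
        '<head><meta name="robots" content="index, follow"></head>'
    )
    hits = files_emitting_noindex(root)
```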
Cross-portfolio applicability
Same framework lifts to dogwood, audrey, client engagements as soon as each site has:
- A GSC property registered
- The toolkit’s local health audits running (*_content_breadth_audit, *_body_image_coverage, etc.)
The script’s interface is brand-agnostic: --repo, --csv, --audits-dir. The framework portability claim earns its keep once we run it on a second site.
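The three flags could be wired up as below. This is a sketch: the flag names come from this note, while the help text, the `required=True` choice, and the example paths are assumptions.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="dare_markers_from_google.py",
        description="Align GSC indexing verdicts against local health audits.",
    )
    # Flag names are from the sketch; everything else here is an assumption.
    parser.add_argument("--repo", type=Path, required=True,
                        help="root of the site repository to audit")
    parser.add_argument("--csv", type=Path, required=True,
                        help="GSC Page Indexing export to ingest")
    parser.add_argument("--audits-dir", type=Path, required=True,
                        help="directory holding local health audit outputs")
    return parser


# Example invocation with placeholder paths.
args = build_parser().parse_args(
    ["--repo", "sites/dare", "--csv", "Pages.csv", "--audits-dir", "audits"]
)
```

Nothing brand-specific lives in the parser, which is what lets the same entry point run against a second site.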
Open design questions (decide when unparking)
- CSV vs API for V1. CSV is faster to ship (~2 hours) and validates the alignment logic. API is the right long-term shape (~half-day). Recommend CSV-first.
- Alignment thresholds. “Stub + crawled-not-indexed” = clear agreement. But what about “brief + crawled-not-indexed” — agreement or partial? Calibration needed on a 100-page sample.
- Time-series granularity. Once API lands: daily ingest? Weekly? GSC’s data is itself lagging by days, so weekly is probably sufficient.
- Output surface. Markdown report (sync to devreports per existing pattern) is the right v1. A dashboard card row (similar to Site Health) could land in v2.
Sibling memories
- feedback_audit_first_then_batch.md: same shape applied to GSC (audit + classify + align BEFORE recommending rewrites).
- feedback_window_toggles_are_high_value.md + feedback_smoothed_area_chart_over_time.md: the time-series view when API lands.
- feedback_state_emerges_from_data.md: alignment categorization derived from existing signals, not manual tagging.
- project_bulk_content_cleanup_pipeline_parked.md: sibling sketch; both are “structured ingest of messy external data + audit-driven decisions about what to do.”
- user_oss_fit_growth_applicability_audit.md: same audit-framework stance applied to OSS evaluation. Markers-from-Google is the SEO-vendor analogue.
Resume conditions
- ✅ Day-7 GSC re-check (2026-05-25) shows the predicted noindex drop — validates the hypothesis + makes the residue actionable.
- ✅ Dan wants weekly automated benchmark runs (would trigger GSC API OAuth setup work).
- ✅ A new portfolio brand or client engagement gets to “first GSC indexing audit” state.
- Earliest qualifying trigger gets the V1 build (CSV-first); subsequent triggers exercise the cross-site portability.