Markers from Google — GSC verdicts as named external benchmarks, aligned against internal health signals

DARE.CO.UK · PARKED SKETCH · 2026-05-18

Mirrored from ~/.claude/.../memory/project_markers_from_google_parked.md. This is a design sketch parked for future build — read for context, not as a current deliverable.

2026-05-18 sketch. Dan’s framing: every GSC per-URL indexing verdict (excluded-by-noindex, crawled-not-indexed, discovered-not-indexed, 404, redirect, etc.) is a NAMED MARKER. The benchmark exercise is aligning our internal health signals (body_image_coverage, content_breadth, header_audit, etc.) against those markers per-URL — agreement = high confidence; disagreement = signal worth investigating. Resume on day-7 GSC re-check OR Dan’s interest in weekly automated benchmark runs.


Trigger context (2026-05-18): GSC’s Page Indexing report showed 1,221 not-indexed pages across 10 reasons. A 30-second local scan indicated that the 409 “Excluded by noindex” entries are almost certainly stale WordPress-era inventory (we emit noindex on 1 file: 404.html). Dan’s response framed the broader exercise: “All of your findings and assessments, in this, can be added to markers-from-google and we can align our health against it… it’s a good benchmark exercise.”

The framing — Google’s verdicts as markers:

Every URL on dare.co.uk has, at any moment, a verdict from Google: indexed, excluded-by-noindex, crawled-not-indexed, 404, redirect, etc. Treating these verdicts as markers (named external benchmark inputs) lets us:

  1. Align internal health signals against external judgment. We have body_image_coverage, content_breadth, header_audit, jsonld_presence, seo_title_audit. Google has its own quality verdict. When they agree (Google rejects + we know it’s thin) → high-confidence quality issue. When they disagree (Google rejects + we think it’s healthy) → either Google’s wrong or our signal is missing the dimension Google’s catching.
  2. Track which signals predict which markers. Over time, calibrate: does our content_breadth “stub” classification correlate with Google’s “crawled-not-indexed”? Does body_image_coverage’s “no body image” correlate with rejection? Build a predictive model.
  3. Benchmark health against an external authority. Google’s verdict is one of the few cross-vendor health benchmarks available. Treating it as a named input (not just a one-off observation) makes it composable.

Architecture sketch — dare_markers_from_google.py

Phase 1: ingest (manual CSV first, API later)

GSC offers two ingest paths:

- CSV export: a manual click in the GSC UI. Three points of friction: a human in the loop, a 1000-row limit per export, and each export covers only the currently displayed reason / view.
- GSC API (Search Console API v1): programmatic. Requires OAuth + service-account-level access. Once configured, automatable. Needs a gcp-search-console-readonly service account + 1Password entry.

V1: CSV. Drop GSC’s exported Pages.csv into ~/Downloads/dare_gsc_page_indexing_<date>.csv; script reads, parses. Each row = {url, last_crawled, coverage_state} per GSC’s export shape. V2: API.
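A minimal sketch of the V1 ingest, assuming the export carries the three fields named above. The header names `url`, `last_crawled` and `coverage_state`, and the function name `read_gsc_export`, are assumptions for illustration; the real export header may be worded differently and would need mapping first.

```python
import csv
from pathlib import Path
from typing import Iterator

# Assumed header names; adjust to whatever the real export uses.
REQUIRED_COLUMNS = ("url", "last_crawled", "coverage_state")


def read_gsc_export(csv_path: Path) -> Iterator[dict[str, str]]:
    """Yield one {url, last_crawled, coverage_state} dict per usable CSV row."""
    # utf-8-sig drops a byte-order mark if the export starts with one.
    with csv_path.open(newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [col for col in REQUIRED_COLUMNS if col not in header]
        if missing:
            raise ValueError(f"{csv_path.name}: missing columns {missing}")
        for row in reader:
            url = (row["url"] or "").strip()
            if not url:
                continue  # skip rows with no URL
            yield {
                "url": url,
                "last_crawled": (row["last_crawled"] or "").strip(),
                "coverage_state": (row["coverage_state"] or "").strip(),
            }
```

Because this is a generator, the missing-column check fires on first iteration, not when the function is called.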

Phase 2: normalize

Bring each row into our common page-result shape:

[page-result shape block missing from the rendered source]
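The exact page-result shape is not shown here, so the following is only a sketch of what normalization might look like. The `PageResult` fields, the `normalize_row` helper and the `"unmapped"` fallback are all hypothetical; the reason labels come from the baseline counts in this note and the marker slugs from the framing section.

```python
from dataclasses import dataclass

# GSC reason label -> marker slug (labels from this note's baseline list).
REASON_TO_MARKER = {
    "Excluded by 'noindex' tag": "excluded-by-noindex",
    "Crawled - currently not indexed": "crawled-not-indexed",
    "Discovered - currently not indexed": "discovered-not-indexed",
    "Not found (404)": "404",
    "Page with redirect": "redirect",
}


@dataclass(frozen=True)
class PageResult:
    """Hypothetical common shape; the real field set may differ."""
    url: str
    marker: str
    last_crawled: str
    source: str = "gsc"


def _clean_label(label: str) -> str:
    # Exports may use curly quotes around 'noindex'; fold them to straight ones.
    return label.replace("\u2018", "'").replace("\u2019", "'").strip()


def normalize_row(row: dict[str, str]) -> PageResult:
    """Map one ingested row to a PageResult; unknown reasons become 'unmapped'."""
    marker = REASON_TO_MARKER.get(_clean_label(row["coverage_state"]), "unmapped")
    return PageResult(
        url=row["url"].strip(),
        marker=marker,
        last_crawled=row["last_crawled"].strip(),
    )
```

Keeping unknown reasons as `"unmapped"` rather than raising means a new GSC reason shows up in the report instead of breaking the run.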

Phase 3: cross-reference with internal signals

For each URL, look up our internal signals:

- body_image_coverage → has-body-image / no-body-image
- content_breadth → bucket: stub / brief / medium / long-form
- header_audit → canonical / drift / no-header
- jsonld_presence → has-jsonld / no-jsonld
- seo_title_audit → categories of drift if any
- 404_audit → known-broken / OK
- Page age (mtime / publish_date)

Output per URL: {url, marker, internal_signals: {...}, alignment: "agree" | "disagree" | "neutral"}.

Phase 4: aggregate + render

Three views:

A. By marker × internal signal

| Google marker          | content stub | content brief | content medium | content long-form |
|------------------------|-------------:|--------------:|---------------:|------------------:|
| Indexed                |          12  |           45  |            87  |              156  |
| Crawled-not-indexed    |         186  |          167  |            73  |               22  |
| Discovered-not-indexed |         132  |           98  |            36  |               10  |
| Excluded-by-noindex    |           4  |            2  |             1  |                0  |

The stub and brief columns tell the story: Google rejects most of our stubs + briefs, while long-form is mostly indexed. → priority cohort = stub + brief that Google has rejected.
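View A is a plain cross-tabulation, which could be sketched as below. It assumes each cross-referenced row is a dict with `marker` and `content_breadth` keys; those key names and both function names are illustrative, not part of any existing script.

```python
from collections import Counter

BUCKETS = ("stub", "brief", "medium", "long-form")


def crosstab(rows: list[dict[str, str]]) -> Counter:
    """Count pages per (marker, content_breadth bucket) pair."""
    return Counter((row["marker"], row["content_breadth"]) for row in rows)


def render_markdown(counts: Counter, markers: list[str]) -> str:
    """Render the counts as a pipe table, one row per marker."""
    header = "| Google marker | " + " | ".join(f"content {b}" for b in BUCKETS) + " |"
    rule = "|---|" + "---:|" * len(BUCKETS)
    lines = [header, rule]
    for marker in markers:
        # A Counter returns 0 for pairs that never occurred.
        cells = " | ".join(str(counts[(marker, b)]) for b in BUCKETS)
        lines.append(f"| {marker} | {cells} |")
    return "\n".join(lines)
```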

B. Alignment summary

- AGREE — internal "thin" + Google "rejected": N pages — high-confidence rewrite/delete candidates.
- DISAGREE — internal "healthy" + Google "rejected": N pages — investigate why Google's not indexing despite our signals being green.
- DISAGREE — internal "thin" + Google "indexed": N pages — Google's giving us a pass; consider whether they could be more discoverable with more depth.
- NEUTRAL — everything else.
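The four cohorts above reduce to a small decision function. This sketch uses content_breadth as the only "thin / healthy" signal and treats both stub and brief as thin; both choices are assumptions (the thresholds are an open design question), and the cohort labels are made up for illustration.

```python
# Assumed groupings; calibrate before trusting them.
REJECTED_MARKERS = frozenset({"crawled-not-indexed", "discovered-not-indexed"})
THIN_BUCKETS = frozenset({"stub", "brief"})
HEALTHY_BUCKETS = frozenset({"medium", "long-form"})


def classify_alignment(marker: str, content_breadth: str) -> tuple[str, str]:
    """Return (alignment, cohort) for one URL."""
    rejected = marker in REJECTED_MARKERS
    indexed = marker == "indexed"
    thin = content_breadth in THIN_BUCKETS
    healthy = content_breadth in HEALTHY_BUCKETS

    if thin and rejected:
        return ("agree", "thin-and-rejected")      # rewrite/delete candidates
    if healthy and rejected:
        return ("disagree", "healthy-but-rejected")  # investigate Google's verdict
    if thin and indexed:
        return ("disagree", "thin-but-indexed")    # Google is giving a pass
    return ("neutral", "other")
```

Returning the cohort alongside the alignment keeps the two "disagree" cases separable for the drilldown view.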

C. Per-URL drilldown table for the disagreement cohorts — those are the highest-leverage investigation queue.

Phase 5: time-series (when API integration lands)

Daily/weekly pull → store snapshots → render trend chart per marker. Did “crawled-not-indexed” drop after a content quality pass? Did “indexed” grow after a JSON-LD rollout? The smoothed-area chart pattern (feedback_smoothed_area_chart_over_time.md) is the canonical visual.
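Before any chart exists, the trend questions can be answered by diffing two snapshots. A small sketch, assuming each snapshot is reduced to a marker → count mapping; the function names are hypothetical and storage is left out.

```python
from collections import Counter


def snapshot_counts(page_results: list[dict[str, str]]) -> Counter:
    """Reduce one pull to a count of pages per marker."""
    return Counter(row["marker"] for row in page_results)


def marker_deltas(earlier: dict[str, int], later: dict[str, int]) -> dict[str, int]:
    """Change in page count per marker between two snapshots."""
    markers = sorted(set(earlier) | set(later))
    # A marker absent from one snapshot counts as zero there.
    return {m: later.get(m, 0) - earlier.get(m, 0) for m in markers}
```

A negative delta for crawled-not-indexed after a content pass would be the signal to look for.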

Today’s baseline (2026-05-18)

For reference when the day-7 re-check happens:

| Reason                                                | Source         | Count |
|-------------------------------------------------------|----------------|------:|
| Excluded by ‘noindex’ tag                             | Website        |   409 |
| Page with redirect                                    | Website        |    37 |
| Not found (404)                                       | Website        |    25 |
| Alternate page with proper canonical tag              | Website        |    12 |
| Blocked due to other 4xx issue                        | Website        |     8 |
| Server error (5xx)                                    | Website        |     1 |
| Blocked due to access forbidden (403)                 | Website        |     1 |
| Crawled - currently not indexed                       | Google systems |   448 |
| Discovered - currently not indexed                    | Google systems |   276 |
| Duplicate, Google chose different canonical than user | Google systems |     4 |
| Total not-indexed                                     |                | 1,221 |

Local scan confirmed that only 1 file emits noindex (404.html), so at least 408 of GSC’s 409 noindex hits are almost certainly stale.
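For the day-7 re-check, a local noindex scan of this kind can be reproduced with something like the sketch below. It is not the scan that was actually run; it assumes the built site is a directory of .html files and only looks at `<meta name="robots">` tags, so a noindex sent via an X-Robots-Tag response header would not be caught.

```python
import re
from pathlib import Path

META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
ROBOTS_NAME = re.compile(r"""name\s*=\s*["']?robots["']?""", re.IGNORECASE)


def emits_noindex(html: str) -> bool:
    """True if any robots meta tag in the markup mentions noindex."""
    for tag in META_TAG.findall(html):
        if ROBOTS_NAME.search(tag) and "noindex" in tag.lower():
            return True
    return False


def scan_for_noindex(site_root: Path) -> list[Path]:
    """Return site-relative paths of .html files that emit noindex."""
    hits = []
    for path in sorted(site_root.rglob("*.html")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if emits_noindex(text):
            hits.append(path.relative_to(site_root))
    return hits
```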

Cross-portfolio applicability

Same framework lifts to dogwood, audrey, client engagements as soon as each site has:

- A GSC property registered
- The toolkit’s local health audits running (*_content_breadth_audit, *_body_image_coverage, etc.)

The script’s interface is brand-agnostic: --repo, --csv, --audits-dir. The framework portability claim earns its keep once we run it on a second site.
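The three flags map directly onto an argparse parser. A sketch of that interface follows; the flag names come from this note, while the help text, the `required=True` choice and the `build_parser` name are assumptions.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Brand-agnostic CLI: nothing here is specific to one site."""
    parser = argparse.ArgumentParser(
        prog="dare_markers_from_google.py",
        description="Align GSC indexing markers against local health audits.",
    )
    parser.add_argument("--repo", type=Path, required=True,
                        help="path to the site repository")
    parser.add_argument("--csv", type=Path, required=True,
                        help="GSC Page Indexing CSV export")
    parser.add_argument("--audits-dir", type=Path, required=True,
                        help="directory holding the local audit outputs")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse exposes --audits-dir as args.audits_dir.
    print(args.repo, args.csv, args.audits_dir)
```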

Open design questions (decide when unparking)

  1. CSV vs API for V1. CSV is faster to ship (~2 hours) and validates the alignment logic. API is the right long-term shape (~half-day). Recommend CSV-first.
  2. Alignment thresholds. “Stub + crawled-not-indexed” = clear agreement. But what about “brief + crawled-not-indexed” — agreement or partial? Calibration needed on a 100-page sample.
  3. Time-series granularity. Once API lands: daily ingest? Weekly? GSC’s data is itself lagging by days, so weekly is probably sufficient.
  4. Output surface. Markdown report (sync to devreports per existing pattern) is the right v1. A dashboard card row (similar to Site Health) could land in v2.

Sibling memories

Resume conditions

Source: parked_sketch_markers_from_google_2026-05-18.md · Rendered 2026-05-18 12:53