dare_sitemap_regen.py — architecture sketch · 2026-05-14

Parked workstream. Portfolio-portable static-site sitemap regenerator that emits valid sitemap.xml + post-sitemap.xml + page-sitemap.xml from a static repo. Designed to retire the WP-era frozen-in-amber post-sitemap.xml (last edited 2026-05-07; carries 506 dead http://edge.dare.co.uk/wp/* entries). Lives in ~/bin/ so the same script handles dare → dogwood → audrey → client work via config swap.

Genesis: today’s Edge-cohort migration left the sitemap stale (3 entries fixed by hand, commit 24b8736e). Doing that 506 more times by hand is a non-starter; generating it from current repo state is the right substrate.

What it does

One walk of the repo, three outputs:

sitemap.xml — the index. References the children with current <lastmod> values.
post-sitemap.xml — article-tree <url> blocks (one per migrated article), each carrying <image:image> children for body imagery.
page-sitemap.xml — top-level static pages (/contact/, /privacy-policy/, policy pages, the four section listings).

Auto-shards if >= 45,000 URLs or >= 45 MB. Auto-skips if no diff vs deployed sitemap. Emits a dated report to ~/Downloads/dare_sitemap_regen_<date>.md per the always-publish rule.

Inputs (config-driven, never hardcoded)

A per-site YAML at ~/.config/dare-sitemap/<site>.yaml:

repo_root:       /Users/dansellars/Code/dare-co-uk
canonical_base:  https://www.dare.co.uk
image_cdn_base:  https://images.dare.co.uk
excludes:
  - wp-content/**
  - wp-includes/**
  - "**/*.bak-*"
  - "**/index.html.bak-*"
  - error-404/**
  - dare_migrate_failures_*.txt
redirect_file:   _redirects        # source-column entries get skipped from sitemap
post_paths:                         # globs that go into post-sitemap.xml
  - architecture/*/index.html
  - cinema/*/index.html
  - methods-of-business-design/*/index.html
  - culture-means-thriving-teams/*/index.html
  - field-notes-from-business-design/*/index.html
  - daring-acts/*/index.html
  - observations/*/index.html
  - books/*/index.html
  - brands/*/index.html
  - albums/*/index.html
  - photography/*/index.html
  - users/*/index.html
  - archive/*/index.html
page_paths:                         # globs that go into page-sitemap.xml
  - contact.html
  - privacy-policy/**
  - anti-spam-policy/**
  - dmca-policy/**
  - sitemap/**
  - "*/archive/index.html"          # section root archives
  - methods-of-business-design/index.html
  - culture-means-thriving-teams/index.html
  - field-notes-from-business-design/index.html
  - daring-acts/index.html
homepage:        index.html         # treated as a top-level "page"

Dogwood / audrey / client engagements each get their own YAML; the script is otherwise identical.
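
A minimal sketch of how the config globs could be turned into a post/page/skip decision. `glob_to_regex`, `SiteConfig` and `classify` are hypothetical names, and the glob semantics (`**` crosses directories, `*` stays inside one path segment) are an assumption, not something the config format above defines. YAML loading is left out so the sketch stays stdlib-only.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a config glob into an anchored regex (assumed semantics)."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")      # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")            # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")         # anything within one path segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


@dataclass
class SiteConfig:
    post_paths: list[str]
    page_paths: list[str]
    excludes: list[str] = field(default_factory=list)

    def classify(self, rel_path: str) -> str | None:
        """Return 'post', 'page', or None for a repo-relative POSIX path."""
        def hit(globs: list[str]) -> bool:
            return any(glob_to_regex(g).match(rel_path) for g in globs)

        if hit(self.excludes):
            return None
        # Assumption: page globs win over post globs when both match.
        if hit(self.page_paths):
            return "page"
        if hit(self.post_paths):
            return "post"
        return None
```

Precedence matters with the config as written: `cinema/archive/index.html` matches both `cinema/*/index.html` (post) and `*/archive/index.html` (page), so whichever list is checked first decides where section archives land.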

The walk (one function per concern)

collect(config) →
    for each html_path in walk(repo, includes=post+page, excludes):
        parsed = lxml.html.parse(html_path)
        canonical = extract_canonical(parsed)        # <link rel="canonical">
        lastmod   = extract_lastmod(html_path, parsed)
            ↪ priority: JSON-LD dateModified > article:modified_time
                       > git log -1 --format=%cI -- <file> > file mtime
        images    = extract_images(parsed)           # see classify_images() below
        yield Entry(canonical, lastmod, images, is_post=in_post_globs)

render(entries) →
    sort by canonical URL                            # stable output
    split into post/page/maybe-image-only sitemaps
    serialise via ElementTree (escapes &, no BOM, LF endings)
    write to repo_root/<sitemap>.xml

validate(written) →
    xmllint --noout if available
    head-check each <loc> via Mozilla UA + cached results
    emit count summary to ~/Downloads/dare_sitemap_regen_<date>.md
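
The lastmod priority chain in `collect` could look roughly like the sketch below. It swaps `lxml` for the stdlib `html.parser` so the example runs without dependencies; `_MetaScan` and `pick_lastmod` are hypothetical names, and the git and mtime values are assumed to be looked up by the caller and passed in as strings.

```python
from __future__ import annotations

import json
from html.parser import HTMLParser


class _MetaScan(HTMLParser):
    """Collect JSON-LD dateModified and the article:modified_time meta tag."""

    def __init__(self) -> None:
        super().__init__()
        self.json_ld_date: str | None = None
        self.meta_date: str | None = None
        self._in_json_ld = False
        self._buf: list[str] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and a.get("type") == "application/ld+json":
            self._in_json_ld, self._buf = True, []
        elif tag == "meta" and a.get("property") == "article:modified_time":
            self.meta_date = a.get("content")

    def handle_data(self, data):
        if self._in_json_ld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                doc = json.loads("".join(self._buf))
            except ValueError:
                return  # malformed JSON-LD: fall through to the next source
            # Only a top-level object is handled; @graph arrays are ignored here.
            if isinstance(doc, dict) and self.json_ld_date is None:
                self.json_ld_date = doc.get("dateModified")


def pick_lastmod(html_text: str, git_date: str | None = None,
                 mtime_date: str | None = None) -> str | None:
    """First non-empty source wins: JSON-LD > meta tag > git > mtime."""
    scan = _MetaScan()
    scan.feed(html_text)
    for candidate in (scan.json_ld_date, scan.meta_date, git_date, mtime_date):
        if candidate:
            return candidate
    return None
```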

Pitfalls + traps (the meat of the sketch)

Category A — Lastmod & timestamps

Trap	What goes wrong	Mitigation
File mtime is unreliable	Branch switches, `cp -p` drops, repo moves (~/Downloads→~/Code 2026-05-07) all clobber mtime.	Prefer JSON-LD `dateModified` (article-embedded, survives copies). Fall back to `git log -1 --format=%cI --follow <file>`. mtime is last resort only.
Future timestamps	Clock skew between local Mac + CF Pages runners + git committer date can emit lastmod values ahead of `now()`. Google’s parser warns / drops these.	Clamp lastmod to `min(extracted, now() - 60s)`.
Timezone-naive strings	`2026-05-14T13:00` (no zone) is technically invalid per W3C datetime.	Always emit `+00:00` suffix; convert all sources to UTC before emit.
Per-file git log is slow	700+ articles × `git log -1` = a few seconds per run. Survivable but noisy on cron.	Cache last-known-git-mtime per file in `~/.cache/dare-sitemap/git-mtime.json`, invalidate on git HEAD change.
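
The timezone and future-clamp mitigations above can be sketched in one helper. `normalise_lastmod` is a hypothetical name, and treating zone-less strings as UTC is an assumption; a site authored in local time would want a different `assume_tz`.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone, tzinfo


def normalise_lastmod(raw: str, now: datetime | None = None,
                      assume_tz: tzinfo = timezone.utc) -> str:
    """Return a UTC, zone-suffixed, never-future W3C datetime string."""
    now = now or datetime.now(timezone.utc)
    text = raw.strip()
    if text.endswith("Z"):                # fromisoformat rejects "Z" before 3.11
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:             # zone-less input: assumed UTC
        parsed = parsed.replace(tzinfo=assume_tz)
    parsed = parsed.astimezone(timezone.utc)
    ceiling = now - timedelta(seconds=60)  # clamp anything ahead of now - 60s
    return min(parsed, ceiling).replace(microsecond=0).isoformat()
```

Only a future-dated input ever lets the clock influence the output, so ordinary runs stay byte-stable from one run to the next.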

Category B — Image discovery

Trap	What goes wrong	Mitigation
Regex-based `<img>` parse	Multi-line attributes (`<img\n src="..."`), HTML entities, conditional comments, embedded SVG break naïve regex. The 2026-05-06 WP migration script lost an hour to this.	Use `lxml.html.parse` not regex. The investment compounds across audit + sitemap + 404-audit tools.
`/cdn-cgi/image/...` transforms in `<img src>`	Three Edge pages had `src="/cdn-cgi/image/format=auto,quality=85/wp-content/uploads/edge/X.jpg"`. Emitting that into a sitemap leaks Cloudflare’s edge transform path.	Resolve cdn-cgi prefixes back to canonical CDN URLs via a small regex: `/cdn-cgi/image/[^/]+/(.+)` → `<image_cdn_base>/<group>`. Same routine handles `format=auto,quality=85` variants.
Favicons & publisher logos as image entries	`<link rel="icon" href="...">` + JSON-LD `publisher.logo` reference `cropped-ziiiro-celeste.jpeg` / `snapshot.jpg` on every page. Sitemap would have 700+ duplicate entries.	Allowlist body-image extraction sources (`<img src>` inside `.article-body`, `og:image`, JSON-LD `image` array). Skip favicon, publisher.logo, twitter:image (usually a dup of og:image).
Base64 inline images	Any `<img src="data:image/...">` shouldn’t appear in sitemap (it’s not a URL).	Drop any `src` starting with `data:`.
`<picture>`/`srcset` multi-source	Modern articles may use `<picture>` with multiple `<source srcset="X.webp 1x, X@2x.webp 2x">`. Each is a separate URL.	Pick highest-resolution source per `<picture>` block (largest descriptor), emit once.
Cross-domain image references	A dare article that references an `audreyinc.com` image — should that appear in dare’s sitemap? Google’s spec allows cross-domain image refs in a sitemap, but it’s a weak ownership signal.	Default: keep but flag in the report. Add `--strict-same-origin` opt-in flag for cohorts where it matters.
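
A sketch of three of the image rules above: dropping `data:` URIs, resolving `/cdn-cgi/image/...` paths back to the CDN, and picking the largest `srcset` candidate. `resolve_image` and `best_from_srcset` are hypothetical names. The srcset split assumes candidates are separated by a comma followed by whitespace, which keeps commas inside `format=auto,quality=85` intact but will not split a srcset written without spaces.

```python
from __future__ import annotations

import re

CDN_CGI = re.compile(r"^/cdn-cgi/image/[^/]+/(.+)$")


def resolve_image(src: str, image_cdn_base: str) -> str | None:
    """Return a sitemap-safe image URL, or None if the src should be dropped."""
    src = src.strip()
    if not src or src.startswith("data:"):
        return None                      # inline base64 is not a URL
    match = CDN_CGI.match(src)
    if match:                            # strip the edge-transform prefix
        return f"{image_cdn_base.rstrip('/')}/{match.group(1)}"
    return src


def best_from_srcset(srcset: str) -> str | None:
    """Pick the candidate with the largest width (w) or density (x) descriptor."""
    best_url, best_weight = None, -1.0
    for candidate in re.split(r",\s+", srcset.strip()):
        parts = candidate.strip().rstrip(",").split()
        if not parts:
            continue
        weight = 1.0                     # a bare URL counts as 1x
        if len(parts) > 1:
            try:
                weight = float(parts[1].rstrip("wx"))
            except ValueError:
                continue                 # unknown descriptor: skip candidate
        if weight > best_weight:
            best_url, best_weight = parts[0], weight
    return best_url
```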

Category C — URL canonicalisation

Trap	What goes wrong	Mitigation
`/foo/index.html` vs `/foo/`	Both serve the same content. Emitting both = duplicate-URL signal.	Always strip `/index.html` suffix; always trailing-slash for directory-style.
301-source URLs in sitemap	`/about/` 301s to `/`. Emitting `/about/` tells Google “this is a canonical URL” — contradicts the 301.	Parse `_redirects`; skip source entries from sitemap. Emit only redirect targets.
Query strings	`<image:loc>https://x.co/img?w=200&h=300</image:loc>` is invalid XML — `&` needs `&amp;`.	Use ElementTree’s `.text` setter (auto-escapes). Never string-concat XML.
Mixed http/https	Older WP exports sometimes have `http://` URLs alongside `https://`. Sitemap mixing the schemes signals confusion.	Force `canonical_base` scheme on all output.
Trailing whitespace in `<loc>`	`<loc> https://... </loc>` — some validators choke.	`.strip()` every extracted value before write.
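
The canonicalisation rules above, sketched under assumptions: `canonical_url` and `redirect_sources` are hypothetical names, and `_redirects` is assumed to follow the Cloudflare line format of `source destination [status]` with `#` comment lines.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(rel_path: str, canonical_base: str) -> str:
    """Map a repo-relative path to its canonical URL on canonical_base."""
    base = urlsplit(canonical_base.strip())
    path = "/" + rel_path.strip().lstrip("/")
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]      # /foo/index.html -> /foo/
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment and "." not in last_segment:
        path += "/"                            # directory-style: trailing slash
    # Scheme and host always come from canonical_base, never from the input.
    return urlunsplit((base.scheme, base.netloc, path, "", ""))


def redirect_sources(redirects_text: str) -> set:
    """Collect the source column of a _redirects file (paths to skip)."""
    sources = set()
    for line in redirects_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            sources.add(parts[0])
    return sources
```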

Category D — XML correctness

Trap	What goes wrong	Mitigation
BOM at file start	Some XML parsers stumble on a BOM in front of `<?xml?>`.	Open with `encoding="utf-8"` not `utf-8-sig`; verify first bytes.
CRLF endings	Cross-platform repos can mix CRLF/LF. Sitemap-validator tools sometimes report odd column numbers.	Hardcode `\n` line endings; when writing in text mode, pass `newline="\n"` to `open()` so the platform default is never substituted.
Hand-rolled XML	String concatenation of `<url><loc>...</loc>...</url>` breaks on first odd character.	Use `xml.etree.ElementTree` or `lxml.etree`. The 30 minutes invested here pays back forever.
Missing namespace declaration	If `<image:image>` appears without `xmlns:image="..."` on the root, Google ignores all image entries silently.	Always emit the full namespace block as the WP sitemap did: `<urlset xmlns="..." xmlns:image="...">`.
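
A sketch of the serialisation step with `xml.etree.ElementTree`, covering the escaping, BOM, line-ending and namespace rows above. `render_urlset` is a hypothetical name, and entries are assumed to arrive as `(loc, lastmod, [image_loc, ...])` tuples. Note that ElementTree only declares `xmlns:image` when at least one image element is present.

```python
import xml.etree.ElementTree as ET

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMG_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def render_urlset(entries) -> bytes:
    """Serialise entries to UTF-8 sitemap bytes (no BOM, LF endings)."""
    ET.register_namespace("", SM_NS)        # default namespace, no prefix
    ET.register_namespace("image", IMG_NS)  # emitted as xmlns:image
    root = ET.Element(f"{{{SM_NS}}}urlset")
    for loc, lastmod, images in sorted(entries, key=lambda e: e[0]):
        url = ET.SubElement(root, f"{{{SM_NS}}}url")
        ET.SubElement(url, f"{{{SM_NS}}}loc").text = loc.strip()
        ET.SubElement(url, f"{{{SM_NS}}}lastmod").text = lastmod
        for image_loc in images:
            image = ET.SubElement(url, f"{{{IMG_NS}}}image")
            # .text assignment escapes & and < at serialisation time
            ET.SubElement(image, f"{{{IMG_NS}}}loc").text = image_loc
    ET.indent(root)                         # Python 3.9+; uses "\n" only
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

Returning bytes and writing them with `Path.write_bytes` sidesteps text-mode newline translation entirely.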

Category E — Size & sharding

Trap	What goes wrong	Mitigation
Single sitemap > 50,000 URLs	Hard Google limit. Whole sitemap dropped silently.	dare won’t hit this (700 articles) but dogwood’s photo archive could. Auto-shard at 45k URLs → `post-sitemap-1.xml`, `post-sitemap-2.xml`; update index.
Single sitemap > 50 MB uncompressed	Same rejection class.	Same sharding logic; shard at 45k URLs or 45 MB, whichever is reached first.
Sitemap-index too deep	Sitemaps-of-sitemaps-of-sitemaps fails some crawlers.	Max one level of indirection: index → shards.
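
The sharding rule could be sketched as a greedy packer over already-serialised `<url>` blocks. `shard_blocks` and `shard_names` are hypothetical names; the byte count here ignores the XML declaration and `<urlset>` wrapper, which the 5 MB gap between 45 MB and the 50 MB limit is assumed to absorb.

```python
def shard_blocks(blocks, max_urls=45_000, max_bytes=45 * 1024 * 1024):
    """Split serialised <url> blocks into shards under both thresholds."""
    shards, current, current_bytes = [], [], 0
    for block in blocks:
        too_many = len(current) >= max_urls
        too_big = bool(current) and current_bytes + len(block) > max_bytes
        if too_many or too_big:          # close the shard before adding
            shards.append(current)
            current, current_bytes = [], 0
        current.append(block)
        current_bytes += len(block)
    if current:
        shards.append(current)
    return shards


def shard_names(stem, count):
    """Unsharded output keeps the plain name; shards are numbered from 1."""
    if count <= 1:
        return [f"{stem}.xml"]
    return [f"{stem}-{i}.xml" for i in range(1, count + 1)]
```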

Category F — Robots.txt / _redirects coherence

Trap	What goes wrong	Mitigation
Disallowed URLs in sitemap	`robots.txt` says `Disallow: /wp-admin/` but sitemap lists `/wp-admin/foo`. Google interprets this as a contradictory signal and may demote the whole sitemap.	Parse `robots.txt` (the Cloudflare-managed block + custom rules); skip any URL covered by `Disallow:` patterns.
Redirect targets that 301 again	`/old/` → `/new/` → `/newer/`. Sitemap emitting `/new/` is still wrong.	Follow redirect chain to final 200; emit only the terminal URL. Cache the resolution.
Sitemap referenced from robots.txt but pointing wrong	`Sitemap: https://www.dare.co.uk/sitemap.xml` line in robots.txt is separately maintained; if we rename or move the index, both need updating.	Script verifies the robots.txt `Sitemap:` line matches its own output path. Warns (doesn’t auto-fix) on mismatch.
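
A sketch of the coherence checks above using the stdlib `urllib.robotparser`. `robots_filter` and `terminal_target` are hypothetical names. Two caveats are assumptions of this sketch: the stdlib parser does prefix matching only (no `*` or `$` wildcards), and the chain resolution walks the static `_redirects` map rather than following live HTTP responses to a final 200.

```python
from urllib.robotparser import RobotFileParser


def robots_filter(robots_txt, urls, user_agent="Googlebot"):
    """Return (allowed_urls, sitemap_lines) for a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = [u for u in urls if parser.can_fetch(user_agent, u)]
    return allowed, parser.site_maps() or []   # site_maps() needs Python 3.8+


def terminal_target(redirects, path, max_hops=10):
    """Follow source -> target pairs until a path that no longer redirects."""
    seen = set()
    while path in redirects and path not in seen and len(seen) < max_hops:
        seen.add(path)
        path = redirects[path]
    return path
```

Comparing the returned sitemap lines against the script's own index URL gives the warn-on-mismatch check without touching robots.txt.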

Category G — Idempotency & cron

Trap	What goes wrong	Mitigation
Run-to-run jitter from `datetime.now()`	If lastmod gets stamped from `now()`, every run rewrites every entry. Git churn explodes.	Lastmod is derived purely from inputs (JSON-LD / git log / mtime), never from `now()`.
Map ordering differences between Python versions	Dict iteration order shouldn’t matter (Python 3.7+ preserves insertion), but a refactor could re-introduce non-determinism.	Always `sorted(entries, key=lambda e: e.canonical)` before serialise.
Cron writes every run even when no content changed	Daily git commit even when zero diff = noise.	`diff` against current `repo_root/<sitemap>.xml`; skip write + commit if byte-identical. Emit summary report regardless (per always-publish).
Cached HEAD results going stale	A 7-day cache misses a CDN URL that just started 404ing.	TTL of 1-7 days, configurable. CDN URLs are stable by design (immutable per cache-control headers from `dare_s3_to_r2_promote.py`), so longer is fine.
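
The skip-if-identical rule above could be sketched as a byte comparison before writing. `write_if_changed` is a hypothetical name; the temp-file-then-rename step is an addition of this sketch so a crash mid-write cannot leave a truncated sitemap in the repo.

```python
from pathlib import Path


def write_if_changed(path, new_bytes: bytes) -> bool:
    """Write new_bytes to path only when content differs; report if written."""
    path = Path(path)
    if path.exists() and path.read_bytes() == new_bytes:
        return False                     # byte-identical: no write, no commit
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(new_bytes)
    tmp.replace(path)                    # atomic rename on the same filesystem
    return True
```

The boolean return lets the caller decide whether to stage a commit while still emitting the run report either way.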

Category H — Portfolio portability

Trap	What goes wrong	Mitigation
Hardcoded `dare.co.uk` strings	Every transfer to dogwood / audrey / client work requires sed-replace + careful diff. The whole point of the script breaks.	All site-specific values in config. Script body never mentions a site name.
Per-portfolio image-CDN routing differs	dare: `images.dare.co.uk`. dogwood: probably `images.dogwood.house` (TBD). audrey: `images.audreyinc.com` (TBD). Some sites may use a path-based CDN (`example.com/cdn/...`) instead of a hostname.	Config supports both `image_cdn_base` (hostname) and `image_cdn_prefix` (path-rewrite rule). Default to the first; clients with weird setups override.
Different sitemap structures per platform	Squarespace/Shopify already emit their own sitemap; replacing it is the wrong move. Cloudflare Pages static site = this script’s natural home.	Script refuses to run if `canonical_base` resolves to a non-`pages.dev` / non-CF-Worker origin (or override with `--force`).
Per-site article-vs-page split	dare uses the WP convention (post/page); dogwood may not have an analogous distinction.	`post_paths` / `page_paths` are config globs. A site with no page distinction sets `page_paths: []`; script emits single sitemap.

Category I — CF / bot interactions

Trap	What goes wrong	Mitigation
`Python-urllib/<ver>` UA blocked by CF bot management	HEAD check returns 403 from CF, so the script wrongly concludes the image is 404. Documented in `feedback_python_urllib_ua_cloudflare.md`; bit `dare_s3_to_r2_promote.py` on 2026-05-11.	Always send a Mozilla-ish UA on probes: `dare-pipeline-sitemap/1.0 (+Mozilla/5.0)`.
Hotlink protection on the image CDN	`images.dare.co.uk` 403s requests with non-`dare.co.uk` Referer. Sitemap probes from local dev → 403 false-negative.	Probe with no Referer set (CF passes those; verified 2026-05-14 by cf-access). Document the constraint clearly.
Service-token-gated source sites	A staging or preview surface behind CF Access: the script can’t HEAD-check from CI without injecting Access creds.	`--skip-image-validation` flag for offline runs. Otherwise reach for the existing `cf-access` wrapper (`~/bin/cf-access`).
Headless probe = soft-404 mistaken for success	Per `feedback_screenshot_error_page_guards.md`, CF Pages returns 200 + custom-error-body for unknown paths. HEAD-check sees 200; script trusts it.	Match response body against known error-page fingerprints if probing pages (not images); this needs a GET, since a HEAD response carries no body. Per-portfolio fingerprint list in config.
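
A sketch of a probe that follows the rules in this table: explicit UA, no Referer, and a body fingerprint check for pages. `build_probe`, `looks_like_soft_404` and `probe` are hypothetical names, and the fingerprint strings are assumed to come from per-site config. Network errors other than HTTP status errors are left to propagate.

```python
import urllib.error
import urllib.request

PROBE_UA = "dare-pipeline-sitemap/1.0 (+Mozilla/5.0)"


def build_probe(url, method="HEAD", user_agent=PROBE_UA):
    """Build a request with an explicit UA and deliberately no Referer."""
    return urllib.request.Request(
        url, method=method, headers={"User-Agent": user_agent}
    )


def looks_like_soft_404(body, fingerprints):
    """True when the body contains any known error-page fingerprint."""
    lowered = body.lower()
    return any(fp.lower() in lowered for fp in fingerprints)


def probe(url, fingerprints=(), timeout=10):
    """Return (status, ok). Uses GET only when a body check is requested."""
    method = "GET" if fingerprints else "HEAD"
    body = ""
    try:
        with urllib.request.urlopen(build_probe(url, method), timeout=timeout) as resp:
            status = resp.status
            if fingerprints:
                body = resp.read(65536).decode("utf-8", "replace")
    except urllib.error.HTTPError as exc:
        return exc.code, False           # 403/404 etc. arrive as exceptions
    return status, status == 200 and not looks_like_soft_404(body, fingerprints)
```

A 403 is reported as its own status rather than folded into "missing", which keeps a bot-management block distinguishable from a real 404 in the report.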

Category J — Image-sitemap-extension subtleties

Trap	What goes wrong	Mitigation
Same image referenced from N pages	Bare `<image:image>` block under each page’s `<url>` is correct (it’s a page-image relationship, not a global-image listing).	Don’t dedupe across pages. Per-page repetition is the spec.
`<image:caption>` from alt text	Useful signal for image search. Long alt text = good; missing alt = empty caption.	Include `<image:caption>` only when `alt` is non-empty and meaningful. Skip noisy alts (single-word, generic).
`<image:license>` for legal cohorts	The DARE archive has historical images with mixed licensing. Asserting a license per image without provenance is risky.	Omit `<image:license>` unless config provides a default + explicit allowlist.

What the report (the .md emitted each run) should say

Following the “so what + what next” report structure:

TL;DR — N pages indexed, N images cited, sitemap size before/after, run duration
What changed — diff summary vs the previous run (new pages, dropped pages, lastmod-only updates)
Watch items — pages with no canonical URL, pages with no images, pages with future-dated lastmod, pages still pointing at dead hosts
Recommendations — orphan articles (in tree but not linked from anywhere), image-search candidates (pages with strong images + weak traffic per GSC join), etc.

The report becomes the storytelling-substrate artefact for sitemap hygiene over time (per feedback_toolkit_as_storytelling_substrate.md).
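
The "What changed" section could be derived from two runs as in the sketch below. `diff_runs` and `what_changed_lines` are hypothetical names, and each run is assumed to be available as a `{loc: lastmod}` mapping parsed from the previous and the freshly rendered sitemap.

```python
def diff_runs(previous, current):
    """Compare two {loc: lastmod} mappings from consecutive runs."""
    prev_urls, curr_urls = set(previous), set(current)
    return {
        "new": sorted(curr_urls - prev_urls),
        "dropped": sorted(prev_urls - curr_urls),
        "lastmod_only": sorted(
            u for u in prev_urls & curr_urls if previous[u] != current[u]
        ),
    }


def what_changed_lines(diff):
    """Render one summary line per bucket for the report."""
    return [f"{label}: {len(urls)}" for label, urls in diff.items()]
```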

Wire-up

Surface	How
Manual run	`~/bin/dare_sitemap_regen.py --site dare`
CI / cron	weekly via GHA in `dare-pipeline` repo; PR-only on diff (no auto-commit to main without staging-first per current discipline)
Devreports	auto-publish via `dare_dev_reports_refresh.sh`; pattern added to `REPORT_PATTERNS`
Memory	`project_*_sitemap_regen.md` per portfolio site, capturing per-site config + decisions
1Password	none — no secrets required for the local walk; HEAD probes use no auth on public CDNs

Compounding across portfolio

Site	Status	Notes
`dare.co.uk`	Genesis. ~700 articles + ~12 pages. WP-era sitemap to replace.	First-mover, edge cases will drive script design.
`dogwood.house`	Future. NYC/CT/Hamptons service. Will have photo galleries + service-area pages.	Sharding likely needed once gallery archive grows.
`audreyinc.com`	Future. Shopify-backed; Shopify emits its own sitemap, so this script applies only to the agent-discoverability gift-guide pages at `/gift-guide/*` (not the product catalog).	Hybrid: Shopify sitemap + this script’s gift-guide sitemap, linked from the index.
`dansellars.com`	Future. Personal site.	Standard pattern.
Client engagements	Future. Per-engagement YAML config.	Same binary, different config; same compounding model as the existing audit + 404-audit + thumbnailer toolkit.

Attribution & references

Where the design decisions in this sketch come from.

Specs (load-bearing)

Sitemap protocol — https://www.sitemaps.org/protocol.html Canonical XML schema, the 50,000-URL and 50 MB-uncompressed per-file limits, and the <sitemapindex>-of-<sitemap> shape for index files. Source of the namespace strings and the <lastmod> W3C-datetime requirement.
Sitemap image extension — https://www.sitemaps.org/protocol.html#image The <image:image> namespace (xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"), required <image:loc> child, optional <image:caption> / <image:title> / <image:license> / <image:geo_location>.
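robots.txt — RFC 9309, https://datatracker.ietf.org/doc/html/rfc9309 §2.2.4 (Other Records) lets crawlers interpret records outside the protocol, naming Sitemaps as the example, and requires that parsing them MUST NOT interfere with the explicitly defined records (basis for the Content-Signal-isn’t-broken finding earlier today). The Sitemap: line used at the end of robots.txt falls under that same section.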

Google guidance (consumer of the sitemap)

Sitemaps overview — https://developers.google.com/search/docs/crawling-indexing/sitemaps/overview Establishes that Google treats sitemaps as hints, not directives; that <changefreq> and <priority> are largely ignored (well-documented across years of Google Search Central guidance); and that conflicts with robots.txt degrade the sitemap’s overall trust.
Build a sitemap — https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap Specifies UTF-8 encoding requirement, URL-entity-encoding for & etc., and the canonical-URL discipline (no 301 sources).
Image sitemap docs — https://developers.google.com/search/docs/crawling-indexing/sitemaps/image-sitemaps Confirms one image-block per page (don’t dedupe across pages), and that <image:caption> from alt is a useful signal.
Sitemap formats supported — https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap#text XML, RSS/Atom, and plain text formats are accepted; XML is the only one supporting the image extension.

Cloudflare-specific

Workers static assets — https://developers.cloudflare.com/workers/static-assets/ The deploy model dare.co.uk uses (Workers-with-Assets, not classic Pages); .assetsignore semantics (referenced in this session’s 72 MB-cut commit 6252b77e); deploy-by-wrangler versions upload.
Cloudflare Pages — https://developers.cloudflare.com/pages/ The model dare-dev-reports uses (publish surface for devreports.dare.co.uk); --branch=main as the production-promotion mechanism.
Cloudflare R2 — https://developers.cloudflare.com/r2/ The bucket model used for dare-images, the now-retired edge, and the dogwood/audrey successors. S3-compatible API used by dare_s3_to_r2_promote.py and today’s dare_wp_uploads_to_r2.py.
Cloudflare Image Resizing (cdn-cgi/image/) — https://developers.cloudflare.com/images/transform-images/ The /cdn-cgi/image/format=auto,quality=85/<source> transform syntax encountered in the 3 Tier 1 articles before today’s rewrite. Confirms the script should resolve these back to the canonical CDN URL before emitting into a sitemap.
Cloudflare Cache — https://developers.cloudflare.com/cache/concepts/cache-control/ The Cache-Control: public, max-age=31536000, immutable headers dare_s3_to_r2_promote.py sets — basis for the “long HEAD cache TTL is fine” decision (URLs are stable by content-addressed-naming convention).

Internal memory references (in-house lessons)

feedback_python_urllib_ua_cloudflare.md — CF bot management 403s default Python-urllib/<ver>. Drives the Mozilla-UA discipline on every HEAD probe.
feedback_referrer_policy_for_cross_origin_images.md — images.dare.co.uk hotlink-protects against non-dare.co.uk Referer. Drives the no-Referer probe choice.
feedback_screenshot_error_page_guards.md — CF Pages serves 200 + custom-error-body for unknown paths; HEAD-200 alone isn’t proof. Drives the optional body-fingerprint check.
feedback_seo_image_naming_convention.md — five rules for image filenames on CDNs; sitemap should never rename images mid-flight (basename-mode discipline from dare_s3_to_r2_promote_basename_2026-05-11).
feedback_check_devreports_before_infra_gap_analysis.md — sweep ~/Downloads/dare_* + ~/bin/dare_* + memory before re-deriving. Verified there’s no existing dare_sitemap_* script before starting this build.
feedback_save_vs_find_alignment.md — REPORT_PATTERNS is the find-time query; emit basenames that match (the *_sitemap_regen_* pattern this script will need).
feedback_park_with_resume_conditions.md — resume conditions structure (named below).
feedback_layered_guardrail_stack.md — credential layering (no secrets needed here; layer 1 alone suffices for a local walk + public HEAD probes).

Tooling references (the existing toolkit this script joins)

~/bin/dare_s3_to_r2_promote.py — argparse + dry-run-default + dated-report shape. Closest sibling.
~/bin/dare_wp_uploads_to_r2.py — today’s build; same shape, smaller scope.
~/bin/dare_dev_reports_publish.py — the publishing micro-service; REPORT_PATTERNS allowlist this script must join.
~/bin/dare_404_audit.py — repo-walk-by-glob pattern, HEAD-check pattern.
~/bin/dare_dev_reports_refresh.sh — the cron-wrapper pattern; sentinel-based op-injection re-exec (not needed here since no secrets, but worth matching shape for future portfolio variants that may need GHA secrets).

Resume conditions

Build when one of these triggers:
- post-sitemap.xml reaches embarrassment threshold (currently 506 dead entries; if image-search referral traffic to dare ever materialises, the rotting sitemap becomes a measurable cost)
- A second portfolio site (dogwood or audrey) needs a sitemap and we’d otherwise hand-roll it (build once, run twice)
- A client engagement requires sitemap generation from a static repo (commercial trigger)
- Cumulative hand-edited entries (the 3 fixed today + any future) exceed 20 (manual cost > build cost)

Until then: parked. Per feedback_park_with_resume_conditions.md.

Build once, run across the portfolio. The proper resolution to a 535-entry sitemap is not 535 edits — it’s the function that emits 535 entries from the source of truth.

Source: dare_sitemap_regen_sketch_2026-05-14.md · Rendered 2026-07-08 01:47 UTC

Built with — component scripts

seo_render_html.py — wraps the source .md in the dash.gf.cx design language (+ anchor_enricher.py for inline-link promotion & rollover thumbnails)
dare_dev_reports_publish.py — bundles the day’s reports into the catalog and ships to dash.gf.cx/reports