dare_sitemap_regen.py — architecture sketch · 2026-05-14

Parked workstream. Portfolio-portable static-site sitemap regenerator that emits valid sitemap.xml + post-sitemap.xml + page-sitemap.xml from a static repo. Designed to retire the WP-era frozen-in-amber post-sitemap.xml (last edited 2026-05-07; carries 506 dead http://edge.dare.co.uk/wp/* entries). Lives in ~/bin/ so the same script handles dare → dogwood → audrey → client work via config swap.

Genesis: today’s Edge-cohort migration left the sitemap stale (3 entries fixed by hand, commit 24b8736e). Doing that 506 more times by hand is a non-starter; generating it from current repo state is the right substrate.

What it does

One walk of the repo, three outputs:

  1. sitemap.xml — the index. References the children with current <lastmod> values.
  2. post-sitemap.xml — article-tree <url> blocks (one per migrated article), each carrying <image:image> children for body imagery.
  3. page-sitemap.xml — top-level static pages (/contact/, /privacy-policy/, policy pages, the four section listings).

Auto-shards if >= 45,000 URLs or >= 45 MB. Auto-skips if no diff vs deployed sitemap. Emits a dated report to ~/Downloads/dare_sitemap_regen_<date>.md per the always-publish rule.

Inputs (config-driven, never hardcoded)

A per-site YAML at ~/.config/dare-sitemap/<site>.yaml:

repo_root:       /Users/dansellars/Code/dare-co-uk
canonical_base:  https://www.dare.co.uk
image_cdn_base:  https://images.dare.co.uk
excludes:
  - wp-content/**
  - wp-includes/**
  - "**/*.bak-*"
  - "**/index.html.bak-*"
  - error-404/**
  - dare_migrate_failures_*.txt
redirect_file:   _redirects        # source-column entries get skipped from sitemap
post_paths:                         # globs that go into post-sitemap.xml
  - architecture/*/index.html
  - cinema/*/index.html
  - methods-of-business-design/*/index.html
  - culture-means-thriving-teams/*/index.html
  - field-notes-from-business-design/*/index.html
  - daring-acts/*/index.html
  - observations/*/index.html
  - books/*/index.html
  - brands/*/index.html
  - albums/*/index.html
  - photography/*/index.html
  - users/*/index.html
  - archive/*/index.html
page_paths:                         # globs that go into page-sitemap.xml
  - contact.html
  - privacy-policy/**
  - anti-spam-policy/**
  - dmca-policy/**
  - sitemap/**
  - "*/archive/index.html"          # section root archives
  - methods-of-business-design/index.html
  - culture-means-thriving-teams/index.html
  - field-notes-from-business-design/index.html
  - daring-acts/index.html
homepage:        index.html         # treated as a top-level "page"

Dogwood / audrey / client engagements each get their own YAML; the script is otherwise identical.
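
The exclude / post / page globs above could be applied roughly as follows. This is a minimal sketch, not the script's actual matcher: it assumes fnmatch-style matching (where `*` also crosses `/`, which is looser than a true glob), and it assumes page globs are checked before post globs, because a path such as cinema/archive/index.html matches both `cinema/*/index.html` and `*/archive/index.html`. The names `classify`, `matches_any` and the trimmed `CONFIG` dict are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Trimmed-down stand-in for the parsed per-site YAML (illustrative only).
CONFIG = {
    "excludes": ["wp-content/**", "**/*.bak-*", "error-404/**"],
    "post_paths": ["cinema/*/index.html", "books/*/index.html"],
    "page_paths": ["contact.html", "*/archive/index.html", "daring-acts/index.html"],
}


def matches_any(rel_path: str, patterns: list) -> bool:
    """True if the repo-relative POSIX path matches at least one pattern."""
    return any(fnmatchcase(rel_path, pattern) for pattern in patterns)


def classify(rel_path: str, config: dict) -> Optional[str]:
    """Return "page", "post", or None (excluded or not listed)."""
    if matches_any(rel_path, config["excludes"]):
        return None
    # Assumption: page globs win over post globs when both match.
    if matches_any(rel_path, config["page_paths"]):
        return "page"
    if matches_any(rel_path, config["post_paths"]):
        return "post"
    return None
```

Whichever precedence is chosen, it seems worth stating in the config comments, since the section-archive pages otherwise land in whichever sitemap is evaluated first.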

The walk (one function per concern)

collect(config) →
    for each html_path in walk(repo, includes=post+page, excludes):
        parsed = lxml.html.parse(html_path)
        canonical = extract_canonical(parsed)        # <link rel="canonical">
        lastmod   = extract_lastmod(html_path, parsed)
            ↪ priority: JSON-LD dateModified > article:modified_time
                       > git log -1 --format=%cI -- <file> > file mtime
        images    = extract_images(parsed)           # see classify_images() below
        yield Entry(canonical, lastmod, images, is_post=in_post_globs)

render(entries) →
    sort by canonical URL                            # stable output
    split into post/page/maybe-image-only sitemaps
    serialise via ElementTree (escapes &, no BOM, LF endings)
    write to repo_root/<sitemap>.xml

validate(written) →
    xmllint --noout if available
    head-check each <loc> via Mozilla UA + cached results
    emit count summary to ~/Downloads/dare_sitemap_regen_<date>.md
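
The lastmod priority chain in collect() might look something like this. A sketch under assumptions: `git_lastmod` and `pick_lastmod` are hypothetical helper names, the JSON-LD and meta-tag values are assumed to have been extracted already, and the git call mirrors the `git log -1 --format=%cI` command quoted above.

```python
import subprocess
from typing import Optional


def git_lastmod(repo_root: str, rel_path: str) -> Optional[str]:
    """Committer date (strict ISO 8601) of the last commit touching rel_path."""
    result = subprocess.run(
        ["git", "-C", repo_root, "log", "-1", "--format=%cI", "--", rel_path],
        capture_output=True,
        text=True,
        check=False,
    )
    stamp = result.stdout.strip()
    return stamp if result.returncode == 0 and stamp else None


def pick_lastmod(*candidates: Optional[str]) -> Optional[str]:
    """Return the first non-empty candidate; callers pass them in priority order."""
    for candidate in candidates:
        if candidate and candidate.strip():
            return candidate.strip()
    return None
```

A call such as `pick_lastmod(jsonld_date, meta_modified, git_lastmod(root, path), mtime_iso)` evaluates the git lookup eagerly; a real implementation would probably defer it until the cheaper sources have come back empty.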

Pitfalls + traps (the meat of the sketch)

Category A — Lastmod & timestamps

| Trap | What goes wrong | Mitigation |
| --- | --- | --- |
| File mtime is unreliable | Branch switches, cp -p drops, repo moves (~/Downloads→~/Code 2026-05-07) all clobber mtime. | Prefer JSON-LD dateModified (article-embedded, survives copies). Fall back to git log -1 --format=%cI --follow <file>. mtime is last resort only. |
| Future timestamps | Clock skew between local Mac + CF Pages runners + git committer date can emit lastmod values ahead of now(). Google’s parser warns / drops these. | Clamp lastmod to min(extracted, now() - 60s). |
| Timezone-naive strings | 2026-05-14T13:00 (no zone) is technically invalid per W3C datetime. | Always emit +00:00 suffix; convert all sources to UTC before emit. |
| Per-file git log is slow | 700+ articles × git log -1 = a few seconds per run. Survivable but noisy on cron. | Cache last-known-git-mtime per file in ~/.cache/dare-sitemap/git-mtime.json, invalidate on git HEAD change. |
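
The UTC conversion and future-timestamp clamp could be combined in one helper. A minimal sketch, assuming zone-less inputs are treated as UTC (the source does not say how they should be interpreted) and that `now` is injected by the caller so the function stays deterministic; `normalise_lastmod` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def normalise_lastmod(raw: str, now: datetime) -> str:
    """Parse an ISO-8601 string, convert to UTC, clamp to now - 60s."""
    text = raw.strip()
    if text.endswith("Z"):
        # Older Pythons reject a trailing "Z" in fromisoformat().
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: a timestamp with no zone is taken to be UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    parsed = parsed.astimezone(timezone.utc)
    ceiling = now.astimezone(timezone.utc) - timedelta(seconds=60)
    clamped = min(parsed, ceiling)
    # isoformat() on a UTC-aware datetime ends in +00:00.
    return clamped.replace(microsecond=0).isoformat()
```

The clamp only touches values that are already in the future, so ordinary entries still derive purely from their inputs and do not change from run to run.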

Category B — Image discovery

| Trap | What goes wrong | Mitigation |
| --- | --- | --- |
| Regex-based <img> parse | Multi-line attributes (<img\n src="..."), HTML entities, conditional comments, embedded SVG break naïve regex. The 2026-05-06 WP migration script lost an hour to this. | Use lxml.html.parse not regex. The investment compounds across audit + sitemap + 404-audit tools. |
| /cdn-cgi/image/... transforms in <img src> | Three Edge pages had src="/cdn-cgi/image/format=auto,quality=85/wp-content/uploads/edge/X.jpg". Emitting that into a sitemap leaks Cloudflare’s edge transform path. | Resolve cdn-cgi prefixes back to canonical CDN URLs via a small regex: /cdn-cgi/image/[^/]+/(.+) → <image_cdn_base>/<group>. Same routine handles format=auto,quality=85 variants. |
| Favicons & publisher logos as image entries | <link rel="icon" href="..."> + JSON-LD publisher.logo reference cropped-ziiiro-celeste.jpeg / snapshot.jpg on every page. Sitemap would have 700+ duplicate entries. | Allowlist body-image extraction sources (<img src> inside .article-body, og:image, JSON-LD image array). Skip favicon, publisher.logo, twitter:image (usually a dup of og:image). |
| Base64 inline images | Any <img src="data:image/..."> shouldn’t appear in sitemap (it’s not a URL). | Drop any src starting with data:. |
| <picture>/srcset multi-source | Modern articles may use <picture> with multiple <source srcset="X.webp 1x, X@2x.webp 2x">. Each is a separate URL. | Pick highest-resolution source per <picture> block (largest descriptor), emit once. |
| Cross-domain image references | A dare article that references an audreyinc.com image: should that appear in dare’s sitemap? Google’s spec allows cross-domain image refs in a sitemap, but it’s a weak ownership signal. | Default: keep but flag in the report. Add --strict-same-origin opt-in flag for cohorts where it matters. |
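
Two of the image rules above (cdn-cgi rewriting plus data: dropping, and picking the largest srcset candidate) are small enough to sketch. Assumptions, stated loudly: the function names are hypothetical; srcset candidates are split on a comma followed by whitespace so that commas inside cdn-cgi option strings survive, which is a simplification of the full HTML srcset grammar; and a candidate without a descriptor is treated as 1x.

```python
import re
from typing import Optional

CDN_CGI_RE = re.compile(r"^/cdn-cgi/image/[^/]+/(.+)$")


def resolve_image_src(src: str, image_cdn_base: str) -> Optional[str]:
    """Drop data: URIs; map /cdn-cgi/image/<opts>/<path> to <base>/<path>."""
    src = src.strip()
    if not src or src.startswith("data:"):
        return None
    match = CDN_CGI_RE.match(src)
    if match:
        return image_cdn_base.rstrip("/") + "/" + match.group(1)
    return src


def pick_largest_from_srcset(srcset: str) -> Optional[str]:
    """Return the URL with the largest descriptor (e.g. 2x or 800w)."""
    best_url, best_weight = None, -1.0
    for candidate in re.split(r",\s+", srcset.strip()):
        parts = candidate.split()
        if not parts:
            continue
        descriptor = parts[1] if len(parts) > 1 else "1x"
        try:
            weight = float(descriptor[:-1])  # strip the trailing x / w
        except ValueError:
            continue  # unrecognised descriptor: skip rather than guess
        if weight > best_weight:
            best_url, best_weight = parts[0], weight
    return best_url
```

The chosen srcset URL would then go through `resolve_image_src` like any other src, so the cdn-cgi and data: rules are applied in one place.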

Category C — URL canonicalisation

| Trap | What goes wrong | Mitigation |
| --- | --- | --- |
| /foo/index.html vs /foo/ | Both serve the same content. Emitting both = duplicate-URL signal. | Always strip /index.html suffix; always trailing-slash for directory-style. |
| 301-source URLs in sitemap | /about/ 301s to /. Emitting /about/ tells Google “this is a canonical URL”, which contradicts the 301. | Parse _redirects; skip source entries from sitemap. Emit only redirect targets. |
| Query strings | <image:loc>https://x.co/img?w=200&h=300</image:loc> is invalid XML: & needs &amp;. | Use ElementTree’s .text setter (auto-escapes). Never string-concat XML. |
| Mixed http/https | Older WP exports sometimes have http:// URLs alongside https://. Sitemap mixing the schemes signals confusion. | Force canonical_base scheme on all output. |
| Trailing whitespace in <loc> | Some validators choke on <loc> https://... </loc>. | .strip() every extracted value before write. |
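
The canonicalisation rules above (strip whitespace, strip /index.html, trailing slash for directory-style paths, force the configured scheme) fit in one function. A minimal sketch: `canonicalise` is a hypothetical name, "directory-style" is assumed to mean "last path segment contains no dot", and relative inputs are assumed to belong to the canonical_base host.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalise(url: str, canonical_base: str) -> str:
    """Normalise a page URL for use in <loc>."""
    base = urlsplit(canonical_base)
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]  # keep the trailing slash
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment and "." not in last_segment:
        path += "/"  # assumption: extension-less paths are directories
    netloc = parts.netloc or base.netloc
    # Scheme always comes from canonical_base; any fragment is dropped.
    return urlunsplit((base.scheme, netloc, path, parts.query, ""))
```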

Category D — XML correctness

| Trap | What goes wrong | Mitigation |
| --- | --- | --- |
| BOM at file start | Some XML parsers stumble on a UTF-8 BOM (bytes EF BB BF) in front of <?xml?>. | Open with encoding="utf-8" not utf-8-sig; verify first bytes. |
| CRLF endings | Cross-platform repos can mix CRLF/LF. Sitemap-validator tools sometimes report odd column numbers. | Hardcode \n line endings: write bytes, or open() with newline="\n", so no platform newline translation happens on write. |
| Hand-rolled XML | String concatenation of <url><loc>...</loc>...</url> breaks on first odd character. | Use xml.etree.ElementTree or lxml.etree. The 30 minutes invested here pays back forever. |
| Missing namespace declaration | If <image:image> appears without xmlns:image="..." on the root, Google ignores all image entries silently. | Always emit the full namespace block as the WP sitemap did: <urlset xmlns="..." xmlns:image="...">. |
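
A sketch of what ElementTree-based serialisation could look like, covering escaping, namespace declarations, and BOM-free bytes. The namespace URIs are the published sitemap 0.9 and Google image-sitemap 1.1 ones; `render_urlset` and the `(loc, lastmod, images)` tuple shape are assumptions made for illustration, and `xml_declaration=True` needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMG_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Registering prefixes makes the output use xmlns= and xmlns:image=
# instead of auto-generated ns0/ns1 prefixes.
ET.register_namespace("", SM_NS)
ET.register_namespace("image", IMG_NS)


def render_urlset(entries) -> bytes:
    """entries: iterable of (loc, lastmod, [image_url, ...]) tuples."""
    root = ET.Element(f"{{{SM_NS}}}urlset")
    for loc, lastmod, images in sorted(entries, key=lambda e: e[0]):
        url = ET.SubElement(root, f"{{{SM_NS}}}url")
        ET.SubElement(url, f"{{{SM_NS}}}loc").text = loc.strip()
        ET.SubElement(url, f"{{{SM_NS}}}lastmod").text = lastmod
        for image_url in images:
            image = ET.SubElement(url, f"{{{IMG_NS}}}image")
            # Assigning .text lets the serialiser escape & as &amp;.
            ET.SubElement(image, f"{{{IMG_NS}}}loc").text = image_url.strip()
    # Returns UTF-8 bytes with an XML declaration and no BOM.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

Writing the returned bytes with `Path.write_bytes` sidesteps newline translation entirely, since no text-mode layer is involved.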

Category E — Size & sharding

| Trap | What goes wrong | Mitigation |
| --- | --- | --- |
| Single sitemap > 50,000 URLs | Hard Google limit. Whole sitemap dropped silently. dare won’t hit this (700 articles) but dogwood’s photo archive could. | Auto-shard at 45k URLs → post-sitemap-1.xml, post-sitemap-2.xml; update index. |
| Single sitemap > 50 MB uncompressed | Same rejection class. | Same sharding logic; start a new shard at whichever limit is hit first, 45k URLs or 45 MB. |
| Sitemap-index too deep | Sitemaps-of-sitemaps-of-sitemaps fails some crawlers. | Max one level of indirection: index → shards. |
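
The "whichever limit comes first" sharding rule can be sketched as a greedy packer over pre-rendered <url> fragments. Assumptions: `shard_fragments` and `shard_names` are hypothetical names, fragment byte length is used as a stand-in for final file size (the XML header and root element add a little on top), and 45 MB is read as 45 × 1024 × 1024 bytes.

```python
from typing import Iterable, List


def shard_fragments(
    fragments: Iterable[bytes],
    max_urls: int = 45_000,
    max_bytes: int = 45 * 1024 * 1024,
) -> List[List[bytes]]:
    """Greedily pack <url> fragments, starting a new shard at either limit."""
    shards: List[List[bytes]] = []
    current: List[bytes] = []
    size = 0
    for fragment in fragments:
        over_count = len(current) >= max_urls
        over_bytes = size + len(fragment) > max_bytes
        if current and (over_count or over_bytes):
            shards.append(current)
            current, size = [], 0
        current.append(fragment)
        size += len(fragment)
    if current:
        shards.append(current)
    return shards


def shard_names(stem: str, count: int) -> List[str]:
    """post-sitemap.xml for a single shard, post-sitemap-N.xml otherwise."""
    if count <= 1:
        return [f"{stem}.xml"]
    return [f"{stem}-{n}.xml" for n in range(1, count + 1)]
```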

Category F — Robots.txt / _redirects coherence

| Trap | What goes wrong | Mitigation |
| --- | --- | --- |
| Disallowed URLs in sitemap | robots.txt says Disallow: /wp-admin/ but sitemap lists /wp-admin/foo. Google interprets this as a contradictory signal and may demote the whole sitemap. | Parse robots.txt (the Cloudflare-managed block + custom rules); skip any URL covered by Disallow: patterns. |
| Redirect targets that 301 again | /old/ → /new/ → /newer/. Sitemap emitting /new/ is still wrong. | Follow redirect chain to final 200; emit only the terminal URL. Cache the resolution. |
| Sitemap referenced from robots.txt but pointing wrong | Sitemap: https://www.dare.co.uk/sitemap.xml line in robots.txt is separately maintained; if we rename or move the index, both need updating. | Script verifies the robots.txt Sitemap: line matches its own output path. Warns (doesn’t auto-fix) on mismatch. |
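
A sketch of the redirect-chain and robots.txt checks, using only the standard library. Assumptions: _redirects lines are taken to be whitespace-separated "source destination [status]" with # comment lines, splat and placeholder rules are ignored, and the chain is resolved statically from the file (it does not confirm the terminal URL actually returns 200). All function names are hypothetical.

```python
from typing import Dict, Optional
from urllib.robotparser import RobotFileParser


def parse_redirects(text: str) -> Dict[str, str]:
    """Map each redirect source path to its destination."""
    rules: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            rules[fields[0]] = fields[1]
    return rules


def resolve_redirect(path: str, rules: Dict[str, str], max_hops: int = 10) -> Optional[str]:
    """Follow source -> destination until a non-source path; None on a loop."""
    hops = 0
    while path in rules:
        hops += 1
        if hops > max_hops:
            return None  # cycle or implausibly long chain
        path = rules[path]
    return path


def allowed_by_robots(url: str, robots_txt: str, agent: str = "*") -> bool:
    """True if robots.txt does not disallow the URL for the given agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Any path that appears as a key in the parsed rules is a redirect source and would be skipped from the sitemap; `resolve_redirect` is only needed to find which terminal URL should be present instead.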

Category G — Idempotency & cron

| Trap | What goes wrong | Mitigation |
| --- | --- | --- |
| Run-to-run jitter from datetime.now() | If lastmod gets stamped from now(), every run rewrites every entry. Git churn explodes. | Lastmod is derived purely from inputs (JSON-LD / git log / mtime), never from now(). |
| Map ordering differences between Python versions | Dict iteration order shouldn’t matter (Python 3.7+ preserves insertion), but a refactor could re-introduce non-determinism. | Always sorted(entries, key=lambda e: e.canonical) before serialise. |
| Cron writes every run even when no content changed | Daily git commit even when zero diff = noise. | diff against current repo_root/<sitemap>.xml; skip write + commit if byte-identical. Emit summary report regardless (per always-publish). |
| Cached HEAD results going stale | A 7-day cache misses a CDN URL that just started 404ing. | TTL of 1-7 days, configurable. CDN URLs are stable by design (immutable per cache-control headers from dare_s3_to_r2_promote.py), so longer is fine. |
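
The skip-if-byte-identical rule is a few lines once the sitemap is rendered to bytes. A minimal sketch; `write_if_changed` is a hypothetical name, and the caller is assumed to use the return value to decide whether to commit.

```python
from pathlib import Path


def write_if_changed(path: Path, new_bytes: bytes) -> bool:
    """Write only when content differs; return True if the file was written."""
    if path.exists() and path.read_bytes() == new_bytes:
        return False  # byte-identical: no write, so no git diff
    path.write_bytes(new_bytes)
    return True
```

The summary report would still be emitted either way; only the sitemap write and the commit hang off the boolean.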

Category H — Portfolio portability

| Trap | What goes wrong | Mitigation |
| --- | --- | --- |
| Hardcoded dare.co.uk strings | Every transfer to dogwood / audrey / client work requires sed-replace + careful diff. The whole point of the script breaks. | All site-specific values in config. Script body never mentions a site name. |
| Per-portfolio image-CDN routing differs | dare: images.dare.co.uk. dogwood: probably images.dogwood.house (TBD). audrey: images.audreyinc.com (TBD). Some sites may use a path-based CDN (example.com/cdn/...) instead of a hostname. | Config supports both image_cdn_base (hostname) and image_cdn_prefix (path-rewrite rule). Default to the first; clients with weird setups override. |
| Different sitemap structures per platform | Squarespace/Shopify already emit their own sitemap; replacing it is the wrong move. Cloudflare Pages static site = this script’s natural home. | Script refuses to run if canonical_base resolves to a non-pages.dev / non-CF-Worker origin (or override with --force). |
| Per-site article-vs-page split | dare uses the WP convention (post/page); dogwood may not have an analogous distinction. | post_paths / page_paths are config globs. A site with no page distinction sets page_paths: []; script emits single sitemap. |

Category I — CF / bot interactions

| Trap | What goes wrong | Mitigation |
| --- | --- | --- |
| Python-urllib/<ver> UA blocked by CF bot management | HEAD check returns 403 from CF, so the script wrongly concludes the image is 404. Documented in feedback_python_urllib_ua_cloudflare.md; bit dare_s3_to_r2_promote.py on 2026-05-11. | Always send a Mozilla-ish UA on probes: dare-pipeline-sitemap/1.0 (+Mozilla/5.0). |
| Hotlink protection on the image CDN | images.dare.co.uk 403s requests with non-dare.co.uk Referer. Sitemap probes from local dev → 403 false-negative. | Probe with no Referer set (CF passes those; verified 2026-05-14 by cf-access). Document the constraint clearly. |
| Service-token-gated source sites | A staging or preview surface sits behind CF Access, so the script can’t HEAD-check from CI without injecting Access creds. | --skip-image-validation flag for offline runs. Otherwise reach for the existing cf-access wrapper (~/bin/cf-access). |
| Headless probe = soft-404 mistaken for success | Per feedback_screenshot_error_page_guards.md, CF Pages returns 200 + custom-error-body for unknown paths. HEAD-check sees 200; script trusts it. | Match response body against known error-page fingerprints if probing pages (not images); this needs a GET, since a HEAD response carries no body. Per-portfolio fingerprint list in config. |
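
The probe rules above (custom UA, no Referer, soft-404 fingerprinting) could be wired with the standard library as sketched here. The UA string is the one quoted in the table; `build_probe`, `looks_like_soft_404` and the example fingerprint are hypothetical, and nothing in this sketch sends a request on its own.

```python
import urllib.request
from typing import Iterable

PROBE_UA = "dare-pipeline-sitemap/1.0 (+Mozilla/5.0)"


def build_probe(url: str, method: str = "HEAD") -> urllib.request.Request:
    """Build a probe request with the custom UA and no Referer header."""
    return urllib.request.Request(
        url,
        method=method,
        headers={"User-Agent": PROBE_UA},
    )


def looks_like_soft_404(body: str, fingerprints: Iterable[str]) -> bool:
    """True if a 200 response body contains a known error-page marker."""
    lowered = body.lower()
    return any(marker.lower() in lowered for marker in fingerprints)
```

For image URLs `build_probe(url)` would be passed to `urllib.request.urlopen`; for page URLs `build_probe(url, method="GET")` gives a body that `looks_like_soft_404` can inspect against the per-site fingerprint list.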

Category J — Image-sitemap-extension subtleties

| Trap | What goes wrong | Mitigation |
| --- | --- | --- |
| Same image referenced from N pages | Bare <image:image> block under each page’s <url> is correct (it’s a page-image relationship, not a global-image listing). | Don’t dedupe across pages. Per-page repetition is the spec. |
| <image:caption> from alt text | Useful signal for image search. Long alt text = good; missing alt = empty caption. | Include <image:caption> only when the alt attribute is non-empty and meaningful. Skip noisy alts (single-word, generic). |
| <image:license> for legal cohorts | The DARE archive has historical images with mixed licensing. Asserting a license per image without provenance is risky. | Omit <image:license> unless config provides a default + explicit allowlist. |
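
The "non-empty and meaningful" alt filter might be as simple as the sketch below. The generic-phrase list is purely illustrative and `caption_from_alt` is a hypothetical name. It may also be worth re-checking Google's current image-sitemap documentation before building this, since as far as I know <image:caption> and <image:license> have been listed there as deprecated tags.

```python
from typing import Optional

# Illustrative only: multi-word alts that carry no real description.
GENERIC_ALTS = {"featured image", "stock photo", "image placeholder"}


def caption_from_alt(alt: Optional[str]) -> Optional[str]:
    """Return a cleaned caption, or None for empty, single-word or generic alts."""
    text = " ".join((alt or "").split())  # collapse runs of whitespace
    if len(text.split()) < 2:
        return None
    if text.lower() in GENERIC_ALTS:
        return None
    return text
```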

What the report (the .md emitted each run) should say

The report follows the “so what + what next” report structure.

The report becomes the storytelling-substrate artefact for sitemap hygiene over time (per feedback_toolkit_as_storytelling_substrate.md).

Wire-up

| Surface | How |
| --- | --- |
| Manual run | ~/bin/dare_sitemap_regen.py --site dare |
| CI / cron | weekly via GHA in dare-pipeline repo; PR-only on diff (no auto-commit to main without staging-first per current discipline) |
| Devreports | auto-publish via dare_dev_reports_refresh.sh; pattern added to REPORT_PATTERNS |
| Memory | project_*_sitemap_regen.md per portfolio site, capturing per-site config + decisions |
| 1Password | none; no secrets required for the local walk, and HEAD probes use no auth on public CDNs |

Compounding across portfolio

| Site | Status | Notes |
| --- | --- | --- |
| dare.co.uk | Genesis. | ~700 articles + ~12 pages. WP-era sitemap to replace. First-mover, edge cases will drive script design. |
| dogwood.house | Future. | NYC/CT/Hamptons service. Will have photo galleries + service-area pages. Sharding likely needed once gallery archive grows. |
| audreyinc.com | Future. | Shopify-backed; Shopify emits its own sitemap, so this script applies only to the agent-discoverability gift-guide pages at /gift-guide/* (not the product catalog). Hybrid: Shopify sitemap + this script’s gift-guide sitemap, linked from the index. |
| dansellars.com | Future. | Personal site. Standard pattern. |
| Client engagements | Future. | Per-engagement YAML config. Same binary, different config; same compounding model as the existing audit + 404-audit + thumbnailer toolkit. |

Resume conditions

Build when one of these triggers:

  - post-sitemap.xml reaches embarrassment threshold (currently 506 dead entries; if image-search referral traffic to dare ever materialises, the rotting sitemap becomes a measurable cost)
  - A second portfolio site (dogwood or audrey) needs a sitemap and we’d otherwise hand-roll it (build once, run twice)
  - A client engagement requires sitemap generation from a static repo (commercial trigger)
  - Cumulative hand-edited entries (the 3 fixed today + any future) exceed 20 (manual cost > build cost)

Until then: parked. Per feedback_park_with_resume_conditions.md.


Build once, run across the portfolio. The proper resolution to a 535-entry sitemap is not 535 edits — it’s the function that emits 535 entries from the source of truth.

Source: dare_sitemap_regen_sketch_2026-05-14.md · Rendered 2026-05-14 09:33