dare_sitemap_regen.py — architecture sketch · 2026-05-14
Parked workstream. Portfolio-portable static-site sitemap regenerator that emits valid sitemap.xml + post-sitemap.xml + page-sitemap.xml from a static repo. Designed to retire the WP-era frozen-in-amber post-sitemap.xml (last edited 2026-05-07; carries 506 dead http://edge.dare.co.uk/wp/* entries). Lives in ~/bin/ so the same script handles dare → dogwood → audrey → client work via config swap.
Genesis: today’s Edge-cohort migration left the sitemap stale (3 entries fixed by hand, commit 24b8736e). Doing that 506 more times by hand is a non-starter; generating it from current repo state is the right substrate.
What it does
One walk of the repo, three outputs:
- sitemap.xml: the index. References the children with current <lastmod> values.
- post-sitemap.xml: article-tree <url> blocks (one per migrated article), each carrying <image:image> children for body imagery.
- page-sitemap.xml: top-level static pages (/contact/, /privacy-policy/, policy pages, the four section listings).
Auto-shards if >= 45,000 URLs or >= 45 MB. Auto-skips if no diff vs deployed sitemap. Emits a dated report to ~/Downloads/dare_sitemap_regen_<date>.md per the always-publish rule.
Inputs (config-driven, never hardcoded)
A per-site YAML at ~/.config/dare-sitemap/<site>.yaml:
repo_root: /Users/dansellars/Code/dare-co-uk
canonical_base: https://www.dare.co.uk
image_cdn_base: https://images.dare.co.uk
excludes:
- wp-content/**
- wp-includes/**
- "**/*.bak-*"
- "**/index.html.bak-*"
- error-404/**
- dare_migrate_failures_*.txt
redirect_file: _redirects # source-column entries get skipped from sitemap
post_paths: # globs that go into post-sitemap.xml
- architecture/*/index.html
- cinema/*/index.html
- methods-of-business-design/*/index.html
- culture-means-thriving-teams/*/index.html
- field-notes-from-business-design/*/index.html
- daring-acts/*/index.html
- observations/*/index.html
- books/*/index.html
- brands/*/index.html
- albums/*/index.html
- photography/*/index.html
- users/*/index.html
- archive/*/index.html
page_paths: # globs that go into page-sitemap.xml
- contact.html
- privacy-policy/**
- anti-spam-policy/**
- dmca-policy/**
- sitemap/**
- "*/archive/index.html" # section root archives
- methods-of-business-design/index.html
- culture-means-thriving-teams/index.html
- field-notes-from-business-design/index.html
- daring-acts/index.html
homepage: index.html # treated as a top-level "page"
Dogwood / audrey / client engagements each get their own YAML; the script is otherwise identical.
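
A minimal sketch of how the excludes / post_paths / page_paths globs above might be applied to a repo-relative path. The helper names (glob_to_regex, classify) and the matching rules are assumptions for illustration: here `*` stays within one path segment and `**` may cross `/`, which the sketch does not confirm as the intended semantics.

```python
import re
from typing import Dict, List, Optional


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a config glob into a regex: '**' crosses '/', '*' does not."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def matches_any(rel_path: str, globs: List[str]) -> bool:
    """True when the whole path matches at least one glob."""
    return any(glob_to_regex(g).fullmatch(rel_path) for g in globs)


def classify(rel_path: str, config: Dict[str, List[str]]) -> Optional[str]:
    """Return 'post', 'page', or None (excluded or unmatched)."""
    if matches_any(rel_path, config.get("excludes", [])):
        return None  # excludes win over every include glob
    if matches_any(rel_path, config.get("post_paths", [])):
        return "post"
    if matches_any(rel_path, config.get("page_paths", [])):
        return "page"
    return None
```

Checking excludes first means a stray index.html.bak-* backup inside an article directory can never leak into post-sitemap.xml, whatever the include globs say.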
The walk (one function per concern)
collect(config) →
    for each html_path in walk(repo, includes=post+page, excludes):
        parsed = lxml.html.parse(html_path)
        canonical = extract_canonical(parsed)   # <link rel="canonical">
        lastmod = extract_lastmod(html_path, parsed)
            ↪ priority: JSON-LD dateModified > article:modified_time
                        > git log -1 --format=%cI -- <file> > file mtime
        images = extract_images(parsed)         # see classify_images() below
        yield Entry(canonical, lastmod, images, is_post=in_post_globs)
render(entries) →
    sort by canonical URL                       # stable output
    split into post/page/maybe-image-only sitemaps
    serialise via ElementTree (escapes &, no BOM, LF endings)
    write to repo_root/<sitemap>.xml
validate(written) →
    xmllint --noout if available
    head-check each <loc> via Mozilla UA + cached results
    emit count summary to ~/Downloads/dare_sitemap_regen_<date>.md
Pitfalls + traps (the meat of the sketch)
Category A — Lastmod & timestamps
| Trap | What goes wrong | Mitigation |
|---|---|---|
| File mtime is unreliable | Branch switches, cp -p drops, repo moves (~/Downloads→~/Code 2026-05-07) all clobber mtime. | Prefer JSON-LD dateModified (article-embedded, survives copies). Fall back to git log -1 --format=%cI --follow <file>. mtime is last resort only. |
| Future timestamps | Clock skew between local Mac + CF Pages runners + git committer date can emit lastmod values ahead of now(). Google’s parser warns / drops these. | Clamp lastmod to min(extracted, now() - 60s). |
| Timezone-naive strings | 2026-05-14T13:00 (no zone) is technically invalid per W3C datetime. | Always emit +00:00 suffix; convert all sources to UTC before emit. |
| Per-file git log is slow | 700+ articles × git log -1 = a few seconds per run. Survivable but noisy on cron. | Cache last-known-git-mtime per file in ~/.cache/dare-sitemap/git-mtime.json, invalidate on git HEAD change. |
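
The clamp and timezone mitigations in this table can be sketched as one small function. This is illustrative only: the name normalise_lastmod is hypothetical, and treating zone-less inputs as UTC is an assumption the sketch does not settle. Passing `now` in as a parameter keeps the function pure, so tests stay deterministic.

```python
from datetime import datetime, timedelta, timezone


def normalise_lastmod(raw: str, now: datetime) -> str:
    """Return a UTC, second-precision ISO string, never later than now - 60s."""
    raw = raw.strip()
    if raw.endswith("Z"):
        # Older Pythons' fromisoformat() rejects the 'Z' suffix.
        raw = raw[:-1] + "+00:00"
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        # Assumption: zone-less source values are already UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    as_utc = parsed.astimezone(timezone.utc)
    ceiling = now.astimezone(timezone.utc) - timedelta(seconds=60)
    # Clamp future timestamps caused by clock skew.
    return min(as_utc, ceiling).replace(microsecond=0).isoformat()
```

Past timestamps pass through unchanged apart from the UTC conversion, so the clamp only touches values that a parser would otherwise warn about or drop.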
Category B — Image discovery
| Trap | What goes wrong | Mitigation |
|---|---|---|
| Regex-based <img> parse | Multi-line attributes (<img\n src="..."), HTML entities, conditional comments, embedded SVG break naïve regex. The 2026-05-06 WP migration script lost an hour to this. | Use lxml.html.parse not regex. The investment compounds across audit + sitemap + 404-audit tools. |
| /cdn-cgi/image/... transforms in <img src> | Three Edge pages had src="/cdn-cgi/image/format=auto,quality=85/wp-content/uploads/edge/X.jpg". Emitting that into a sitemap leaks Cloudflare’s edge transform path. | Resolve cdn-cgi prefixes back to canonical CDN URLs via a small regex: /cdn-cgi/image/[^/]+/(.+) → <image_cdn_base>/<group>. Same routine handles format=auto,quality=85 variants. |
| Favicons & publisher logos as image entries | <link rel="icon" href="..."> + JSON-LD publisher.logo reference cropped-ziiiro-celeste.jpeg / snapshot.jpg on every page. Sitemap would have 700+ duplicate entries. | Allowlist body-image extraction sources (<img src> inside .article-body, og:image, JSON-LD image array). Skip favicon, publisher.logo, twitter:image (usually a dup of og:image). |
| Base64 inline images | Any <img src="data:image/..."> shouldn’t appear in sitemap (it’s not a URL). | Drop any src starting with data:. |
| <picture>/srcset multi-source | Modern articles may use <picture> with multiple <source srcset="X.webp 1x, X@2x.webp 2x">. Each is a separate URL. | Pick highest-resolution source per <picture> block (largest descriptor), emit once. |
| Cross-domain image references | A dare article that references an audreyinc.com image: should that appear in dare’s sitemap? Google’s spec allows cross-domain image refs in a sitemap, but it’s a weak ownership signal. | Default: keep but flag in the report. Add --strict-same-origin opt-in flag for cohorts where it matters. |
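
Three of the mitigations above (cdn-cgi resolution, dropping data: URIs, picking the largest srcset candidate) are small enough to sketch together. The function names are hypothetical. The srcset splitter is a simplification that assumes candidates are separated by a comma plus whitespace and use a single descriptor style (all `x` or all `w`).

```python
import re
from typing import Optional

# Pattern taken from the mitigation column: /cdn-cgi/image/<options>/<source>
CDN_CGI = re.compile(r"^/cdn-cgi/image/[^/]+/(.+)$")


def resolve_cdn_cgi(src: str, image_cdn_base: str) -> str:
    """Map a Cloudflare transform path back to the canonical CDN URL."""
    match = CDN_CGI.match(src)
    if not match:
        return src  # not a transform path; leave untouched
    return f"{image_cdn_base.rstrip('/')}/{match.group(1)}"


def is_sitemap_image(src: str) -> bool:
    """Inline base64 images are not URLs and must not be emitted."""
    return not src.strip().lower().startswith("data:")


def pick_largest_source(srcset: str) -> Optional[str]:
    """Return the URL with the largest descriptor (e.g. 2x or 1200w)."""
    best_url, best_size = None, -1.0
    for candidate in re.split(r",\s+", srcset.strip()):
        parts = candidate.split()
        if not parts:
            continue
        descriptor = parts[1] if len(parts) > 1 else "1x"  # bare URL means 1x
        try:
            size = float(descriptor[:-1])
        except ValueError:
            continue  # unrecognised descriptor; skip the candidate
        if size > best_size:
            best_url, best_size = parts[0], size
    return best_url
```

Splitting on comma-plus-whitespace rather than on bare commas matters here because the cdn-cgi option string itself contains a comma (format=auto,quality=85).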
Category C — URL canonicalisation
| Trap | What goes wrong | Mitigation |
|---|---|---|
| /foo/index.html vs /foo/ | Both serve the same content. Emitting both = duplicate-URL signal. | Always strip /index.html suffix; always trailing-slash for directory-style. |
| 301-source URLs in sitemap | /about/ 301s to /. Emitting /about/ tells Google “this is a canonical URL”, which contradicts the 301. | Parse _redirects; skip source entries from sitemap. Emit only redirect targets. |
| Query strings | <image:loc>https://x.co/img?w=200&h=300</image:loc> is invalid XML: a raw & must be escaped as &amp;. | Use ElementTree’s .text setter (auto-escapes). Never string-concat XML. |
| Mixed http/https | Older WP exports sometimes have http:// URLs alongside https://. Sitemap mixing the schemes signals confusion. | Force canonical_base scheme on all output. |
| Trailing whitespace in <loc> | <loc> https://... </loc>: some validators choke. | .strip() every extracted value before write. |
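
A rough sketch of the page-URL rules in this table (strip /index.html, trailing slash for directory-style paths, forced scheme, stripped whitespace). The function name is hypothetical, and the "last segment has no dot, so treat it as a directory" heuristic is an assumption. It is meant for page URLs only, since it also forces the host to canonical_base.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalise(url_or_path: str, canonical_base: str) -> str:
    """Normalise a page URL or repo-style path onto canonical_base."""
    base = urlsplit(canonical_base)
    parts = urlsplit(url_or_path.strip())  # drop stray whitespace first
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]  # /foo/index.html -> /foo/
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment and "." not in last_segment:
        path += "/"  # directory-style URL without its trailing slash
    # Scheme and host always come from canonical_base (fixes http/https mixing).
    return urlunsplit((base.scheme, base.netloc, path, parts.query, ""))
```

Because the output is built with urlunsplit rather than string concatenation, an empty query or fragment never leaves a dangling `?` or `#` behind.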
Category D — XML correctness
| Trap | What goes wrong | Mitigation |
|---|---|---|
| BOM at file start | Some XML parsers stumble on a BOM in front of <?xml?>. | Open with encoding="utf-8" not utf-8-sig; verify first bytes. |
| CRLF endings | Cross-platform repos can mix CRLF/LF. Sitemap-validator tools sometimes report odd column numbers. | Hardcode \n line endings: write bytes, or open() in text mode with newline="\n" so no platform translation occurs. |
| Hand-rolled XML | String concatenation of <url><loc>...</loc>...</url> breaks on first odd character. | Use xml.etree.ElementTree or lxml.etree. The 30 minutes invested here pays back forever. |
| Missing namespace declaration | If <image:image> appears without xmlns:image="..." on the root, Google ignores all image entries silently. | Always emit the full namespace block as the WP sitemap did: <urlset xmlns="..." xmlns:image="...">. |
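
A minimal serialisation sketch using the standard library's xml.etree.ElementTree, covering the escaping, namespace, BOM and line-ending rows above. The function name and the (loc, lastmod, images) tuple shape are assumptions for illustration. Note that ElementTree only declares a namespace on the root when some element actually uses it, so xmlns:image appears once at least one image is present.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Tuple

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMG_NS = "http://www.google.com/schemas/sitemap-image/1.1"
ET.register_namespace("", SM_NS)        # default namespace, no prefix
ET.register_namespace("image", IMG_NS)  # serialises as image:image


def render_urlset(entries: Iterable[Tuple[str, str, List[str]]]) -> bytes:
    """Serialise (loc, lastmod, image_urls) tuples to sitemap bytes."""
    urlset = ET.Element(f"{{{SM_NS}}}urlset")
    for loc, lastmod, images in sorted(entries, key=lambda e: e[0]):
        url = ET.SubElement(urlset, f"{{{SM_NS}}}url")
        ET.SubElement(url, f"{{{SM_NS}}}loc").text = loc.strip()
        ET.SubElement(url, f"{{{SM_NS}}}lastmod").text = lastmod
        for image_url in images:
            image = ET.SubElement(url, f"{{{IMG_NS}}}image")
            # Assigning .text lets ElementTree escape '&' for us.
            ET.SubElement(image, f"{{{IMG_NS}}}loc").text = image_url.strip()
    body = ET.tostring(urlset, encoding="unicode")
    # Build the bytes ourselves: plain UTF-8 (no BOM) and LF-only endings.
    return ('<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n").encode("utf-8")
```

Returning bytes rather than writing text sidesteps platform newline translation entirely; the caller can hand the result straight to a binary write.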
Category E — Size & sharding
| Trap | What goes wrong | Mitigation |
|---|---|---|
| Single sitemap > 50,000 URLs | Hard Google limit. Whole sitemap dropped silently. | dare won’t hit this (700 articles) but dogwood’s photo archive could. Auto-shard at 45k URLs → post-sitemap-1.xml, post-sitemap-2.xml; update index. |
| Single sitemap > 50 MB uncompressed | Same rejection class. | Same sharding logic; start a new shard as soon as either threshold (45k URLs or 45 MB) is reached. |
| Sitemap-index too deep | Sitemaps-of-sitemaps-of-sitemaps fails some crawlers. | Max one level of indirection: index → shards. |
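
The sharding rule can be sketched as a generator that closes a shard when either limit would be exceeded. The function name and the size_of callback are hypothetical. Reading 45 MB as 45 × 1024 × 1024 bytes is an assumption, and the sketch ignores the fixed overhead of the XML envelope, which the 5 MB of headroom under the 50 MB limit should absorb.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shard(
    entries: Iterable[T],
    size_of: Callable[[T], int],
    max_urls: int = 45_000,
    max_bytes: int = 45 * 1024 * 1024,
) -> Iterator[List[T]]:
    """Yield lists of entries, each within both the URL and byte budgets."""
    current: List[T] = []
    current_bytes = 0
    for entry in entries:
        entry_bytes = size_of(entry)
        over_count = len(current) >= max_urls
        over_bytes = current_bytes + entry_bytes > max_bytes
        if current and (over_count or over_bytes):
            yield current  # close this shard before adding the entry
            current, current_bytes = [], 0
        current.append(entry)
        current_bytes += entry_bytes
    if current:
        yield current
```

A site that stays under both limits gets exactly one shard back, so the caller can write a plain post-sitemap.xml in that case and only switch to numbered files when more than one shard is yielded.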
Category F — Robots.txt / _redirects coherence
| Trap | What goes wrong | Mitigation |
|---|---|---|
| Disallowed URLs in sitemap | robots.txt says Disallow: /wp-admin/ but sitemap lists /wp-admin/foo. Google interprets this as a contradictory signal and may demote the whole sitemap. | Parse robots.txt (the Cloudflare-managed block + custom rules); skip any URL covered by Disallow: patterns. |
| Redirect targets that 301 again | /old/ → /new/ → /newer/. Sitemap emitting /new/ is still wrong. | Follow redirect chain to final 200; emit only the terminal URL. Cache the resolution. |
| Sitemap referenced from robots.txt but pointing wrong | Sitemap: https://www.dare.co.uk/sitemap.xml line in robots.txt is separately maintained; if we rename or move the index, both need updating. | Script verifies the robots.txt Sitemap: line matches its own output path. Warns (doesn’t auto-fix) on mismatch. |
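
A sketch of the first two checks, under stated assumptions. It uses the standard library's urllib.robotparser, which does prefix matching and may not honour wildcard patterns the way Google's parser does. The redirect helpers (hypothetical names) resolve chains statically from a "source target [status]" _redirects file; confirming that the terminal URL really returns 200 would still need a live probe.

```python
from typing import Dict
from urllib.robotparser import RobotFileParser


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been read from disk."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def parse_redirects(text: str) -> Dict[str, str]:
    """Map redirect sources to targets; '#' comment lines are skipped."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping


def terminal_target(path: str, redirects: Dict[str, str], max_hops: int = 10) -> str:
    """Follow /old/ -> /new/ -> /newer/ to the end, guarding against loops."""
    seen = set()
    while path in redirects and path not in seen and len(seen) < max_hops:
        seen.add(path)
        path = redirects[path]
    return path
```

Any path that appears as a key in the parse_redirects() mapping is a 301 source and stays out of the sitemap; only the value returned by terminal_target() is a candidate for emission.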
Category G — Idempotency & cron
| Trap | What goes wrong | Mitigation |
|---|---|---|
| Run-to-run jitter from datetime.now() | If lastmod gets stamped from now(), every run rewrites every entry. Git churn explodes. | Lastmod is derived purely from inputs (JSON-LD / git log / mtime), never from now(). |
| Map ordering differences between Python versions | Dict iteration order shouldn’t matter (Python 3.7+ preserves insertion), but a refactor could re-introduce non-determinism. | Always sorted(entries, key=lambda e: e.canonical) before serialise. |
| Cron writes every run even when no content changed | Daily git commit even when zero diff = noise. | diff against current repo_root/<sitemap>.xml; skip write + commit if byte-identical. Emit summary report regardless (per always-publish). |
| Cached HEAD results going stale | A 7-day cache misses a CDN URL that just started 404ing. | TTL of 1-7 days, configurable. CDN URLs are stable by design (immutable per cache-control headers from dare_s3_to_r2_promote.py), so longer is fine. |
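
The skip-if-identical rule is small enough to show in full. This is a sketch with a hypothetical function name; it compares bytes rather than parsed XML, which is only meaningful because the serialiser is already expected to produce deterministic, sorted output.

```python
from pathlib import Path


def write_if_changed(path: Path, payload: bytes) -> bool:
    """Write payload only when it differs from what is on disk.

    Returns True when the file was written, False when it was skipped.
    """
    if path.exists() and path.read_bytes() == payload:
        return False  # byte-identical: no write, no commit
    path.write_bytes(payload)
    return True
```

The boolean return lets the caller decide whether to stage a commit, while the dated report can still be emitted on every run.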
Category H — Portfolio portability
| Trap | What goes wrong | Mitigation |
|---|---|---|
| Hardcoded dare.co.uk strings | Every transfer to dogwood / audrey / client work requires sed-replace + careful diff. The whole point of the script breaks. | All site-specific values in config. Script body never mentions a site name. |
| Per-portfolio image-CDN routing differs | dare: images.dare.co.uk. dogwood: probably images.dogwood.house (TBD). audrey: images.audreyinc.com (TBD). Some sites may use a path-based CDN (example.com/cdn/...) instead of a hostname. | Config supports both image_cdn_base (hostname) and image_cdn_prefix (path-rewrite rule). Default to the first; clients with weird setups override. |
| Different sitemap structures per platform | Squarespace/Shopify already emit their own sitemap; replacing it is the wrong move. Cloudflare Pages static site = this script’s natural home. | Script refuses to run if canonical_base resolves to a non-pages.dev / non-CF-Worker origin (or override with --force). |
| Per-site article-vs-page split | dare uses the WP convention (post/page); dogwood may not have an analogous distinction. | post_paths / page_paths are config globs. A site with no page distinction sets page_paths: []; script emits single sitemap. |
Category I — CF / bot interactions
| Trap | What goes wrong | Mitigation |
|---|---|---|
| Python-urllib/<ver> UA blocked by CF bot management | HEAD check returns 403 from CF, so the script wrongly concludes the image is 404. Documented in feedback_python_urllib_ua_cloudflare.md; bit dare_s3_to_r2_promote.py on 2026-05-11. | Always send a Mozilla-ish UA on probes: dare-pipeline-sitemap/1.0 (+Mozilla/5.0). |
| Hotlink protection on the image CDN | images.dare.co.uk 403s requests with non-dare.co.uk Referer. Sitemap probes from local dev → 403 false-negative. | Probe with no Referer set (CF passes those; verified 2026-05-14 by cf-access). Document the constraint clearly. |
| Service-token-gated source sites | A staging or preview surface behind CF Access: script can’t HEAD-check from CI without injecting Access creds. | --skip-image-validation flag for offline runs. Otherwise reach for the existing cf-access wrapper (~/bin/cf-access). |
| Headless probe = soft-404 mistaken for success | Per feedback_screenshot_error_page_guards.md, CF Pages returns 200 + custom-error-body for unknown paths. HEAD-check sees 200; script trusts it. | Match response body against known error-page fingerprints if probing pages (not images). Per-portfolio fingerprint list in config. |
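
A sketch of how a probe request might be assembled with the standard library so that it carries the UA string from the table and no Referer. The function names are hypothetical; the request is only built here, not sent. The fingerprint check is a plain case-insensitive substring match, which is an assumption about what "fingerprint" means in the config.

```python
import urllib.request
from typing import Iterable

# UA string as given in the mitigation column.
PROBE_UA = "dare-pipeline-sitemap/1.0 (+Mozilla/5.0)"


def build_probe(url: str, method: str = "HEAD") -> urllib.request.Request:
    """Build a probe with an explicit UA and deliberately no Referer."""
    return urllib.request.Request(
        url,
        method=method,
        headers={"User-Agent": PROBE_UA},
    )


def looks_like_soft_404(body: str, fingerprints: Iterable[str]) -> bool:
    """True when a 200 response body matches a known error-page marker."""
    lowered = body.lower()
    return any(marker.lower() in lowered for marker in fingerprints)
```

Image probes can stay on HEAD, while page probes would need method="GET" so there is a body to pass to looks_like_soft_404().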
Category J — Image-sitemap-extension subtleties
| Trap | What goes wrong | Mitigation |
|---|---|---|
| Same image referenced from N pages | Bare <image:image> block under each page’s <url> is correct (it’s a page-image relationship, not a global-image listing). | Don’t dedupe across pages. Per-page repetition is the spec. |
| <image:caption> from alt text | Useful signal for image search. Long alt text = good; missing alt = empty caption. | Include <image:caption> only when the alt attribute is non-empty and meaningful. Skip noisy alts (single-word, generic). |
| <image:license> for legal cohorts | The DARE archive has historical images with mixed licensing. Asserting a license per image without provenance is risky. | Omit <image:license> unless config provides a default + explicit allowlist. |
What the report (the .md emitted each run) should say
Following the “so what + what next” report structure:
- TL;DR: N pages indexed, N images cited, sitemap size before/after, run duration
- What changed: diff summary vs the previous run (new pages, dropped pages, lastmod-only updates)
- Watch items: pages with no canonical URL, pages with no images, pages with future-dated lastmod, pages still pointing at dead hosts
- Recommendations: orphan articles (in tree but not linked from anywhere), image-search candidates (pages with strong images + weak traffic per GSC join), etc.
The report becomes the storytelling-substrate artefact for sitemap hygiene over time (per feedback_toolkit_as_storytelling_substrate.md).
Wire-up
| Surface | How |
|---|---|
| Manual run | ~/bin/dare_sitemap_regen.py --site dare |
| CI / cron | weekly via GHA in dare-pipeline repo; PR-only on diff (no auto-commit to main without staging-first per current discipline) |
| Devreports | auto-publish via dare_dev_reports_refresh.sh; pattern added to REPORT_PATTERNS |
| Memory | project_*_sitemap_regen.md per portfolio site, capturing per-site config + decisions |
| 1Password | none — no secrets required for the local walk; HEAD probes use no auth on public CDNs |
Compounding across portfolio
| Site | Status | Notes |
|---|---|---|
| dare.co.uk | Genesis. ~700 articles + ~12 pages. WP-era sitemap to replace. | First-mover, edge cases will drive script design. |
| dogwood.house | Future. NYC/CT/Hamptons service. Will have photo galleries + service-area pages. | Sharding likely needed once gallery archive grows. |
| audreyinc.com | Future. Shopify-backed; Shopify emits its own sitemap, so this script applies only to the agent-discoverability gift-guide pages at /gift-guide/* (not the product catalog). | Hybrid: Shopify sitemap + this script’s gift-guide sitemap, linked from the index. |
| dansellars.com | Future. Personal site. | Standard pattern. |
| Client engagements | Future. Per-engagement YAML config. | Same binary, different config; same compounding model as the existing audit + 404-audit + thumbnailer toolkit. |
Attribution & references
Where the design decisions in this sketch come from. Inline [N] markers in the pitfalls section above point at these.
Specs (load-bearing)
- Sitemap protocol: https://www.sitemaps.org/protocol.html
  Canonical XML schema, the 50,000-URL and 50 MB-uncompressed per-file limits, and the <sitemapindex>-of-<sitemap> shape for index files. Source of the namespace strings and the <lastmod> W3C-datetime requirement.
- Sitemap image extension: https://www.sitemaps.org/protocol.html#image
  The <image:image> namespace (xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"), required <image:loc> child, optional <image:caption> / <image:title> / <image:license> / <image:geo_location>.
- robots.txt: RFC 9309, https://datatracker.ietf.org/doc/html/rfc9309
  §2.2.4 mandates parsers MUST ignore unknown directives (basis for the Content-Signal-isn’t-broken finding earlier today). §2.5 establishes the Sitemap: directive convention used at the end of robots.txt.
Google guidance (consumer of the sitemap)
- Sitemaps overview: https://developers.google.com/search/docs/crawling-indexing/sitemaps/overview
  Establishes that Google treats sitemaps as hints, not directives; that <changefreq> and <priority> are largely ignored (well-documented across years of Google Search Central guidance); and that conflicts with robots.txt degrade the sitemap’s overall trust.
- Build a sitemap: https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap
  Specifies UTF-8 encoding requirement, URL-entity-encoding for & etc., and the canonical-URL discipline (no 301 sources).
- Image sitemap docs: https://developers.google.com/search/docs/crawling-indexing/sitemaps/image-sitemaps
  Confirms one image-block per page (don’t dedupe across pages), and that <image:caption> from alt is a useful signal.
- Sitemap formats supported: https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap#text
  XML, RSS/Atom, and plain text formats are accepted; XML is the only one supporting the image extension.
Cloudflare-specific
- Workers static assets: https://developers.cloudflare.com/workers/static-assets/
  The deploy model dare.co.uk uses (Workers-with-Assets, not classic Pages); .assetsignore semantics (referenced in this session’s 72 MB-cut commit 6252b77e); deploy-by-wrangler versions upload.
- Cloudflare Pages: https://developers.cloudflare.com/pages/
  The model dare-dev-reports uses (publish surface for devreports.dare.co.uk); --branch=main as the production-promotion mechanism.
- Cloudflare R2: https://developers.cloudflare.com/r2/
  The bucket model used for dare-images, the now-retired edge, and the dogwood/audrey successors. S3-compatible API used by dare_s3_to_r2_promote.py and today’s dare_wp_uploads_to_r2.py.
- Cloudflare Image Resizing (cdn-cgi/image/): https://developers.cloudflare.com/images/transform-images/
  The /cdn-cgi/image/format=auto,quality=85/<source> transform syntax encountered in the 3 Tier 1 articles before today’s rewrite. Confirms the script should resolve these back to the canonical CDN URL before emitting into a sitemap.
- Cloudflare Cache: https://developers.cloudflare.com/cache/concepts/cache-control/
  The Cache-Control: public, max-age=31536000, immutable headers dare_s3_to_r2_promote.py sets; basis for the “long HEAD cache TTL is fine” decision (URLs are stable by content-addressed-naming convention).
Internal memory references (in-house lessons)
- feedback_python_urllib_ua_cloudflare.md: CF bot management 403s default Python-urllib/<ver>. Drives the Mozilla-UA discipline on every HEAD probe.
- feedback_referrer_policy_for_cross_origin_images.md: images.dare.co.uk hotlink-protects against non-dare.co.uk Referer. Drives the no-Referer probe choice.
- feedback_screenshot_error_page_guards.md: CF Pages serves 200 + custom-error-body for unknown paths; HEAD-200 alone isn’t proof. Drives the optional body-fingerprint check.
- feedback_seo_image_naming_convention.md: five rules for image filenames on CDNs; sitemap should never rename images mid-flight (basename-mode discipline from dare_s3_to_r2_promote_basename_2026-05-11).
- feedback_check_devreports_before_infra_gap_analysis.md: sweep ~/Downloads/dare_* + ~/bin/dare_* + memory before re-deriving. Verified there’s no existing dare_sitemap_* script before starting this build.
- feedback_save_vs_find_alignment.md: REPORT_PATTERNS is the find-time query; emit basenames that match (the *_sitemap_regen_* pattern this script will need).
- feedback_park_with_resume_conditions.md: resume conditions structure (named below).
- feedback_layered_guardrail_stack.md: credential layering (no secrets needed here; layer 1 alone suffices for a local walk + public HEAD probes).
Tooling references (the existing toolkit this script joins)
- ~/bin/dare_s3_to_r2_promote.py: argparse + dry-run-default + dated-report shape. Closest sibling.
- ~/bin/dare_wp_uploads_to_r2.py: today’s build; same shape, smaller scope.
- ~/bin/dare_dev_reports_publish.py: the publishing micro-service; REPORT_PATTERNS allowlist this script must join.
- ~/bin/dare_404_audit.py: repo-walk-by-glob pattern, HEAD-check pattern.
- ~/bin/dare_dev_reports_refresh.sh: the cron-wrapper pattern; sentinel-based op-injection re-exec (not needed here since no secrets, but worth matching shape for future portfolio variants that may need GHA secrets).
Resume conditions
Build when one of these triggers:
- post-sitemap.xml reaches embarrassment threshold (currently 506 dead entries; if image-search referral traffic to dare ever materialises, the rotting sitemap becomes a measurable cost)
- A second portfolio site (dogwood or audrey) needs a sitemap and we’d otherwise hand-roll it (build once, run twice)
- A client engagement requires sitemap generation from a static repo (commercial trigger)
- Cumulative hand-edited entries (the 3 fixed today + any future) exceed 20 (manual cost > build cost)
Until then: parked. Per feedback_park_with_resume_conditions.md.
Build once, run across the portfolio. The proper resolution to a 535-entry sitemap is not 535 edits — it’s the function that emits 535 entries from the source of truth.