STASH placeholder leak in seo_render_html — parked for idle-time investigation

DARE.CO.UK · PARKED SKETCH · 2026-05-18

Mirrored from ~/.claude/.../memory/project_stash_placeholder_leak_parked.md. This is a design sketch parked for future build — read for context, not as a current deliverable.

\x00STASH\x00 placeholders from _autolink_bare_urls leak into rendered HTML inside

 blocks (browser shows them as ▢STASH53▢ etc). Affects session reports + any md with fenced code blocks. Parked 2026-05-15; resume in idle time


Parked 2026-05-15 noon — Dan: “Park it for the next pass, write a note to memory, process it at night, or during idle time.” The Judge.me playbook and other immediate work took priority.

Symptom

dare_session_report_2026-05-15.html (and probably other reports with fenced code blocks) renders three sections as:

<pre>▢STASH53▢</pre>
<pre>▢STASH56▢</pre>
<pre>▢STASH57▢</pre>

Where is the browser’s rendering of \x00 (U+0000) as U+FFFD (replacement character) per HTML5 spec.

Affected sections in the session report: Git activity, Staged for promotion, Files touched today — the three sections that contained fenced-code (triple-backtick) blocks. Live evidence: https://devreports.dare.co.uk/dare_session_report_2026-05-15 — scroll to those sections.

What’s established

Hypotheses to test on resume

  1. Reverse-sort lexicographic ordering issue. The unwind iterates sorted(placeholders.keys(), reverse=True). With 100+ placeholders, lexicographic order interleaves digit-counts oddly (STASH5 sorts AFTER STASH50). Theory: a placeholder gets processed before its parent has been expanded, so the placeholder doesn’t yet exist in the html. Possible fix: sort numerically by extracted integer
  2. Regex non-greedy match grabbing the wrong span. The <pre\b[^>]*>.*?</pre> regex might match across blocks if HTML has unusual structure. Test by feeding the actual session-report HTML through the regex and inspecting matches
  3. Placeholder dict miss — a stash call didn’t run. Maybe the <code> regex doesn’t match certain <code> blocks (multi-line content with specific characters?). Test: add a print inside _stash to log every key as it’s assigned, compare against the dict at unwind time
  4. Encoding round-trip mangling. \x00 is technically valid in Python strings but might be filtered when written to disk or read back. Less likely since the file DOES contain null bytes (browser renders them as ), but worth checking the bytes around STASH53 in the on-disk html
  5. Order-of-passes interaction with _annotate_jargon. The jargon function stashes/unstashes <pre> blocks too. If it leaves any residual state, _autolink_bare_urls might see different html than expected. Test by skipping _annotate_jargon and re-rendering

Reproduction recipe

import sys
sys.path.insert(0, '/Users/dansellars/bin')
import seo_render_html as srh
from pathlib import Path

md = Path('/Users/dansellars/Downloads/dare_session_report_2026-05-15.md').read_text()
title, body = srh.md_to_html(md)   # NOTE: returns (title, body) NOT (body, title)
print('STASH leaks:', body.count(chr(0)+'STASH'))
# Expected: 0; actual: 3 (or more if other code blocks also leak)

Then dig into _autolink_bare_urls with instrumentation — log every _stash key, log dict size at unwind, log remaining placeholders in html post-unwind.

Fix approaches considered

Approach Pros Cons
Numeric sort of placeholder keys instead of lexicographic Surgical, addresses hypothesis #1 Doesn’t fix if root cause is hypothesis #2 or #3
Switch delimiter from \x00 to printable ASCII (e.g. __STASH_53__ or UUID per render) Survives any encoding round-trip; visible in diagnostics Wider regex impact; may match user content if not careful
Switch stash mechanism to BeautifulSoup or html.parser (DOM-based, not regex) Robust against regex edge cases; cleaner code Larger refactor; possible perf regression
Defensive cleanup pass after unwind Cheap, catches any leaks regardless of cause Treats symptom not cause; might hide real bugs

Combined recommendation: switch delimiter to a safe ASCII pattern AND add numeric sort — fixes hypotheses #1, #4 simultaneously, makes the system more diagnosable. Defer DOM-based refactor unless this fix doesn’t hold.

Where to resume

  1. ~/bin/seo_render_html.py_autolink_bare_urls function (lines 518-570)
  2. Instrumented reproduction: see recipe above
  3. Test against the session report markdown specifically; if fixed, verify against dare_session_report_2026-05-14.md (yesterday’s) and any other report with fenced code blocks

Watch items / scope

Linked memories


Resume condition: free 30-60 min window. Instrument first, fix second. Verify against all reports with fenced code, not just today’s.

Source: parked_sketch_stash_placeholder_leak_2026-05-15.md · Rendered 2026-05-18 12:53