STASH placeholder leak in seo_render_html — parked for idle-time investigation
DARE.CO.UK · PARKED SKETCH · 2026-05-18
Mirrored from ~/.claude/.../memory/project_stash_placeholder_leak_parked.md. This is a design sketch parked for future build — read for context, not as a current deliverable.
`\x00STASH<n>\x00` placeholders from `_autolink_bare_urls` leak into rendered HTML inside `<pre>` blocks (browser shows them as ▢STASH53▢ etc). Affects session reports + any md with fenced code blocks. Parked 2026-05-15; resume in idle time.
Parked 2026-05-15 noon — Dan: “Park it for the next pass, write a note to memory, process it at night, or during idle time.” The Judge.me playbook and other immediate work took priority.
Symptom
dare_session_report_2026-05-15.html (and probably other reports with fenced code blocks) renders three sections as:
<pre>▢STASH53▢</pre>
<pre>▢STASH56▢</pre>
<pre>▢STASH57▢</pre>
Where ▢ is the browser’s rendering of \x00 (U+0000) as U+FFFD (replacement character) per HTML5 spec.
Affected sections in the session report: Git activity, Staged for promotion, Files touched today — the three sections that contained fenced-code (triple-backtick) blocks. Live evidence: https://devreports.dare.co.uk/dare_session_report_2026-05-15 — scroll to those sections.
What’s established
- The placeholder syntax `\x00STASH<n>\x00` is created by `~/bin/seo_render_html.py::_autolink_bare_urls()` (line 533)
- It’s the ONLY place in `seo_render_html.py` that uses `STASH` as a placeholder prefix (verified by grep)
- A sibling stash system, `_annotate_jargon()`, uses `\x00JG_PH_<n>\x00`: different prefix, not the source
- `python-markdown` itself uses `\x02wzxhzdk:<n>\x03` (STX/ETX delimiters): different from STASH
- The function logic LOOKS correct on paper: stash `<a>`, `<code>`, `<pre>` tags in that order, then unwind in reverse-sort
- The function HAS shipped working for months; only certain reports leak
- The 3 specific leaks (53, 56, 57) correspond to the 3 fenced-code blocks in the session report
- An earlier instrumented re-render had a bug (`md_to_html` returns `(title, body)` not `(body, title)`) that produced a 38-byte “body”, which invalidates the first diagnosis
Hypotheses to test on resume
- Reverse-sort lexicographic ordering issue. The unwind iterates `sorted(placeholders.keys(), reverse=True)`. With 100+ placeholders, lexicographic order interleaves digit-counts oddly (in the reversed order STASH5 comes AFTER STASH50, and STASH53 comes BEFORE STASH105). Theory: a placeholder gets processed before its parent has been expanded, so the placeholder doesn’t yet exist in the html. Possible fix: sort numerically by extracted integer
- Regex non-greedy match grabbing the wrong span. The `<pre\b[^>]*>.*?</pre>` regex might match across blocks if HTML has unusual structure. Test by feeding the actual session-report HTML through the regex and inspecting matches
- Placeholder dict miss: a stash call didn’t run. Maybe the `<code>` regex doesn’t match certain `<code>` blocks (multi-line content with specific characters?). Test: add a print inside `_stash` to log every key as it’s assigned, compare against the dict at unwind time
- Encoding round-trip mangling. `\x00` is technically valid in Python strings but might be filtered when written to disk or read back. Less likely since the file DOES contain null bytes (browser renders them as ▢), but worth checking the bytes around STASH53 in the on-disk html
- Order-of-passes interaction with `_annotate_jargon`. The jargon function stashes/unstashes `<pre>` blocks too. If it leaves any residual state, `_autolink_bare_urls` might see different html than expected. Test by skipping `_annotate_jargon` and re-rendering
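The reverse-sort hypothesis can be checked in isolation, without touching `seo_render_html.py`. The sketch below is a toy model, not the real function: `key`, `build_example` and `unwind` are hypothetical names, and it assumes that a `<code>` element is stashed before the `<pre>` that wraps it and that the unwind is a plain `str.replace` per key. Under those assumptions, reverse lexicographic order visits STASH53 before STASH105, so the inner placeholder is skipped and later exposed inside the `<pre>`.

```python
import re

KEY_RE = re.compile(r"\x00STASH(\d+)\x00")


def key(n: int) -> str:
    """Build a placeholder in the \\x00STASH<n>\\x00 form described above."""
    return f"\x00STASH{n}\x00"


def build_example():
    """Toy stash state: 106 placeholders, where #105 (<pre>) wraps #53 (<code>)."""
    placeholders = {key(n): f"<a>link{n}</a>" for n in range(105)}
    placeholders[key(53)] = "<code>git log --oneline</code>"
    placeholders[key(105)] = f"<pre>{key(53)}</pre>"
    # The top-level html references every placeholder except the nested one.
    html = " ".join(k for k in placeholders if k != key(53))
    return html, placeholders


def unwind(html: str, placeholders: dict, order: list) -> str:
    """Replace each placeholder with its stashed content, in the given order."""
    for k in order:
        html = html.replace(k, placeholders[k])
    return html


html, placeholders = build_example()

# Order the note suspects is in use today.
lexicographic = sorted(placeholders, reverse=True)
# Proposed alternative: sort by the integer inside the key.
numeric = sorted(
    placeholders,
    key=lambda k: int(KEY_RE.fullmatch(k).group(1)),
    reverse=True,
)

leaky = unwind(html, placeholders, lexicographic)
clean = unwind(html, placeholders, numeric)

print("lexicographic leaks:", KEY_RE.findall(leaky))
print("numeric leaks:", KEY_RE.findall(clean))
```

If the real function behaves like this model, it would also explain why only some reports leak: the ordering only goes wrong when an inner and an outer placeholder fall on opposite sides of a digit-count boundary.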
Reproduction recipe
```python
import sys
from pathlib import Path

sys.path.insert(0, '/Users/dansellars/bin')
import seo_render_html as srh

md = Path('/Users/dansellars/Downloads/dare_session_report_2026-05-15.md').read_text()
title, body = srh.md_to_html(md)  # NOTE: returns (title, body) NOT (body, title)
print('STASH leaks:', body.count(chr(0) + 'STASH'))
# Expected: 0; actual: 3 (or more if other code blocks also leak)
```
Then dig into _autolink_bare_urls with instrumentation — log every _stash key, log dict size at unwind, log remaining placeholders in html post-unwind.
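The post-unwind check could be sketched as a small helper. `audit_placeholders` is a hypothetical name, not something that exists in `seo_render_html.py`, and it assumes the placeholder dict is keyed by the full `\x00STASH<n>\x00` string.

```python
import re

STASH_RE = re.compile(r"\x00STASH(\d+)\x00")


def audit_placeholders(html: str, placeholders: dict) -> dict:
    """Summarise placeholder state after the unwind pass."""
    # Numbers of placeholders still present in the html, in document order.
    remaining = STASH_RE.findall(html)

    # Numbers that were actually recorded in the stash dict.
    known = set()
    for k in placeholders:
        m = STASH_RE.fullmatch(k)
        if m:
            known.add(m.group(1))

    return {
        "dict_size": len(placeholders),
        "remaining": remaining,
        # Leaked placeholders with no dict entry at all.
        "unknown": [n for n in remaining if n not in known],
    }


report = audit_placeholders(
    "<pre>\x00STASH53\x00</pre> \x00STASH7\x00",
    {"\x00STASH53\x00": "<code>git status</code>"},
)
print(report)
```

A leaked number that is present in the dict suggests the unwind order or nesting is at fault; one that is absent points at the stash step instead.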
Fix approaches considered
| Approach | Pros | Cons |
|---|---|---|
| Numeric sort of placeholder keys instead of lexicographic | Surgical, addresses hypothesis #1 | Doesn’t fix if root cause is hypothesis #2 or #3 |
| Switch delimiter from `\x00` to printable ASCII (e.g. `__STASH_53__` or UUID per render) | Survives any encoding round-trip; visible in diagnostics | Wider regex impact; may match user content if not careful |
| Switch stash mechanism to BeautifulSoup or html.parser (DOM-based, not regex) | Robust against regex edge cases; cleaner code | Larger refactor; possible perf regression |
| Defensive cleanup pass after unwind | Cheap, catches any leaks regardless of cause | Treats symptom not cause; might hide real bugs |
Combined recommendation: switch delimiter to a safe ASCII pattern AND add numeric sort — fixes hypotheses #1, #4 simultaneously, makes the system more diagnosable. Defer DOM-based refactor unless this fix doesn’t hold.
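A minimal sketch of the combined recommendation, assuming a stash/unwind pair shaped roughly like the one described in this note. The `Stash` class, the `__STASH_<token>_<n>__` format, and the `<a>` and `<code>` patterns are illustrative only; the real `_autolink_bare_urls` may be structured differently. The per-render token is there to reduce the chance of colliding with user content.

```python
import re
import uuid


class Stash:
    """Hypothetical stash using printable placeholders and numeric unwind order."""

    def __init__(self):
        # Short random token, regenerated for every render.
        self.token = uuid.uuid4().hex[:8]
        self.items = []  # list index doubles as the placeholder number
        self._key_re = re.compile(rf"__STASH_{self.token}_(\d+)__")

    def _key(self, n: int) -> str:
        return f"__STASH_{self.token}_{n}__"

    def put(self, match: re.Match) -> str:
        """re.sub callback: store the matched span, return its placeholder."""
        self.items.append(match.group(0))
        return self._key(len(self.items) - 1)

    def unwind(self, html: str) -> str:
        """Restore highest numbers first, so outer spans expand before inner ones."""
        for n in range(len(self.items) - 1, -1, -1):
            html = html.replace(self._key(n), self.items[n])
        return html

    def leaks(self, html: str) -> list:
        """Placeholder numbers still present in html."""
        return self._key_re.findall(html)


html = '<p>see <a href="https://example.com">site</a></p><pre><code>git status</code></pre>'
stash = Stash()
work = html
# Same stash order as the note describes: <a>, then <code>, then <pre>.
for pattern in (r"<a\b[^>]*>.*?</a>", r"<code\b[^>]*>.*?</code>", r"<pre\b[^>]*>.*?</pre>"):
    work = re.sub(pattern, stash.put, work, flags=re.DOTALL)

restored = stash.unwind(work)
print(restored == html, stash.leaks(restored))
```

Iterating over list indices removes string sorting from the unwind entirely, and `leaks` gives the render a cheap assertion to run before writing the file.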
Where to resume
- `~/bin/seo_render_html.py`: `_autolink_bare_urls` function (lines 518-570)
- Instrumented reproduction: see recipe above
- Test against the session report markdown specifically; if fixed, verify against `dare_session_report_2026-05-14.md` (yesterday’s) and any other report with fenced code blocks
Watch items / scope
- Not blocking anything user-facing right now: the 3 leaks are inside `<pre>` tags so the browser shows them as garbled-but-readable content. Not pretty, but readers can scroll past
- Affects which reports? Any md with triple-backtick fenced code. Today’s session report is the visible one; others (older ones) may have the same bug if they had fenced code
- Tomorrow’s session report will probably show the same bug if it’s not fixed first — fenced code blocks in git activity / staged sections are part of the auto-generated spine
- Could batch-fix old reports by re-rendering them all once the bug is fixed, then republishing. Cheap one-off cleanup pass
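To decide which old reports need re-rendering, the published HTML can be scanned for the raw marker bytes. This is a sketch only: `find_leaky_reports` is a hypothetical helper, and it assumes the rendered reports sit as `.html` files in a single local directory. Reading bytes rather than text also shows whether the null byte really survives on disk.

```python
from pathlib import Path

MARKER = b"\x00STASH"


def find_leaky_reports(report_dir: Path) -> dict:
    """Map each .html file name in report_dir to its count of leaked placeholders."""
    leaky = {}
    for path in sorted(report_dir.glob("*.html")):
        count = path.read_bytes().count(MARKER)
        if count:
            leaky[path.name] = count
    return leaky
```

Usage would be along the lines of `find_leaky_reports(Path("path/to/rendered/reports"))`, with the directory replaced by wherever the reports are actually written.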
Linked memories
- `feedback_publish_pattern_allowlist_gotcha`: broader publish-pipeline gotcha collection; this is another instance
- `feedback_html_table_regex_gotcha`: sibling lesson, presentation-layer regex assumptions don’t always survive the render transform
- `feedback_isolated_rendering_inherit_trap`: different bug, same family (rendering pipeline assumptions)
Resume condition: free 30-60 min window. Instrument first, fix second. Verify against all reports with fenced code, not just today’s.