Extend vision-OCR pipeline to PDF receipts — Dan-flagged high priority (parked 2026-05-25)

DARE.CO.UK · PARKED SKETCH · 2026-07-15

Mirrored from ~/.claude/.../memory/parked_sketch_pdf_ocr_pipeline_extension_2026-05-25.md. This is a design sketch parked for future build — read for context, not as a current deliverable.

The vision-OCR receipt pipeline (Haiku-on-images) currently emits {"_error": "pdf_not_supported_by_vision_api"} for any PDF receipt. Tonight’s Willow Tree discovery proved the cost — a $1,500 tree-care invoice that fully captured the vendor identity (Willow Tree and Landscape Services, LLC · 40 years tenure · Hatboro PA · phone · email · domain · service line items) was invisible to the substrate until Claude read the PDF manually. Dan: “Worth extending the OCR pipeline to PDFs we need this!” Fix shape: rasterize each PDF page via pdftoppm / pdf2image → run each page through the same Haiku vision endpoint the JPG/PNG pipeline uses → merge per-page JSON sidecars. Cost ~$0.001 per page via Haiku Batches. ~$1-3 to backfill the full Harvest PDF-receipt corpus. Multiple instances already proven valuable (Jorge identity via JPG-OCR; Willow Tree via manual PDF-read; ChipDrop receipt via Dan’s emailed-receipt screenshot).

Dan 2026-05-25: “100% — Worth extending the OCR pipeline to PDFs we need this!” — after seeing how a single PDF read revealed Willow Tree’s full identity (40 years, phone, website, email, service details) from a receipt the existing pipeline marked as pdf_not_supported_by_vision_api.

The current gap (concrete)

The vision-OCR receipt pipeline runs against JPG/PNG receipts and emits structured sidecar JSON (vendor, line items, totals, dates). For PDFs, the sidecar is just:

{"_error": "pdf_not_supported_by_vision_api"}

So any receipt the property owner uploaded as a PDF is invisible to:

- Vendor identification (the contractor record-builder)
- Line-item extraction (per-job spend granularity)
- Cross-receipt grep (grep -r "Willow Tree" receipts/ returns 0 hits)
- Future analytics (per-vendor totals, service-frequency)

Tonight’s proof point

The Willow Tree and Landscape Services discovery:

- Harvest entry said only "Tree Surgery" as vendor
- Receipt was a PDF: Invoice_42138_6279_Greenhill_Road.pdf
- Pipeline sidecar: {"_error": "pdf_not_supported_by_vision_api"}
- Manual Read of the PDF surfaced:
  - company name (Willow Tree and Landscape Services, LLC)
  - 40-year tenure tagline (“Rooting Relationships in Trees Since 1985”)
  - phone ((215) 956-9990)
  - email (office@willowtreeservice.com)
  - website (willowtreeservice.com)
  - address (411 South Warminster Rd, Hatboro PA 19040)
  - service description (Beech Leaf Disease macro-injection on trees #1, 5, 6)
  - treatment efficacy (90-95%)
  - next-treatment window (2 years)
  - invoice number
  - payment status

All of that became a full contractor record + 2027 calendar reminder + cross-reference to the garden-maintenance shortlist in ~3 minutes. None of it would have happened without the manual PDF read.

The fix (proposed shape)

Add a PDF-handling step BEFORE the vision call:

# pa_receipt_vision_pipeline.py: new branch
import tempfile
from pathlib import Path

import pdf2image  # Python wrapper around poppler's pdftoppm

if receipt_path.suffix.lower() == ".pdf":
    # Rasterize each page to a PIL image
    pages = pdf2image.convert_from_path(receipt_path, dpi=200)
    page_results = []
    # A per-receipt temp dir avoids name collisions in /tmp and is cleaned up on exit
    with tempfile.TemporaryDirectory() as tmp_dir:
        for i, page_image in enumerate(pages, start=1):
            # Save page as temp PNG
            tmp_png = Path(tmp_dir) / f"{receipt_path.stem}_page{i}.png"
            page_image.save(tmp_png, "PNG")
            # Send through the SAME Haiku vision pipeline that handles JPGs
            page_results.append(call_vision_haiku(str(tmp_png)))
    # Merge per-page results into one sidecar (line items concatenated, vendor from first page)
    merged = merge_pdf_page_results(page_results)
    write_sidecar(receipt_path, merged)
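The branch above leans on merge_pdf_page_results, which the sketch does not define. A minimal sketch of one way it could work follows, assuming each page result is a dict with hypothetical keys vendor, date, total and line_items (the real sidecar schema may differ). The all_pdf_pages_failed error code is also an assumption, not an existing pipeline value.

```python
from typing import Any


def merge_pdf_page_results(page_results: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge per-page vision results into one sidecar dict (hypothetical schema)."""
    good = [p for p in page_results if "_error" not in p]
    if not good:
        # Every page failed: keep the error visible rather than writing an empty record
        return {"_error": "all_pdf_pages_failed", "page_count": len(page_results)}

    merged: dict[str, Any] = {"page_count": len(page_results)}
    # Header-style fields: the first page with a non-empty value wins
    for key in ("vendor", "date", "total"):
        merged[key] = next((p[key] for p in good if p.get(key)), None)
    # Line items are concatenated in page order
    merged["line_items"] = [item for p in good for item in p.get("line_items", [])]
    return merged
```

Taking the first non-empty value rather than strictly page 1 keeps "vendor from first page" in the common case and still picks up a total that only appears on the last page of a multi-page invoice.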

Tooling options:

- pdf2image (Python wrapper around pdftoppm): cleanest API
- pdftoppm (poppler CLI): no Python dep, just shell out
- pymupdf: text-extraction fallback for text-PDFs (faster, no vision needed when the PDF has selectable text)
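For the shell-out option, a rough sketch of what calling pdftoppm directly might look like. It assumes poppler's pdftoppm is on PATH. The function names and the "page" output prefix are illustrative and not part of the existing pipeline.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path: Path, out_prefix: Path, dpi: int = 200) -> list[str]:
    """Build the pdftoppm invocation: one PNG per page at the given resolution."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def rasterize_with_pdftoppm(pdf_path: Path, out_dir: Path, dpi: int = 200) -> list[Path]:
    """Run pdftoppm and return the page PNGs it wrote."""
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / "page"
    # check=True raises CalledProcessError if pdftoppm exits non-zero
    subprocess.run(pdftoppm_command(pdf_path, prefix, dpi), check=True)
    # Outputs are named <prefix>-<n>.png; pdftoppm pads n to a fixed width,
    # so a lexicographic sort should also be page order
    return sorted(out_dir.glob("page-*.png"))
```

Passing the arguments as a list (no shell=True) means receipt filenames with spaces or quotes need no escaping.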

Hybrid approach is probably the right shape:

1. First try pymupdf to extract text (free, fast)
2. If text extraction yields zero / mostly-empty (image-only PDF), fall back to rasterize + vision
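The routing decision in the hybrid approach can be sketched as below. The 40-characters-per-page threshold and the function names are assumptions to be tuned against real receipts. How extracted text then gets turned into a structured sidecar is left open here, as it is in the sketch above.

```python
from pathlib import Path

MIN_CHARS_PER_PAGE = 40  # assumed threshold; tune against real receipts


def needs_vision_fallback(page_texts: list[str], min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """True when extracted text is empty or mostly empty (likely an image-only PDF)."""
    if not page_texts:
        return True
    # Count non-whitespace characters so blank lines and padding do not count as text
    chars = sum(len("".join(t.split())) for t in page_texts)
    return chars < min_chars * len(page_texts)


def extract_page_texts(pdf_path: Path) -> list[str]:
    """Layer 1: pull selectable text per page with pymupdf (no vision call)."""
    import pymupdf  # imported lazily; older releases expose the same API as `fitz`

    with pymupdf.open(str(pdf_path)) as doc:
        return [page.get_text() for page in doc]


def choose_path(pdf_path: Path) -> str:
    """Return "text" when the text layer is usable, else "vision"."""
    return "vision" if needs_vision_fallback(extract_page_texts(pdf_path)) else "text"
```

Keeping the threshold check separate from the pymupdf call means the fallback rule can be unit-tested on plain strings without any PDF fixtures.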

Cost estimate

Path	Per receipt	100-PDF corpus	Notes
pymupdf text extraction	$0 (local CPU)	$0	Works for ~60% of receipts (text-PDFs from invoice generators)
Rasterize + Haiku vision (Batches)	~$0.001/page · most receipts 1-2 pages	~$0.10-0.30	Catches the image-only PDFs the text path misses
Combined	$0 to $0.003	$0.10-0.30	Backfill the whole Harvest PDF corpus for <$1
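The arithmetic behind the table can be checked with a small helper. This is a back-of-envelope sketch: the ~$0.001/page rate and the ~60% text-PDF share come from the table, while the function name and parameters are illustrative.

```python
def backfill_cost(n_pdfs: int, pages_per_pdf: float, text_share: float,
                  usd_per_page: float = 0.001) -> float:
    """Estimated vision spend in USD; text-PDFs cost nothing because they stay local."""
    vision_pdfs = n_pdfs * (1.0 - text_share)
    return vision_pdfs * pages_per_pdf * usd_per_page


# Every PDF sent through vision at 1 to 3 pages each: the table's $0.10-0.30 range
low = backfill_cost(100, 1, text_share=0.0)
high = backfill_cost(100, 3, text_share=0.0)

# With ~60% handled by the text path, only the residue is billed
hybrid = backfill_cost(100, 2, text_share=0.6)
```

Under these assumptions the hybrid path lands around $0.08 for 100 two-page PDFs, so the $0.10-0.30 figure reads as an upper bound for the combined row.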

What it unlocks

Capability today (JPG/PNG only)	Capability after PDF extension
Jorge identity from 2024 JPG receipt → recovered	Willow Tree identity from 2025 PDF → also recovered
Hardware-store JPG receipts → line items extracted	Service-invoice PDFs (most contractors) → line items extracted
Maybe 40-50% of receipts indexed	~95%+ of receipts indexed
`grep -r "vendor" receipts/` returns partial	`grep -r "vendor" receipts/` returns complete

Files to touch

~/bin/pa_harvest_receipt_vision_batch.py (or wherever the pipeline lives) — add the PDF branch
Re-run against pa/properties/new-hope/expenses/receipts/**/*.pdf to backfill
Status JSON: extend pa/_status/receipt-ocr-coverage.status.json to count PDFs as a separate dimension (today they all count as _error)
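One possible shape for the separate PDF dimension in the status JSON is sketched below. The (suffix, sidecar) input pairs and the image/pdf and ok/error keys are assumptions; the existing layout of receipt-ocr-coverage.status.json is not shown in this note.

```python
from collections import Counter
from typing import Any, Iterable


def coverage_by_type(records: Iterable[tuple[str, dict[str, Any]]]) -> dict[str, dict[str, int]]:
    """Tally ok/error sidecars per file type; records are (suffix, sidecar) pairs."""
    counts: dict[str, Counter] = {}
    for suffix, sidecar in records:
        kind = "pdf" if suffix.lower() == ".pdf" else "image"
        status = "error" if "_error" in sidecar else "ok"
        counts.setdefault(kind, Counter())[status] += 1
    # Plain dicts so the result can be passed straight to json.dump
    return {kind: dict(c) for kind, c in counts.items()}
```

Splitting by file type makes the backfill visible: before it, PDFs show up almost entirely under error; after it, the pdf ok count should climb toward the image one.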

Cross-references

feedback_filling_in_blanks_substrate_payoff.md — this is a foundational-substrate upgrade; once shipped, every future “find vendor name from receipt” becomes a one-grep operation
parked_sketch_receipt_vision_ocr_vehicle_history_2026-05-24.md — the original parked-sketch that established the vision-OCR pattern; this is the PDF extension to that
feedback_two_layer_decision_architecture.md — the hybrid “text-extract first, vision-OCR fallback” maps cleanly to the two-layer pattern (rules first, LLM for residue)
feedback_one_batch_llm_beats_hand_iteration.md — Haiku batch over the residue keeps cost in pennies

The aphorism

Every PDF receipt the pipeline can’t read is a vendor identity hiding in plain sight. Tonight proved one is worth a full 40-year contractor record. The whole corpus is worth <$1 to unlock.

Source: parked_sketch_pdf_ocr_pipeline_extension_2026-05-25.md · Rendered 2026-07-15 13:05 UTC
