Extend vision-OCR pipeline to PDF receipts — Dan-flagged high priority (parked 2026-05-25)
DARE.CO.UK · PARKED SKETCH · 2026-05-26
Mirrored from ~/.claude/.../memory/parked_sketch_pdf_ocr_pipeline_extension_2026-05-25.md. This is a design sketch parked for future build — read for context, not as a current deliverable.
The vision-OCR receipt pipeline (Haiku-on-images) currently emits {"_error": "pdf_not_supported_by_vision_api"} for any PDF receipt. Tonight’s Willow Tree discovery proved the cost: a $1,500 tree-care invoice that fully captured the vendor identity (Willow Tree and Landscape Services, LLC · 40 years tenure · Hatboro PA · phone · email · domain · service line items) was invisible to the substrate until Claude read the PDF manually. Dan: “Worth extending the OCR pipeline to PDFs we need this!” Fix shape: rasterize each PDF page via pdftoppm/pdf2image → run each page through the same Haiku vision endpoint the JPG/PNG pipeline uses → merge per-page JSON sidecars. Cost ~$0.001 per page via Haiku Batches. ~$1-3 to backfill the full Harvest PDF-receipt corpus. Multiple instances already proven valuable (Jorge identity via JPG-OCR; Willow Tree via manual PDF-read; ChipDrop receipt via Dan’s emailed-receipt screenshot).
Dan 2026-05-25: “100% — Worth extending the OCR pipeline to PDFs we need this!” — after seeing how a single PDF read revealed Willow Tree’s full identity (40 years, phone, website, email, service details) from a receipt the existing pipeline marked as pdf_not_supported_by_vision_api.
The current gap (concrete)
The vision-OCR receipt pipeline runs against JPG/PNG receipts and emits structured sidecar JSON (vendor, line items, totals, dates). For PDFs, the sidecar is just:
{"_error": "pdf_not_supported_by_vision_api"}
So any receipt the property owner uploaded as PDF is invisible to:
- Vendor identification (the contractor record-builder)
- Line-item extraction (per-job spend granularity)
- Cross-receipt grep (grep -r "Willow Tree" receipts/ returns 0 hits)
- Future analytics (per-vendor totals, service-frequency)
Tonight’s proof point
The Willow Tree and Landscape Services discovery:
- Harvest entry said only "Tree Surgery" as vendor
- Receipt was a PDF: Invoice_42138_6279_Greenhill_Road.pdf
- Pipeline sidecar: {"_error": "pdf_not_supported_by_vision_api"}
- Manual Read of the PDF surfaced: company name (Willow Tree and Landscape Services, LLC), 40-year tenure tagline (“Rooting Relationships in Trees Since 1985”), phone ((215) 956-9990), email (office@willowtreeservice.com), website (willowtreeservice.com), address (411 South Warminster Rd, Hatboro PA 19040), service description (Beech Leaf Disease macro-injection on trees #1, 5, 6), treatment efficacy (90-95%), next-treatment window (2 years), invoice number, payment status
All of that became a full contractor record + 2027 calendar reminder + cross-reference to the garden-maintenance shortlist in ~3 minutes. None of it would have happened without the manual PDF read.
The fix (proposed shape)
Add a PDF-handling step BEFORE the vision call:
# pa_receipt_vision_pipeline.py: new branch
import pdf2image  # Python wrapper around poppler's pdftoppm

if receipt_path.suffix.lower() == ".pdf":
    # Rasterize each page to a PIL image
    pages = pdf2image.convert_from_path(receipt_path, dpi=200)
    page_results = []
    for i, page_image in enumerate(pages):
        # Save the page as a temp PNG
        tmp_png = f"/tmp/{receipt_path.stem}_page{i+1}.png"
        page_image.save(tmp_png, "PNG")
        # Send through the SAME Haiku vision pipeline that handles JPGs
        page_results.append(call_vision_haiku(tmp_png))
    # Merge per-page results into one sidecar (line items concatenated, vendor from first page)
    merged = merge_pdf_page_results(page_results)
    write_sidecar(receipt_path, merged)
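The merge step is only named above, not defined. A minimal sketch of what it might look like, assuming each per-page result is a dict carrying the fields the sidecar is described as holding (vendor, line items, totals, dates). The key names `vendor`, `line_items`, `total`, `date`, `pages` and the `no_page_parsed` error string are illustrative assumptions, not the pipeline's confirmed schema:

```python
from typing import Any


def merge_pdf_page_results(page_results: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge per-page vision results into one sidecar dict.

    Assumed (hypothetical) page schema: vendor, line_items, total, date.
    """
    # Pages that failed carry an "_error" key; leave them out of the merge.
    good = [p for p in page_results if "_error" not in p]
    if not good:
        return {"_error": "no_page_parsed", "pages": len(page_results)}

    return {
        # Vendor from the first page that reports one (normally page 1).
        "vendor": next((p["vendor"] for p in good if p.get("vendor")), None),
        # Line items concatenated in page order.
        "line_items": [item for p in good for item in p.get("line_items", [])],
        # Totals tend to sit on the last page, so take the last one seen.
        "total": next(
            (p["total"] for p in reversed(good) if p.get("total") is not None),
            None,
        ),
        "date": next((p["date"] for p in good if p.get("date")), None),
        "pages": len(page_results),
    }
```

Taking the total from the last page rather than the first is a judgment call for multi-page invoices; a real implementation may want to cross-check it against the summed line items.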
Tooling options:
- pdf2image (Python wrapper around pdftoppm) — cleanest API
- pdftoppm (poppler CLI) — no Python dep, just shell out
- pymupdf — text-extraction fallback for text-PDFs (faster, no vision needed when the PDF has selectable text)
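For the shell-out option, a rough sketch of calling pdftoppm directly might look like the following. It assumes poppler's `pdftoppm` is on the PATH; the function names and the output-directory layout are hypothetical. The `-png` and `-r` flags are standard pdftoppm options.

```python
import glob
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path: Path, out_prefix: Path, dpi: int = 200) -> list[str]:
    """Build the pdftoppm invocation: one PNG per page at the given resolution."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def rasterize_with_pdftoppm(pdf_path: Path, out_dir: Path, dpi: int = 200) -> list[Path]:
    """Rasterize every page of pdf_path into out_dir; return the PNGs in page order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / pdf_path.stem
    subprocess.run(pdftoppm_command(pdf_path, prefix, dpi), check=True)
    # pdftoppm writes <prefix>-<page>.png and zero-pads the page number to a
    # common width, so a lexical sort is also page order.
    pattern = glob.escape(pdf_path.stem) + "-*.png"
    return sorted(out_dir.glob(pattern))
```

The returned paths can be fed to the same vision call the JPG/PNG receipts already use, which keeps pdf2image out of the dependency list at the cost of managing the temp files by hand.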
Hybrid approach is probably the right shape:
1. First try pymupdf to extract text (free, fast)
2. If text extraction yields zero / mostly-empty (image-only PDF), fall back to rasterize + vision
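The two steps above can be sketched as a small router. This is a sketch under assumptions: the `min_chars` threshold is an illustrative guess rather than a tuned value, the function names are hypothetical, and PyMuPDF is imported lazily so the routing logic can be exercised without it installed.

```python
def has_usable_text(page_texts: list[str], min_chars: int = 40) -> bool:
    """True when the extracted text looks like a real text layer.

    min_chars is an illustrative threshold, not a tuned value.
    """
    total = sum(len(t.strip()) for t in page_texts)
    return total >= min_chars


def extract_pdf_text(pdf_path: str) -> list[str]:
    """Return the text of each page using PyMuPDF."""
    import fitz  # PyMuPDF; newer releases also allow `import pymupdf`

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def choose_path(page_texts: list[str]) -> str:
    """Route a PDF: 'text' for selectable-text PDFs, 'vision' for image-only ones."""
    return "text" if has_usable_text(page_texts) else "vision"
```

A caller would run `choose_path(extract_pdf_text(path))` and only rasterize when the answer is `"vision"`. Note that the text path yields raw text; turning it into the structured sidecar fields is a separate step this sketch does not cover.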
Cost estimate
| Path | Per receipt | 100-PDF corpus | Notes |
|---|---|---|---|
| pymupdf text extraction | $0 (local CPU) | $0 | Works for ~60% of receipts (text-PDFs from invoice generators) |
| Rasterize + Haiku vision (Batches) | ~$0.001/page · most receipts 1-2 pages | ~$0.10-0.30 | Catches the image-only PDFs the text path misses |
| Combined | $0 to $0.003 | $0.10-0.30 | Backfill the whole Harvest PDF corpus for <$1 |
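The arithmetic behind the table can be checked with a few lines. This is a back-of-envelope sketch using the table's ~$0.001/page and ~60% text-PDF share; the 1 to 3 pages-per-receipt range is an assumption chosen because it reproduces the table's $0.10-0.30 figure, and the function name is hypothetical.

```python
def vision_cost(pdf_count: int, pages_per_pdf: float, text_share: float = 0.0,
                cost_per_page: float = 0.001) -> float:
    """Estimated vision spend in dollars for the PDFs that need rasterizing."""
    needing_vision = pdf_count * (1.0 - text_share)
    return round(needing_vision * pages_per_pdf * cost_per_page, 4)


# Upper bound: all 100 PDFs go through vision, 1 to 3 pages each.
print(vision_cost(100, 1), vision_cost(100, 3))
# Hybrid: ~60% handled by text extraction, only the rest through vision.
print(vision_cost(100, 1, text_share=0.6), vision_cost(100, 3, text_share=0.6))
```

Under these assumptions the vision-only bound is $0.10-0.30 per 100 PDFs, and the hybrid path lands lower, around $0.04-0.12, so the table's combined figure reads as a ceiling rather than an expectation.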
What it unlocks
| Capability today (JPG/PNG only) | Capability after PDF extension |
|---|---|
| Jorge identity from 2024 JPG receipt → recovered | Willow Tree identity from 2025 PDF → also recovered |
| Hardware-store JPG receipts → line items extracted | Service-invoice PDFs (most contractors) → line items extracted |
| Maybe 40-50% of receipts indexed | ~95%+ of receipts indexed |
| grep -r "vendor" receipts/ returns partial | grep -r "vendor" receipts/ returns complete |
Files to touch
- ~/bin/pa_harvest_receipt_vision_batch.py (or wherever the pipeline lives): add the PDF branch
- Re-run against pa/properties/new-hope/expenses/receipts/**/*.pdf to backfill
- Status JSON: extend pa/_status/receipt-ocr-coverage.status.json to count PDFs as a separate dimension (today they all count as _error)
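One way the "separate dimension" in the status JSON could be computed is sketched below. The current schema of receipt-ocr-coverage.status.json is not shown in this note, so the output shape (`ok` / `error` counts per file extension) and the function name are hypothetical; the only thing taken from the pipeline is that failed sidecars carry an `_error` key.

```python
from collections import Counter
from pathlib import Path
from typing import Any


def coverage_by_type(sidecars: dict[str, dict[str, Any]]) -> dict[str, dict[str, int]]:
    """Count parsed vs. errored sidecars per receipt file type.

    sidecars maps a receipt filename to its parsed sidecar JSON.
    """
    counts: dict[str, Counter] = {}
    for receipt_name, sidecar in sidecars.items():
        # Normalize ".PDF" and ".pdf" to the same bucket.
        ext = Path(receipt_name).suffix.lower().lstrip(".") or "unknown"
        outcome = "error" if "_error" in sidecar else "ok"
        counts.setdefault(ext, Counter())[outcome] += 1
    return {
        ext: {"ok": c["ok"], "error": c["error"]}
        for ext, c in sorted(counts.items())
    }
```

With PDFs broken out this way, the backfill's progress shows up as the `pdf` error count falling toward zero instead of being folded into a single `_error` total.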
Cross-references
- feedback_filling_in_blanks_substrate_payoff.md: this is a foundational-substrate upgrade; once shipped, every future “find vendor name from receipt” becomes a one-grep operation
- parked_sketch_receipt_vision_ocr_vehicle_history_2026-05-24.md: the original parked sketch that established the vision-OCR pattern; this is the PDF extension to that
- feedback_two_layer_decision_architecture.md: the hybrid “text-extract first, vision-OCR fallback” maps cleanly to the two-layer pattern (rules first, LLM for residue)
- feedback_one_batch_llm_beats_hand_iteration.md: Haiku batch over the residue keeps cost in pennies
The aphorism
Every PDF receipt the pipeline can’t read is a vendor identity hiding in plain sight. Tonight proved one is worth a full 40-year contractor record. The whole corpus is worth <$1 to unlock.