Extend vision-OCR pipeline to PDF receipts — Dan-flagged high priority (parked 2026-05-25)

DARE.CO.UK · PARKED SKETCH · 2026-05-26

Mirrored from ~/.claude/.../memory/parked_sketch_pdf_ocr_pipeline_extension_2026-05-25.md. This is a design sketch parked for future build — read for context, not as a current deliverable.

The vision-OCR receipt pipeline (Haiku-on-images) currently emits {"_error": "pdf_not_supported_by_vision_api"} for any PDF receipt. Tonight’s Willow Tree discovery proved the cost — a $1,500 tree-care invoice that fully captured the vendor identity (Willow Tree and Landscape Services, LLC · 40 years tenure · Hatboro PA · phone · email · domain · service line items) was invisible to the substrate until Claude read the PDF manually. Dan: “Worth extending the OCR pipeline to PDFs we need this!” Fix shape: rasterize each PDF page via pdftoppm / pdf2image → run each page through the same Haiku vision endpoint the JPG/PNG pipeline uses → merge per-page JSON sidecars. Cost ~$0.001 per page via Haiku Batches. ~$1-3 to backfill the full Harvest PDF-receipt corpus. Multiple instances already proven valuable (Jorge identity via JPG-OCR; Willow Tree via manual PDF-read; ChipDrop receipt via Dan’s emailed-receipt screenshot).


Dan 2026-05-25: “100% — Worth extending the OCR pipeline to PDFs we need this!” — after seeing how a single PDF read revealed Willow Tree’s full identity (40 years, phone, website, email, service details) from a receipt the existing pipeline marked as pdf_not_supported_by_vision_api.

The current gap (concrete)

The vision-OCR receipt pipeline runs against JPG/PNG receipts and emits structured sidecar JSON (vendor, line items, totals, dates). For PDFs, the sidecar is just:

{"_error": "pdf_not_supported_by_vision_api"}

So any receipt the property owner uploaded as PDF is invisible to:
- Vendor identification (the contractor record-builder)
- Line-item extraction (per-job spend granularity)
- Cross-receipt grep (grep -r "Willow Tree" receipts/ returns 0 hits)
- Future analytics (per-vendor totals, service-frequency)
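To size the gap before building anything, a small scan can list the sidecars that currently hold only the PDF error. This is a minimal sketch: it assumes sidecars are .json files somewhere under the receipts directory, and the function name find_unread_pdf_sidecars is hypothetical, not part of the existing pipeline.

```python
import json
from pathlib import Path

# Error marker the pipeline currently writes for PDF receipts
PDF_ERROR = "pdf_not_supported_by_vision_api"


def find_unread_pdf_sidecars(receipts_dir):
    """Return sidecar paths whose "_error" field is the PDF-not-supported marker."""
    unread = []
    for sidecar in sorted(Path(receipts_dir).rglob("*.json")):
        try:
            data = json.loads(sidecar.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed sidecars
        if isinstance(data, dict) and data.get("_error") == PDF_ERROR:
            unread.append(sidecar)
    return unread
```

The length of the returned list is the number of receipts a backfill would need to reprocess.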

Tonight’s proof point

The Willow Tree and Landscape Services discovery:
- Harvest entry said only "Tree Surgery" as vendor
- Receipt was a PDF: Invoice_42138_6279_Greenhill_Road.pdf
- Pipeline sidecar: {"_error": "pdf_not_supported_by_vision_api"}
- Manual Read of the PDF surfaced: company name (Willow Tree and Landscape Services, LLC), 40-year tenure tagline (“Rooting Relationships in Trees Since 1985”), phone ((215) 956-9990), email (office@willowtreeservice.com), website (willowtreeservice.com), address (411 South Warminster Rd, Hatboro PA 19040), service description (Beech Leaf Disease macro-injection on trees #1, 5, 6), treatment efficacy (90-95%), next-treatment window (2 years), invoice number, payment status

All of that became a full contractor record + 2027 calendar reminder + cross-reference to the garden-maintenance shortlist in ~3 minutes. None of it would have happened without the manual PDF read.

The fix (proposed shape)

Add a PDF-handling step BEFORE the vision call:

# pa_receipt_vision_pipeline.py — new branch
import tempfile
from pathlib import Path

import pdf2image  # Python wrapper around poppler's pdftoppm

if receipt_path.suffix.lower() == ".pdf":
    # Rasterize each page to a PIL image
    pages = pdf2image.convert_from_path(receipt_path, dpi=200)
    page_results = []
    # A temporary directory avoids name collisions in /tmp and cleans itself up
    with tempfile.TemporaryDirectory() as tmp_dir:
        for i, page_image in enumerate(pages, start=1):
            # Save page as temp PNG
            tmp_png = Path(tmp_dir) / f"{receipt_path.stem}_page{i}.png"
            page_image.save(tmp_png, "PNG")
            # Send through the SAME Haiku vision pipeline that handles JPGs
            page_results.append(call_vision_haiku(str(tmp_png)))
    # Merge per-page results into one sidecar (line items concatenated, vendor from first page)
    merged = merge_pdf_page_results(page_results)
    write_sidecar(receipt_path, merged)
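
The merge step is only named above, so here is one possible shape for merge_pdf_page_results, following the stated rule (line items concatenated, vendor from the first page that has one). The key names "vendor", "line_items" and "total", the "page_count" field, the "no_pages_parsed" error code, and taking the total from the last page that reports one are all assumptions for illustration. The real sidecar schema may differ.

```python
def merge_pdf_page_results(page_results):
    """Merge per-page sidecar dicts into one receipt-level sidecar."""
    # Ignore pages where the vision call failed (marked with "_error")
    good = [p for p in page_results if p and "_error" not in p]
    if not good:
        # Hypothetical error code: no page produced usable output
        return {"_error": "no_pages_parsed", "page_count": len(page_results)}
    return {
        # Vendor from the first page that reports one
        "vendor": next((p["vendor"] for p in good if p.get("vendor")), None),
        # Line items concatenated in page order
        "line_items": [item for p in good for item in p.get("line_items", [])],
        # Assumption: the grand total appears on the last page that has a total
        "total": next(
            (p["total"] for p in reversed(good) if p.get("total") is not None),
            None,
        ),
        "page_count": len(page_results),
    }
```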

Tooling options:
- pdf2image (Python wrapper around pdftoppm): cleanest API
- pdftoppm (poppler CLI): no Python dep, just shell out
- pymupdf: text-extraction fallback for text-PDFs (faster, no vision needed when the PDF has selectable text)
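
For the shell-out option, a rough sketch of calling pdftoppm directly might look like this. The -png and -r flags are standard pdftoppm options. The helper names pdftoppm_command and rasterize_with_pdftoppm are hypothetical, and the sketch assumes poppler is installed and on PATH.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, dpi=200):
    """Build the pdftoppm argv: PNG output at the given resolution."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def rasterize_with_pdftoppm(pdf_path, out_dir, dpi=200):
    """Run pdftoppm and return the page PNGs it wrote, in page order."""
    out_dir = Path(out_dir)
    prefix = out_dir / Path(pdf_path).stem
    subprocess.run(pdftoppm_command(pdf_path, prefix, dpi), check=True)
    # pdftoppm names outputs <prefix>-<page>.png and zero-pads the page
    # number, so a lexicographic sort keeps pages in order
    return sorted(out_dir.glob(f"{prefix.name}-*.png"))
```

Keeping the command builder separate from the subprocess call makes the argv easy to check without poppler installed.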

Hybrid approach is probably the right shape:
1. First try pymupdf to extract text (free, fast)
2. If text extraction yields zero / mostly-empty (image-only PDF), fall back to rasterize + vision
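
The two steps above could be sketched roughly as follows. The 50-character threshold and the names has_usable_text, extract_pdf_text and choose_path are assumptions for illustration. The pymupdf import is deferred so the decision logic can run without the library, and older PyMuPDF releases expose the same module as fitz. Turning the extracted text into a structured sidecar is outside this sketch.

```python
def has_usable_text(page_texts, min_chars=50):
    """True when the extracted text is substantial enough to skip vision."""
    # min_chars is an assumed threshold for "mostly-empty"; tune on real receipts
    return sum(len(t.strip()) for t in page_texts) >= min_chars


def extract_pdf_text(pdf_path):
    """Return one text string per page via PyMuPDF (imported lazily)."""
    import pymupdf  # older releases: import fitz

    with pymupdf.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def choose_path(pdf_path, extract=extract_pdf_text):
    """Return ("text", pages) for text-PDFs, else ("vision", None)."""
    page_texts = extract(pdf_path)
    if has_usable_text(page_texts):
        return "text", page_texts
    # Image-only PDF: caller falls back to rasterize + vision
    return "vision", None
```

The extract parameter exists only so the routing can be exercised with a stub instead of a real PDF.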

Cost estimate

| Path | Per receipt | 100-PDF corpus | Notes |
|---|---|---|---|
| pymupdf text extraction | $0 (local CPU) | $0 | Works for ~60% of receipts (text-PDFs from invoice generators) |
| Rasterize + Haiku vision (Batches) | ~$0.001/page · most receipts 1-2 pages | ~$0.10-0.30 | Catches the image-only PDFs the text path misses |
| Combined | $0 to $0.003 | $0.10-0.30 | Backfill the whole Harvest PDF corpus for <$1 |
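
The arithmetic behind these figures can be reproduced with a small estimator. It is a sketch only: the ~$0.001/page rate and the ~60% text-PDF share come from the estimates above, while the function name estimate_backfill_cost and its parameters are hypothetical.

```python
def estimate_backfill_cost(n_pdfs, pages_per_pdf, text_share=0.0, cost_per_page=0.001):
    """Estimated USD cost; only PDFs without extractable text reach the vision path."""
    vision_pdfs = n_pdfs * (1.0 - text_share)
    return vision_pdfs * pages_per_pdf * cost_per_page
```

With no text path at all, 100 PDFs at 1 to 3 pages each come to roughly $0.10 to $0.30, the range quoted for the vision path. Routing ~60% of receipts through free text extraction would lower that further.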

What it unlocks

| Capability today (JPG/PNG only) | Capability after PDF extension |
|---|---|
| Jorge identity from 2024 JPG receipt → recovered | Willow Tree identity from 2025 PDF → also recovered |
| Hardware-store JPG receipts → line items extracted | Service-invoice PDFs (most contractors) → line items extracted |
| Maybe 40-50% of receipts indexed | ~95%+ of receipts indexed |
| grep -r "vendor" receipts/ returns partial | grep -r "vendor" receipts/ returns complete |

Files to touch

Cross-references

The aphorism

Every PDF receipt the pipeline can’t read is a vendor identity hiding in plain sight. Tonight proved one is worth a full 40-year contractor record. The whole corpus is worth <$1 to unlock.

Source: parked_sketch_pdf_ocr_pipeline_extension_2026-05-25.md · Rendered 2026-05-26 17:10