Extend vision-OCR pipeline to PDF receipts — Dan-flagged high priority (parked 2026-05-25)
DARE.CO.UK · PARKED SKETCH · 2026-05-31
Mirrored from ~/.claude/.../memory/parked_sketch_pdf_ocr_pipeline_extension_2026-05-25.md. This is a design sketch parked for future build — read for context, not as a current deliverable.
The vision-OCR receipt pipeline (Haiku-on-images) currently emits {"_error": "pdf_not_supported_by_vision_api"} for any PDF receipt. Tonight’s Willow Tree discovery proved the cost: a $1,500 tree-care invoice that fully captured the vendor identity (Willow Tree and Landscape Services, LLC · 40 years tenure · Hatboro PA · phone · email · domain · service line items) was invisible to the substrate until Claude read the PDF manually. Dan: “Worth extending the OCR pipeline to PDFs we need this!”
Fix shape: rasterize each PDF page via pdftoppm/pdf2image → run each page through the same Haiku vision endpoint the JPG/PNG pipeline uses → merge per-page JSON sidecars. Cost ~$0.001 per page via Haiku Batches; ~$1-3 to backfill the full Harvest PDF-receipt corpus. Multiple instances already proven valuable (Jorge identity via JPG-OCR; Willow Tree via manual PDF-read; ChipDrop receipt via Dan’s emailed-receipt screenshot).
Dan 2026-05-25: “100% — Worth extending the OCR pipeline to PDFs we need this!” — after seeing how a single PDF read revealed Willow Tree’s full identity (40 years, phone, website, email, service details) from a receipt the existing pipeline marked as pdf_not_supported_by_vision_api.
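The fix shape above (rasterize, OCR each page, merge) could look roughly like the sketch below. It is a minimal illustration, not the pipeline's actual code. The function names, the `ocr_page` callable standing in for the existing Haiku vision call, and the sidecar field names (`vendor`, `date`, `line_items`, `total`) are all assumptions made for the example. It shells out to poppler's `pdftoppm`; `pdf2image` would work equally well for the rasterize step.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable


def rasterize_pdf(pdf_path: Path, out_dir: Path, dpi: int = 200) -> list[Path]:
    """Render each PDF page to a PNG with poppler's pdftoppm, in page order."""
    prefix = out_dir / "page"
    subprocess.run(
        ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(prefix)],
        check=True,
    )
    # pdftoppm zero-pads page numbers to a common width,
    # so a lexicographic sort matches page order.
    return sorted(out_dir.glob("page-*.png"))


def merge_page_sidecars(pages: list[dict]) -> dict:
    """Merge per-page sidecars into one receipt-level sidecar.

    Assumed (hypothetical) merge rules: first non-empty vendor and date win,
    line items are concatenated in page order, and the last non-null total
    wins, on the guess that totals usually sit on the final page.
    """
    merged = {
        "vendor": None,
        "date": None,
        "line_items": [],
        "total": None,
        "page_count": len(pages),
        "page_errors": [],
    }
    for number, page in enumerate(pages, start=1):
        if "_error" in page:
            # Keep per-page failures visible instead of dropping the receipt.
            merged["page_errors"].append({"page": number, "error": page["_error"]})
            continue
        merged["vendor"] = merged["vendor"] or page.get("vendor")
        merged["date"] = merged["date"] or page.get("date")
        merged["line_items"].extend(page.get("line_items") or [])
        if page.get("total") is not None:
            merged["total"] = page["total"]
    return merged


def ocr_pdf_receipt(pdf_path: Path, ocr_page: Callable[[Path], dict]) -> dict:
    """Rasterize a PDF, run the supplied per-image OCR call, merge the results.

    `ocr_page` is a placeholder for whatever function the JPG/PNG pipeline
    already uses to turn one image into one sidecar dict.
    """
    with tempfile.TemporaryDirectory() as tmp:
        images = rasterize_pdf(pdf_path, Path(tmp))
        return merge_page_sidecars([ocr_page(image) for image in images])
```

Passing the vision call in as `ocr_page` keeps the PDF path a thin wrapper around the existing image pipeline, so the prompt, model, and batching choices stay in one place.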
The current gap (concrete)
The vision-OCR receipt pipeline runs against JPG/PNG receipts and emits structured sidecar JSON (vendor, line items, totals, dates). For PDFs, the sidecar is just: