Vision-LLM OCR over Harvest receipt scans — unlock mileage-over-time + service substrate (parked 2026-05-24)

DARE.CO.UK · PARKED SKETCH · 2026-05-26

Mirrored from ~/.claude/.../memory/parked_sketch_receipt_vision_ocr_vehicle_history_2026-05-24.md. This is a design sketch parked for future build — read for context, not as a current deliverable.

Harvest “notes” fields are summary-only (“LR4”, “Replacement Wing Mirror LR4 via eBay”); the actual maintenance data (odometer, line items, fluids, wear measurements) lives INSIDE the receipt scan images. A vision-LLM batch pass over the 5,607 Harvest receipts costs ~$4 and reconstructs the entire vehicle-service substrate as structured JSON. First consumer: real mileage-over-time charts per vehicle. Second: cost-per-mile by window, service-interval drift detection, parts catalog by model.


Dan 2026-05-24: “the real catch, is extracting from Harvest scans, data inside receipts, that holds the key the milage-over-time unlock”

The problem this solves

Asked LR4 for its mileage. Harvest has 145 LR4-tagged receipts. Grep returned 0 with mileage readings — because the notes field is a 2-word summary, not the receipt content. The odometer is inscribed on the service invoice image, never typed into Harvest’s note field.

Result: today we ask Dan (“125,000ish? Actually 129,050. Wait, that’s the F-250.”), the truth lives inside images, and we can’t run any time-series analysis (cost-per-mile, service-interval drift, mileage acceleration) until the substrate moves from images to JSON.

The mechanism

Vision-LLM pass over receipt JPG/PDF scans. Per receipt, extract structured JSON:

{
  "vendor": "A&G Customs",
  "service_date": "2023-02-14",
  "odometer_mi": 128986,
  "vehicle_hint": "F-250",          // VIN snippet or plate or written model
  "line_items": [
    {"description": "Exhaust manifolds replacement", "amount": 1240.00},
    {"description": "Tie rod replacement", "amount": 420.50},
    {"description": "Brake fluid flush", "amount": 89.00},
    {"description": "Front shocks (pair)", "amount": 285.00},
    {"description": "2x Hankook Dynapro AT2 RF11 tires", "amount": 412.00},
    {"description": "PA state inspection", "amount": 42.00}
  ],
  "total": 3228.15,
  "fluids_noted": ["brake fluid flush"],
  "wear_measurements": {"front_brakes_mm": null, "tire_tread_mm": null},
  "next_service_due": null,
  "tech_notes": "spark plugs original — danger zone above 120k"
}

Cost (Haiku 4.5 Batches with vision)

Scope Receipts Cost
LR4 only 145 ~$0.10
All 3 vehicles ~400 ~$0.30
Full Harvest receipt corpus 5,607 ~$4

Per receipt: ~500-1500 input tokens (image) + ~300 output tokens. At Haiku Batches rates (no caching applicable per feedback_micro_gestures_pattern.md cost analysis) ≈ $0.0007 per receipt.

Handwriting is included — same model

Dan 2026-05-24 follow-up: “Add to the list, hand-written invoices from Jorge - will need a hand-writing-OCR-capability to infer and capture”

No special handwriting-OCR tooling needed. Haiku 4.5 vision handles printed receipts and handwritten invoices through the same API call. Legibility is the only constraint — clearly-written contractor receipts (Jorge garden work, Jose electrics, A&G Customs hand-noted services) extract just as well as printed POS receipts. Where handwriting is borderline, the prompt can ask the model to flag low-confidence fields.

What the substrate unlocks (downstream)

  1. Mileage-over-time chart per vehicle — actual odometer history with service-date x-axis. Not Dan-estimates.
  2. Real miles/year by year — was the BMW driven 17k in 2024 vs 8k in 2025? Now answerable.
  3. Cost-per-mile by window — service-cost density when accumulating miles vs parked.
  4. Service-interval drift detection — “last oil change 8k mi ago vs 5k recommended” alerts.
  5. Parts catalog per vehicle — which model has eaten the most brake pads / spark plugs / tires.
  6. Replacement-anniversary tracking — “tires due Feb 2027 based on 2023 install + 50k mi life”.
  7. Pre-2017 mileage gaps — even one old service receipt with odometer pegs the timeline.
  8. Future insurance/sale ammo — full service history reproducible from substrate for sale listing or claim packet.

Why now isn’t urgent (but soon is right)

Wait condition: nothing depends on the vehicle history TODAY. But the moment any of these surfaces: - Selling a vehicle (need maintenance history packet) - Insurance review (need service substantiation) - Deciding LR4 keep-vs-sell (cost-per-mile vs replacement) - BMW high-mileage service planning (when’s the next big spend predicted?)

…this becomes the load-bearing data. Build before the question is asked, not after.

Build approach

Step Effort Notes
Inventory receipt images in Harvest substrate 15 min Find paths, count, sample image quality
Write pa_harvest_receipts_vision_extract.py 1 hr Anthropic Batches API + vision; ~150 lines
Run batch on 3-vehicle subset first 20 min wall-time ~$0.30, validate JSON quality
Land per-receipt JSON sidecar at pa/properties/.../receipts/<id>.json 30 min Idempotent, re-runnable
Update vehicle pages to render mileage-over-time chart 45 min SVG line chart; reuse existing chart primitives
Full-corpus run + dashboards (savings, cost-per-mile, drift alerts) 2 hrs Additional surfaces
Subset for vehicles only ~3 hrs LR4 + BMW + F-250 mileage history charts live
Full vehicle service substrate + downstream surfaces ~5-6 hrs Everything above

Build trigger

Build when ANY: - Vehicle sale/keep decision is on the table - Insurance review window - Dan asks “what did the LR4 cost per mile in 2024?” - After the unified-toolkit-scaffold build completes (sibling priority)

Cross-references

Validation run — 2026-05-24 sample results

Ran pa_harvest_receipt_vision_batch.py on the 138-receipt service-shaped subset (vehicle services + handyman invoices, filtered via filename regex). Findings:

Metric Value
Receipts processed 138 (120 success · 18 rate-limit errors)
Wall time 229s (~4 min) at 3 workers + preprocessing
Cost $0.38 total · $0.0032/receipt
Odometer hit rate 15 of 120 successful = 12.5%
Per-image preprocessing benefit sips resize+compress: ~50% file size, comparable token reduction

Reconstructed mileage timeline from receipts alone:

Vehicle Span Pace (receipt-derived) Cross-check
LR4 23,770mi (Dec 2016) → 112,962mi (Jan 2025) 11,069 mi/yr ✓ ~125k Dan estimate now
BMW 67,428mi (Apr 2024) → 102,624mi (Feb 2026) 19,130 mi/yr ✓ “driven hard” matches
F-250 128,373mi (Oct 2020) → 128,986mi (Feb 2023) 267 mi/yr ✓ effectively parked

Data-quality issues to address before charting:

Two-layer architecture validated by Dan 2026-05-24

Dan: “OCR gets you 70% of the way, with JSON. Then level 2 clean-up gets you to 95%.”

The clean substrate that enables cross-vendor / cross-time product-level queries (“coffee in 2020 vs 2025 at Giant vs Trader Joe’s“) needs both layers:

Layer What it does Cost (full Harvest corpus)
L1 — Vision OCR Receipt image → {vendor, date, line_items[{description, amount}], total, ...}. Captures the raw printed text. ~$14 batches / $28 standard
L2 — Product normalization "MAXWELL HSE INST CFE 8OZ"{category: "Coffee", brand: "Maxwell House", type: "Instant", size_oz: 8, unit_price_oz: 1.12}. Enables clean joins across receipts. ~$10-20 batches
L3 — Query/dashboard Aggregations, comparisons, charts $0
Total for clean product substrate ~$25-35

Without L2, line-item search is grep-and-hope across receipt-printer abbreviations. With L2, it’s a real database.

Reusable pipeline-pattern emerging — vision-OCR ingest

This is the second “vision-LLM-as-substrate-extractor” pattern instance (Amazon emails were the first via Chrome screenshot → email-evidence). Generalizes to:

Domain Source images What gets extracted
Vehicle service Garage receipts, hand-written invoices Odometer, line items, fluids, wear
Insurance claims Inspector reports, contractor estimates Items, replacement costs, condition
Property deeds/surveys Scanned legal PDFs Plat, dimensions, easements
Tax returns / W2s / 1099s Scanned forms Line-item amounts, withholdings
Watch / camera certificates Appraisals, warranties Serial numbers, valuations
Bank statements / canceled checks PDF / image statements Transaction lines
Land registry Old paper records Parcel IDs, owner history

Worth a sibling memory: feedback_vision_ocr_ingest_pipeline.md capturing the generic pattern (filter by filename → preprocess via sips → vision-extract → JSON sidecar → optional L2 normalize → query substrate).

Primary target — audreyinc datasets (not pa vehicle receipts)

Dan 2026-05-24 follow-up: “Mostly, it’s for audreyinc - processing her datasets”

The vehicle-receipt sample run was the validation case. The real build is processing Audrey’s business data substrate. Vehicle reactivation only happens if a specific pa-side question lands; the bigger ROI is audreyinc-side.

Likely audreyinc image substrates worth processing (pending Dan/Audrey-confirmed inventory):

Substrate What L1 extracts What L2 normalizes Downstream queries
Vendor / supplier invoices vendor, date, line items, totals canonical supplier names, fabric/material/component categories, unit costs “Cost per yard of cotton across mills · year over year”
Trade-show / wholesale order forms customer, date, SKUs, quantities canonical customer names + customer territory + product category “Boutiques in NY ordering X over time”
Production / pattern notes (handwritten) style ID, sizing, fit notes, sample iterations canonical style code, season, collection “All fit revisions for style #423”
Customer / press contact cards (trade shows) name, brand, role, contact merged-into-CRM canonical contacts “Press contacts from Cologne 2024 we haven’t followed up with”
Wholesale / consignment statements retailer, period, units sold, payout retailer normalize + product joins “Top retailers by units · sell-through rates”
Returns / RMA notes (handwritten) order#, reason, condition canonical reason categories “Top return reasons by style · year”
Photoshoot logs shoot date, photographer, location, garments canonical garment SKUs “Which styles have lookbook coverage”
Archived paper sales records pre-Shopify every order line normalized SKUs “Pre-2015 catalog reconstruction for archive”

This connects to existing audreyinc work in memory:

Vehicle receipts — validation scope only

The 138-receipt run above was the proof-of-mechanism. The actual vehicle-side work (cleanup, charting on pa.gf.cx/vehicles/) only proceeds if a specific pa question lands:

  1. Selling a vehicle (need maintenance history packet)
  2. Insurance review window
  3. “What did the LR4 cost per mile in 2024?” or sibling question
  4. Kitchen-reno / claim-packet build needs the receipt substrate

Resume / build trigger

Status: PARKED post-validation 2026-05-24. Sample run proved the mechanism works at acceptable quality and cost on vehicle receipts. Wait condition before unparking:

  1. audreyinc data substrates inventoried — Dan/Audrey identify which boxes of paper/scanned material are worth processing first
  2. A specific audreyinc question that needs the substrate (“trade-show contacts from 2024 we never followed up with”, “wholesale sell-through by retailer”, “vendor cost per fabric category”)
  3. pa-side trigger (insurance / sell / claim) as above

When unparked for audreyinc: scope and cost depend on inventory size + which substrates (vendor invoices alone is a different size than full archive). The pipeline pattern (filter → preprocess → vision L1 → optional L2 → query) is the same; the per-substrate cost scales with image count.

The aphorism

The data isn’t missing. It’s just inside the image. Vision-LLM is the can-opener. The second pass is the categorizer that turns the can’s contents into ingredients.

Source: parked_sketch_receipt_vision_ocr_vehicle_history_2026-05-24.md · Rendered 2026-05-26 17:10