Vision-LLM OCR over Harvest receipt scans — unlock mileage-over-time + service substrate (parked 2026-05-24)
DARE.CO.UK · PARKED SKETCH · 2026-05-26
Mirrored from ~/.claude/.../memory/parked_sketch_receipt_vision_ocr_vehicle_history_2026-05-24.md. This is a design sketch parked for future build — read for context, not as a current deliverable.
Harvest “notes” fields are summary-only (“LR4”, “Replacement Wing Mirror LR4 via eBay”); the actual maintenance data (odometer, line items, fluids, wear measurements) lives INSIDE the receipt scan images. A vision-LLM batch pass over the 5,607 Harvest receipts costs ~$4 and reconstructs the entire vehicle-service substrate as structured JSON. First consumer: real mileage-over-time charts per vehicle. Second: cost-per-mile by window, service-interval drift detection, parts catalog by model.
Dan 2026-05-24: “the real catch, is extracting from Harvest scans, data inside receipts, that holds the key the milage-over-time unlock”
The problem this solves
Asked LR4 for its mileage. Harvest has 145 LR4-tagged receipts. Grep returned 0 with mileage readings — because the notes field is a 2-word summary, not the receipt content. The odometer is inscribed on the service invoice image, never typed into Harvest’s note field.
Result: today we ask Dan (“125,000ish? Actually 129,050. Wait, that’s the F-250.”), the truth lives inside images, and we can’t run any time-series analysis (cost-per-mile, service-interval drift, mileage acceleration) until the substrate moves from images to JSON.
The mechanism
Vision-LLM pass over receipt JPG/PDF scans. Per receipt, extract structured JSON:
{
"vendor": "A&G Customs",
"service_date": "2023-02-14",
"odometer_mi": 128986,
"vehicle_hint": "F-250", // VIN snippet or plate or written model
"line_items": [
{"description": "Exhaust manifolds replacement", "amount": 1240.00},
{"description": "Tie rod replacement", "amount": 420.50},
{"description": "Brake fluid flush", "amount": 89.00},
{"description": "Front shocks (pair)", "amount": 285.00},
{"description": "2x Hankook Dynapro AT2 RF11 tires", "amount": 412.00},
{"description": "PA state inspection", "amount": 42.00}
],
"total": 3228.15,
"fluids_noted": ["brake fluid flush"],
"wear_measurements": {"front_brakes_mm": null, "tire_tread_mm": null},
"next_service_due": null,
"tech_notes": "spark plugs original — danger zone above 120k"
}
Cost (Haiku 4.5 Batches with vision)
| Scope | Receipts | Cost |
|---|---|---|
| LR4 only | 145 | ~$0.10 |
| All 3 vehicles | ~400 | ~$0.30 |
| Full Harvest receipt corpus | 5,607 | ~$4 |
Per receipt: ~500-1500 input tokens (image) + ~300 output tokens. At Haiku Batches rates (no caching applicable per feedback_micro_gestures_pattern.md cost analysis) ≈ $0.0007 per receipt.
Handwriting is included — same model
Dan 2026-05-24 follow-up: “Add to the list, hand-written invoices from Jorge - will need a hand-writing-OCR-capability to infer and capture”
No special handwriting-OCR tooling needed. Haiku 4.5 vision handles printed receipts and handwritten invoices through the same API call. Legibility is the only constraint — clearly-written contractor receipts (Jorge garden work, Jose electrics, A&G Customs hand-noted services) extract just as well as printed POS receipts. Where handwriting is borderline, the prompt can ask the model to flag low-confidence fields.
What the substrate unlocks (downstream)
- Mileage-over-time chart per vehicle — actual odometer history with service-date x-axis. Not Dan-estimates.
- Real miles/year by year — was the BMW driven 17k in 2024 vs 8k in 2025? Now answerable.
- Cost-per-mile by window — service-cost density when accumulating miles vs parked.
- Service-interval drift detection — “last oil change 8k mi ago vs 5k recommended” alerts.
- Parts catalog per vehicle — which model has eaten the most brake pads / spark plugs / tires.
- Replacement-anniversary tracking — “tires due Feb 2027 based on 2023 install + 50k mi life”.
- Pre-2017 mileage gaps — even one old service receipt with odometer pegs the timeline.
- Future insurance/sale ammo — full service history reproducible from substrate for sale listing or claim packet.
Why now isn’t urgent (but soon is right)
Wait condition: nothing depends on the vehicle history TODAY. But the moment any of these surfaces: - Selling a vehicle (need maintenance history packet) - Insurance review (need service substantiation) - Deciding LR4 keep-vs-sell (cost-per-mile vs replacement) - BMW high-mileage service planning (when’s the next big spend predicted?)
…this becomes the load-bearing data. Build before the question is asked, not after.
Build approach
| Step | Effort | Notes |
|---|---|---|
| Inventory receipt images in Harvest substrate | 15 min | Find paths, count, sample image quality |
Write pa_harvest_receipts_vision_extract.py |
1 hr | Anthropic Batches API + vision; ~150 lines |
| Run batch on 3-vehicle subset first | 20 min wall-time | ~$0.30, validate JSON quality |
Land per-receipt JSON sidecar at pa/properties/.../receipts/<id>.json |
30 min | Idempotent, re-runnable |
| Update vehicle pages to render mileage-over-time chart | 45 min | SVG line chart; reuse existing chart primitives |
| Full-corpus run + dashboards (savings, cost-per-mile, drift alerts) | 2 hrs | Additional surfaces |
| Subset for vehicles only | ~3 hrs | LR4 + BMW + F-250 mileage history charts live |
| Full vehicle service substrate + downstream surfaces | ~5-6 hrs | Everything above |
Build trigger
Build when ANY: - Vehicle sale/keep decision is on the table - Insurance review window - Dan asks “what did the LR4 cost per mile in 2024?” - After the unified-toolkit-scaffold build completes (sibling priority)
Cross-references
parked_sketch_amazon_5pct_cashback_calculator_2026-05-24.md— sibling substrate-payoff sketch (cash-back vs vehicle cost-per-mile)feedback_one_batch_llm_beats_hand_iteration.md— the LLM-batch-as-substrate-extractor patternfeedback_vendor_lookup_layer2_2026-05-23.md— vendor-name disambiguation via LLM; sibling layer-2 workfeedback_substrate_recategorization_assisted.md— same recipe (rule sweep + LLM residue + human review)pa/_data/vehicles.json— consumer of the mileage datapa/properties/new-hope/expenses/_data/expenses.jsonl— Harvest substrate (5,607 receipts)
Validation run — 2026-05-24 sample results
Ran pa_harvest_receipt_vision_batch.py on the 138-receipt service-shaped subset (vehicle services + handyman invoices, filtered via filename regex). Findings:
| Metric | Value |
|---|---|
| Receipts processed | 138 (120 success · 18 rate-limit errors) |
| Wall time | 229s (~4 min) at 3 workers + preprocessing |
| Cost | $0.38 total · $0.0032/receipt |
| Odometer hit rate | 15 of 120 successful = 12.5% |
| Per-image preprocessing benefit | sips resize+compress: ~50% file size, comparable token reduction |
Reconstructed mileage timeline from receipts alone:
| Vehicle | Span | Pace (receipt-derived) | Cross-check |
|---|---|---|---|
| LR4 | 23,770mi (Dec 2016) → 112,962mi (Jan 2025) | 11,069 mi/yr | ✓ ~125k Dan estimate now |
| BMW | 67,428mi (Apr 2024) → 102,624mi (Feb 2026) | 19,130 mi/yr | ✓ “driven hard” matches |
| F-250 | 128,373mi (Oct 2020) → 128,986mi (Feb 2023) | 267 mi/yr | ✓ effectively parked |
Data-quality issues to address before charting:
- ~25% of odometer captures pulled the wrong number (Costco gas transaction code, Amex statement reference, $0 line items where “mileage” was a quote not a reading). Filter rule: require co-occurring service-vendor name + non-zero total.
- 4 receipts had vehicle=”?” — filename lacks vehicle keyword. Re-attribute via Harvest expense ID join.
- Vendor names sometimes drift between OCR runs (A&C Customs vs A&G Customs). Could canonicalize via per-vendor fuzzy-match pass.
- Rate-limit errors recover cleanly on
--forcererun.
Two-layer architecture validated by Dan 2026-05-24
Dan: “OCR gets you 70% of the way, with JSON. Then level 2 clean-up gets you to 95%.”
The clean substrate that enables cross-vendor / cross-time product-level queries (“coffee in 2020 vs 2025 at Giant vs Trader Joe’s“) needs both layers:
| Layer | What it does | Cost (full Harvest corpus) |
|---|---|---|
| L1 — Vision OCR | Receipt image → {vendor, date, line_items[{description, amount}], total, ...}. Captures the raw printed text. |
~$14 batches / $28 standard |
| L2 — Product normalization | "MAXWELL HSE INST CFE 8OZ" → {category: "Coffee", brand: "Maxwell House", type: "Instant", size_oz: 8, unit_price_oz: 1.12}. Enables clean joins across receipts. |
~$10-20 batches |
| L3 — Query/dashboard | Aggregations, comparisons, charts | $0 |
| Total for clean product substrate | ~$25-35 |
Without L2, line-item search is grep-and-hope across receipt-printer abbreviations. With L2, it’s a real database.
Reusable pipeline-pattern emerging — vision-OCR ingest
This is the second “vision-LLM-as-substrate-extractor” pattern instance (Amazon emails were the first via Chrome screenshot → email-evidence). Generalizes to:
| Domain | Source images | What gets extracted |
|---|---|---|
| Vehicle service | Garage receipts, hand-written invoices | Odometer, line items, fluids, wear |
| Insurance claims | Inspector reports, contractor estimates | Items, replacement costs, condition |
| Property deeds/surveys | Scanned legal PDFs | Plat, dimensions, easements |
| Tax returns / W2s / 1099s | Scanned forms | Line-item amounts, withholdings |
| Watch / camera certificates | Appraisals, warranties | Serial numbers, valuations |
| Bank statements / canceled checks | PDF / image statements | Transaction lines |
| Land registry | Old paper records | Parcel IDs, owner history |
Worth a sibling memory: feedback_vision_ocr_ingest_pipeline.md capturing the generic pattern (filter by filename → preprocess via sips → vision-extract → JSON sidecar → optional L2 normalize → query substrate).
Primary target — audreyinc datasets (not pa vehicle receipts)
Dan 2026-05-24 follow-up: “Mostly, it’s for audreyinc - processing her datasets”
The vehicle-receipt sample run was the validation case. The real build is processing Audrey’s business data substrate. Vehicle reactivation only happens if a specific pa-side question lands; the bigger ROI is audreyinc-side.
Likely audreyinc image substrates worth processing (pending Dan/Audrey-confirmed inventory):
| Substrate | What L1 extracts | What L2 normalizes | Downstream queries |
|---|---|---|---|
| Vendor / supplier invoices | vendor, date, line items, totals | canonical supplier names, fabric/material/component categories, unit costs | “Cost per yard of cotton across mills · year over year” |
| Trade-show / wholesale order forms | customer, date, SKUs, quantities | canonical customer names + customer territory + product category | “Boutiques in NY ordering X over time” |
| Production / pattern notes (handwritten) | style ID, sizing, fit notes, sample iterations | canonical style code, season, collection | “All fit revisions for style #423” |
| Customer / press contact cards (trade shows) | name, brand, role, contact | merged-into-CRM canonical contacts | “Press contacts from Cologne 2024 we haven’t followed up with” |
| Wholesale / consignment statements | retailer, period, units sold, payout | retailer normalize + product joins | “Top retailers by units · sell-through rates” |
| Returns / RMA notes (handwritten) | order#, reason, condition | canonical reason categories | “Top return reasons by style · year” |
| Photoshoot logs | shoot date, photographer, location, garments | canonical garment SKUs | “Which styles have lookbook coverage” |
| Archived paper sales records pre-Shopify | every order line | normalized SKUs | “Pre-2015 catalog reconstruction for archive” |
This connects to existing audreyinc work in memory:
parked_sketch_shopify_full_lifecycle_2026-05-23.md— Shopify side of the audreyinc surface; vision-OCR feeds the pre/parallel-Shopify paper trail into the same substratefeedback_audrey_eras_few_clicks_deep.md— archive eras snapshot project; OCR of physical archive feeds the same era-by-era narrativefeedback_email_ingest_pattern_extends_to_ebay.md— sibling extraction pattern for digital substrates; vision-OCR is the analog-document counterpart
Vehicle receipts — validation scope only
The 138-receipt run above was the proof-of-mechanism. The actual vehicle-side work (cleanup, charting on pa.gf.cx/vehicles/) only proceeds if a specific pa question lands:
- Selling a vehicle (need maintenance history packet)
- Insurance review window
- “What did the LR4 cost per mile in 2024?” or sibling question
- Kitchen-reno / claim-packet build needs the receipt substrate
Resume / build trigger
Status: PARKED post-validation 2026-05-24. Sample run proved the mechanism works at acceptable quality and cost on vehicle receipts. Wait condition before unparking:
- audreyinc data substrates inventoried — Dan/Audrey identify which boxes of paper/scanned material are worth processing first
- A specific audreyinc question that needs the substrate (“trade-show contacts from 2024 we never followed up with”, “wholesale sell-through by retailer”, “vendor cost per fabric category”)
- pa-side trigger (insurance / sell / claim) as above
When unparked for audreyinc: scope and cost depend on inventory size + which substrates (vendor invoices alone is a different size than full archive). The pipeline pattern (filter → preprocess → vision L1 → optional L2 → query) is the same; the per-substrate cost scales with image count.
The aphorism
The data isn’t missing. It’s just inside the image. Vision-LLM is the can-opener. The second pass is the categorizer that turns the can’s contents into ingredients.