Vision-LLM OCR over Harvest receipt scans — unlock mileage-over-time + service substrate (parked 2026-05-24)

DARE.CO.UK · PARKED SKETCH · 2026-05-26

Mirrored from ~/.claude/.../memory/parked_sketch_receipt_vision_ocr_vehicle_history_2026-05-24.md. This is a design sketch parked for future build — read for context, not as a current deliverable.

Harvest “notes” fields are summary-only (“LR4”, “Replacement Wing Mirror LR4 via eBay”); the actual maintenance data (odometer, line items, fluids, wear measurements) lives INSIDE the receipt scan images. A vision-LLM batch pass over the 5,607 Harvest receipts costs ~$4 and reconstructs the entire vehicle-service substrate as structured JSON. First consumer: real mileage-over-time charts per vehicle. Second: cost-per-mile by window, service-interval drift detection, parts catalog by model.

Dan 2026-05-24: “the real catch, is extracting from Harvest scans, data inside receipts, that holds the key the milage-over-time unlock”

The problem this solves

Asked LR4 for its mileage. Harvest has 145 LR4-tagged receipts. Grep returned 0 with mileage readings — because the notes field is a 2-word summary, not the receipt content. The odometer is inscribed on the service invoice image, never typed into Harvest’s note field.

Result: today we ask Dan (“125,000ish? Actually 129,050. Wait, that’s the F-250.”), the truth lives inside images, and we can’t run any time-series analysis (cost-per-mile, service-interval drift, mileage acceleration) until the substrate moves from images to JSON.

The mechanism

Vision-LLM pass over receipt JPG/PDF scans. Per receipt, extract structured JSON:

{
  "vendor": "A&G Customs",
  "service_date": "2023-02-14",
  "odometer_mi": 128986,
  "vehicle_hint": "F-250",          // VIN snippet or plate or written model
  "line_items": [
    {"description": "Exhaust manifolds replacement", "amount": 1240.00},
    {"description": "Tie rod replacement", "amount": 420.50},
    {"description": "Brake fluid flush", "amount": 89.00},
    {"description": "Front shocks (pair)", "amount": 285.00},
    {"description": "2x Hankook Dynapro AT2 RF11 tires", "amount": 412.00},
    {"description": "PA state inspection", "amount": 42.00}
  ],
  "total": 3228.15,
  "fluids_noted": ["brake fluid flush"],
  "wear_measurements": {"front_brakes_mm": null, "tire_tread_mm": null},
  "next_service_due": null,
  "tech_notes": "spark plugs original — danger zone above 120k"
}

Cost (Haiku 4.5 Batches with vision)

Scope	Receipts	Cost
LR4 only	145	~$0.10
All 3 vehicles	~400	~$0.30
Full Harvest receipt corpus	5,607	~$4

Per receipt: ~500-1500 input tokens (image) + ~300 output tokens. At Haiku Batches rates (no caching applicable per feedback_micro_gestures_pattern.md cost analysis) ≈ $0.0007 per receipt.

Handwriting is included — same model

Dan 2026-05-24 follow-up: “Add to the list, hand-written invoices from Jorge - will need a hand-writing-OCR-capability to infer and capture”

No special handwriting-OCR tooling needed. Haiku 4.5 vision handles printed receipts and handwritten invoices through the same API call. Legibility is the only constraint — clearly-written contractor receipts (Jorge garden work, Jose electrics, A&G Customs hand-noted services) extract just as well as printed POS receipts. Where handwriting is borderline, the prompt can ask the model to flag low-confidence fields.

What the substrate unlocks (downstream)

Mileage-over-time chart per vehicle — actual odometer history with service-date x-axis. Not Dan-estimates.
Real miles/year by year — was the BMW driven 17k in 2024 vs 8k in 2025? Now answerable.
Cost-per-mile by window — service-cost density when accumulating miles vs parked.
Service-interval drift detection — “last oil change 8k mi ago vs 5k recommended” alerts.
Parts catalog per vehicle — which model has eaten the most brake pads / spark plugs / tires.
Replacement-anniversary tracking — “tires due Feb 2027 based on 2023 install + 50k mi life”.
Pre-2017 mileage gaps — even one old service receipt with odometer pegs the timeline.
Future insurance/sale ammo — full service history reproducible from substrate for sale listing or claim packet.

Why now isn’t urgent (but soon is right)

Wait condition: nothing depends on the vehicle history TODAY. But the moment any of these surfaces: - Selling a vehicle (need maintenance history packet) - Insurance review (need service substantiation) - Deciding LR4 keep-vs-sell (cost-per-mile vs replacement) - BMW high-mileage service planning (when’s the next big spend predicted?)

…this becomes the load-bearing data. Build before the question is asked, not after.

Build approach

Step	Effort	Notes
Inventory receipt images in Harvest substrate	15 min	Find paths, count, sample image quality
Write `pa_harvest_receipts_vision_extract.py`	1 hr	Anthropic Batches API + vision; ~150 lines
Run batch on 3-vehicle subset first	20 min wall-time	~$0.30, validate JSON quality
Land per-receipt JSON sidecar at `pa/properties/.../receipts/<id>.json`	30 min	Idempotent, re-runnable
Update vehicle pages to render mileage-over-time chart	45 min	SVG line chart; reuse existing chart primitives
Full-corpus run + dashboards (savings, cost-per-mile, drift alerts)	2 hrs	Additional surfaces
Subset for vehicles only	~3 hrs	LR4 + BMW + F-250 mileage history charts live
Full vehicle service substrate + downstream surfaces	~5-6 hrs	Everything above

Build trigger

Build when ANY: - Vehicle sale/keep decision is on the table - Insurance review window - Dan asks “what did the LR4 cost per mile in 2024?” - After the unified-toolkit-scaffold build completes (sibling priority)

Cross-references

parked_sketch_amazon_5pct_cashback_calculator_2026-05-24.md — sibling substrate-payoff sketch (cash-back vs vehicle cost-per-mile)
feedback_one_batch_llm_beats_hand_iteration.md — the LLM-batch-as-substrate-extractor pattern
feedback_vendor_lookup_layer2_2026-05-23.md — vendor-name disambiguation via LLM; sibling layer-2 work
feedback_substrate_recategorization_assisted.md — same recipe (rule sweep + LLM residue + human review)
pa/_data/vehicles.json — consumer of the mileage data
pa/properties/new-hope/expenses/_data/expenses.jsonl — Harvest substrate (5,607 receipts)

Validation run — 2026-05-24 sample results

Ran pa_harvest_receipt_vision_batch.py on the 138-receipt service-shaped subset (vehicle services + handyman invoices, filtered via filename regex). Findings:

Metric	Value
Receipts processed	138 (120 success · 18 rate-limit errors)
Wall time	229s (~4 min) at 3 workers + preprocessing
Cost	$0.38 total · $0.0032/receipt
Odometer hit rate	15 of 120 successful = 12.5%
Per-image preprocessing benefit	sips resize+compress: ~50% file size, comparable token reduction

Reconstructed mileage timeline from receipts alone:

Vehicle	Span	Pace (receipt-derived)	Cross-check
LR4	23,770mi (Dec 2016) → 112,962mi (Jan 2025)	11,069 mi/yr	✓ ~125k Dan estimate now
BMW	67,428mi (Apr 2024) → 102,624mi (Feb 2026)	19,130 mi/yr	✓ “driven hard” matches
F-250	128,373mi (Oct 2020) → 128,986mi (Feb 2023)	267 mi/yr	✓ effectively parked

Data-quality issues to address before charting:

~25% of odometer captures pulled the wrong number (Costco gas transaction code, Amex statement reference, $0 line items where “mileage” was a quote not a reading). Filter rule: require co-occurring service-vendor name + non-zero total.
4 receipts had vehicle=”?” — filename lacks vehicle keyword. Re-attribute via Harvest expense ID join.
Vendor names sometimes drift between OCR runs (A&C Customs vs A&G Customs). Could canonicalize via per-vendor fuzzy-match pass.
Rate-limit errors recover cleanly on --force rerun.

Two-layer architecture validated by Dan 2026-05-24

Dan: “OCR gets you 70% of the way, with JSON. Then level 2 clean-up gets you to 95%.”

The clean substrate that enables cross-vendor / cross-time product-level queries (“coffee in 2020 vs 2025 at Giant vs Trader Joe’s“) needs both layers:

Layer	What it does	Cost (full Harvest corpus)
L1 — Vision OCR	Receipt image → `{vendor, date, line_items[{description, amount}], total, ...}`. Captures the raw printed text.	~$14 batches / $28 standard
L2 — Product normalization	`"MAXWELL HSE INST CFE 8OZ"` → `{category: "Coffee", brand: "Maxwell House", type: "Instant", size_oz: 8, unit_price_oz: 1.12}`. Enables clean joins across receipts.	~$10-20 batches
L3 — Query/dashboard	Aggregations, comparisons, charts	$0
Total for clean product substrate		~$25-35

Without L2, line-item search is grep-and-hope across receipt-printer abbreviations. With L2, it’s a real database.

Reusable pipeline-pattern emerging — vision-OCR ingest

This is the second “vision-LLM-as-substrate-extractor” pattern instance (Amazon emails were the first via Chrome screenshot → email-evidence). Generalizes to:

Domain	Source images	What gets extracted
Vehicle service	Garage receipts, hand-written invoices	Odometer, line items, fluids, wear
Insurance claims	Inspector reports, contractor estimates	Items, replacement costs, condition
Property deeds/surveys	Scanned legal PDFs	Plat, dimensions, easements
Tax returns / W2s / 1099s	Scanned forms	Line-item amounts, withholdings
Watch / camera certificates	Appraisals, warranties	Serial numbers, valuations
Bank statements / canceled checks	PDF / image statements	Transaction lines
Land registry	Old paper records	Parcel IDs, owner history

Worth a sibling memory: feedback_vision_ocr_ingest_pipeline.md capturing the generic pattern (filter by filename → preprocess via sips → vision-extract → JSON sidecar → optional L2 normalize → query substrate).

Primary target — audreyinc datasets (not pa vehicle receipts)

Dan 2026-05-24 follow-up: “Mostly, it’s for audreyinc - processing her datasets”

The vehicle-receipt sample run was the validation case. The real build is processing Audrey’s business data substrate. Vehicle reactivation only happens if a specific pa-side question lands; the bigger ROI is audreyinc-side.

Likely audreyinc image substrates worth processing (pending Dan/Audrey-confirmed inventory):

Substrate	What L1 extracts	What L2 normalizes	Downstream queries
Vendor / supplier invoices	vendor, date, line items, totals	canonical supplier names, fabric/material/component categories, unit costs	“Cost per yard of cotton across mills · year over year”
Trade-show / wholesale order forms	customer, date, SKUs, quantities	canonical customer names + customer territory + product category	“Boutiques in NY ordering X over time”
Production / pattern notes (handwritten)	style ID, sizing, fit notes, sample iterations	canonical style code, season, collection	“All fit revisions for style #423”
Customer / press contact cards (trade shows)	name, brand, role, contact	merged-into-CRM canonical contacts	“Press contacts from Cologne 2024 we haven’t followed up with”
Wholesale / consignment statements	retailer, period, units sold, payout	retailer normalize + product joins	“Top retailers by units · sell-through rates”
Returns / RMA notes (handwritten)	order#, reason, condition	canonical reason categories	“Top return reasons by style · year”
Photoshoot logs	shoot date, photographer, location, garments	canonical garment SKUs	“Which styles have lookbook coverage”
Archived paper sales records pre-Shopify	every order line	normalized SKUs	“Pre-2015 catalog reconstruction for archive”

This connects to existing audreyinc work in memory:

parked_sketch_shopify_full_lifecycle_2026-05-23.md — Shopify side of the audreyinc surface; vision-OCR feeds the pre/parallel-Shopify paper trail into the same substrate
feedback_audrey_eras_few_clicks_deep.md — archive eras snapshot project; OCR of physical archive feeds the same era-by-era narrative
feedback_email_ingest_pattern_extends_to_ebay.md — sibling extraction pattern for digital substrates; vision-OCR is the analog-document counterpart

Vehicle receipts — validation scope only

The 138-receipt run above was the proof-of-mechanism. The actual vehicle-side work (cleanup, charting on pa.gf.cx/vehicles/) only proceeds if a specific pa question lands:

Selling a vehicle (need maintenance history packet)
Insurance review window
“What did the LR4 cost per mile in 2024?” or sibling question
Kitchen-reno / claim-packet build needs the receipt substrate

Resume / build trigger

Status: PARKED post-validation 2026-05-24. Sample run proved the mechanism works at acceptable quality and cost on vehicle receipts. Wait condition before unparking:

audreyinc data substrates inventoried — Dan/Audrey identify which boxes of paper/scanned material are worth processing first
A specific audreyinc question that needs the substrate (“trade-show contacts from 2024 we never followed up with”, “wholesale sell-through by retailer”, “vendor cost per fabric category”)
pa-side trigger (insurance / sell / claim) as above

When unparked for audreyinc: scope and cost depend on inventory size + which substrates (vendor invoices alone is a different size than full archive). The pipeline pattern (filter → preprocess → vision L1 → optional L2 → query) is the same; the per-substrate cost scales with image count.

The aphorism

The data isn’t missing. It’s just inside the image. Vision-LLM is the can-opener. The second pass is the categorizer that turns the can’s contents into ingredients.

Source: parked_sketch_receipt_vision_ocr_vehicle_history_2026-05-24.md · Rendered 2026-05-26 17:10