External vendor-lookup as Layer-2 categorization assist (parked 2026-05-23)

DARE.CO.UK · PARKED SKETCH · 2026-05-23

Mirrored from ~/.claude/.../memory/parked_sketch_vendor_lookup_layer2_2026-05-23.md. This is a design sketch parked for future build — read for context, not as a current deliverable.

Rules + regex hit their plateau on ambiguous vendor names (“Heroku”, “Bona”, “HomePod”). Next tier: external API or LLM looks up what the vendor IS before categorization. Same architecture as feedback_two_layer_decision_architecture.md — Layer 1 rules for clean cases, Layer 2 lookup for residue.


Dan 2026-05-23: “external api naming look-up, like Heroku, is always worth assessing”

The problem

Rule-based vendor → category mapping has a ceiling. Examples that defeat regex:

Vendor Naive guess What it actually is
Heroku (no idea) Cloud hosting → Software / Web Services
Bona (no idea) Floor cleaning supplies → Hardware / Consumables
HomePod (no idea) Apple consumer electronics → Computer & Electronics
Hakkapeliitta (no idea) Nokian winter tires → Car Maintenance & Repair

Rules can’t enumerate every product name. LLM/external-API lookup can.

Architecture (Layer-2 lookup)

Layer Source When
1 Regex rules (current) Clean cases: named vendor (Geico, Starbucks), vendor+context combo
2 LLM/API lookup Ambiguous: rule misfired OR low-confidence, vendor name is unknown noun
3 Manual hand-mapping Edge cases the LLM can’t disambiguate

Lookup options (ranked by fit)

Option Cost Fit
Haiku batch over vendors ~$0.001/lookup, $5-10 for full Harvest residue Best fit — fast, generic, no API setup. Prompt: “Vendor=’X’, notes=’Y’, amount=$Z, current category=’C’. From this list of Harvest categories: […], which fits best? Reply: category
Wikidata / DBpedia Free, free-form text query Coverage gaps; works for famous brands, fails for SaaS / regional vendors
Clearbit / FullContact $99-299/mo Overkill for personal use; better for sales prospecting
Brand database (private) varies Not generally available

Recipe for v1 (when triggered)

# pa_harvest_recategorize_llm_layer2.py
# For every expense without a rule-based proposal (or low-confidence):
#   1. Build prompt: vendor + notes + amount + current category + candidate list
#   2. Batch call to Haiku (cheap, fast)
#   3. Parse: {expense_id, llm_category, llm_confidence, llm_reasoning}
#   4. Append to recategorization_proposal.{md,jsonl}
#   5. Operator review (same UI)

Estimated cost on full Harvest substrate residue (~5,500 expenses minus rule-handled ~80): - ~5,400 LLM calls × $0.001 = ~$5 one-time - ~30 min wall-clock with reasonable batching (50/request)

When to build

Cross-references

The aphorism

Rules know syntax. LLMs know what things mean. Spend rules on shape; spend LLM on meaning.

Source: parked_sketch_vendor_lookup_layer2_2026-05-23.md · Rendered 2026-05-23 23:14