Evernote one-shot ingester → pa substrate + R2 (parked sketch 2026-05-22, REVISED 2026-05-23)

DARE.CO.UK · PARKED SKETCH · 2026-05-22 · REVISED 2026-05-23

Dan has used Evernote since 2009. ~24,000 notes. Heavy video. ~100GB+ total estimated. One-shot ingest into pa/_substrate/evernote/ with markdown + metadata in git, attachments + videos streamed direct-to-R2 with manifest pointers. Same Haiku Layer-2 enrichment downstream as the Amazon archive.


🚨 REVISED 2026-05-23 — API path closed, ENEX path is the way in

Evernote Support 2026-05-23: “At this time, we’re no longer issuing new Evernote API keys. However, we’re actively working on a new MCP (Model Context Protocol) integration…”

Dan applied for an API key 2026-05-22; the response confirmed Evernote has frozen new key issuance entirely. Existing token holders keep working; new applicants get nothing. The official Evernote MCP is in development with no ETA.

This kills the API-based ingester sketched in the original draft. The probe (~/bin/pa_evernote_probe.py) and the API-calling export script are parked indefinitely. Community MCPs (SqREL, verygoodplugins) all rely on tokens Evernote no longer issues — same dead end for new users.

New path forward — ENEX export from the desktop app. ENEX is Evernote’s native XML export format. Right-click notebook → Export Notebook → ENEX. Each notebook = one .enex file. Resources (attachments, images, videos) are base64-embedded inline. Encrypted blocks preserved verbatim. Manual click per notebook (~30 clicks total for a typical mature account).
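The ENEX layout described above can be walked with the standard library alone. The following is a minimal sketch, not the planned ingester: the function name `iter_enex_notes` and the output dict keys are hypothetical, and it holds one note's decoded resources in memory at a time, which may be too much for the largest videos. It assumes the usual ENEX element names (`note`, `title`, `content`, `created`, `resource`, `data`, `mime`, `resource-attributes/file-name`). The MD5 is computed because `<en-media hash="...">` references in the note body use the MD5 of the resource bytes; the sha1 matches the manifest field used elsewhere in this sketch.

```python
import base64
import hashlib
import io
import xml.etree.ElementTree as ET


def iter_enex_notes(source):
    """Yield one dict per <note> in an ENEX file (path or binary file object)."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "note":
            continue
        resources = []
        for res in elem.findall("resource"):
            # <data encoding="base64"> is usually line-wrapped; drop whitespace.
            b64 = "".join((res.findtext("data") or "").split())
            raw = base64.b64decode(b64)
            resources.append({
                "mime": res.findtext("mime"),
                "filename": res.findtext("resource-attributes/file-name"),
                "bytes": len(raw),
                "md5": hashlib.md5(raw).hexdigest(),    # matches <en-media hash="...">
                "sha1": hashlib.sha1(raw).hexdigest(),  # manifest key in this sketch
                "data": raw,
            })
        yield {
            "title": elem.findtext("title"),
            "created": elem.findtext("created"),
            "enml": elem.findtext("content"),  # CDATA body, still ENML
            "resources": resources,
        }
        elem.clear()  # release the base64 payload before parsing the next note
```

Usage would be along the lines of `for note in iter_enex_notes("Recipes.enex"): ...`, one call per exported notebook file.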

What changes vs the original sketch:
- ❌ API auth (Developer Token, 1Password ref): not happening
- ❌ Rate-limit handling: moot (offline parse)
- ❌ Resumability against API state: moot (each ENEX export is a self-contained local file)
- ❌ Note history / listNoteVersions: ENEX is current-state-only; this dimension is lost
- ❌ Linked notebooks / business notebooks (via separate API paths): would require ENEX exports of each notebook owner’s side
- ✅ Everything ELSE stays: R2 streaming, image SEO cleanup, Haiku Layer-2 enrichment, diamond-tier ranking, the entire downstream pipeline
- ✅ Add a one-time pain step: 30-ish manual clicks in the Evernote app to produce ENEX files; drop them in ~/Code/home-projects/pa/evernote/_inbox/; pipeline takes over

Cost picture unchanged: ~$2.25/month R2 storage at 150GB. The one-shot ingest cost is now CPU + bandwidth, no API quota.

Build status 2026-05-23:
- 🚨 ~/bin/pa_evernote_probe.py: DEAD, parked (would need API key)
- 🆕 ~/bin/pa_evernote_enex_ingest.py: NEXT to build (no SDK, just Python stdlib XML)
- ✅ Registered interest on the official Evernote MCP waitlist at evernote.com/mcp; described the 24K-note + R2 + Haiku-enrichment + diamond-ranking use case (Evernote uses these to prioritize MCP features)

If/when the official Evernote MCP ships, we can ADD it as a Layer-3 enrichment channel (real-time read/write via Claude) on top of the ENEX-based substrate; ENEX gives us the bulk historical pull, MCP gives us the live channel. Best of both.


Why park this

We need to ship the Amazon Layer-2 render surfaces first (the data is sitting in layer2_tags.jsonl waiting to be lifted into the dashboard). The Evernote ingester is next in the queue, but it is a 1-3 hour wallclock + multi-GB R2 storage commitment, so it is worth designing properly before running.

Scale (Dan’s actual numbers)

Wallclock + cost estimates

| Job | Without videos | With videos (heavy) |
|---|---|---|
| Metadata-only scan (findNotesMetadata paginated, no body) | ~5 min | ~5 min |
| Full body + small attachments (concurrent=5) | ~2-3 hours | n/a |
| Full body + ALL attachments + videos (concurrent=3, throttled by bandwidth) | n/a | 6-12 hours |
| API call budget | well within 80K/hr | well within 80K/hr |

Bandwidth is the constraint, not the API. Evernote serves resources from their CDN at reasonable speeds, but 100 GB at ~50 Mbps home connection = ~4-5 hours just bytes-down.
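The bandwidth and storage figures quoted in this sketch can be checked with a few lines of arithmetic. The helper names below are hypothetical; the inputs (100 GB, ~50 Mbps, 150 GB, $0.015/GB) are the ones stated in this document, and decimal gigabytes are assumed.

```python
def transfer_hours(gigabytes: float, megabits_per_sec: float) -> float:
    """Hours needed to move `gigabytes` (decimal GB) at a sustained link rate."""
    megabits = gigabytes * 1000 * 8  # GB -> MB -> megabits
    return megabits / megabits_per_sec / 3600


def r2_monthly_usd(gigabytes: float, usd_per_gb_month: float = 0.015) -> float:
    """Monthly storage cost at the per-GB rate quoted in this sketch."""
    return gigabytes * usd_per_gb_month


if __name__ == "__main__":
    print(f"{transfer_hours(100, 50):.1f} h to pull 100 GB at 50 Mbps")
    print(f"${r2_monthly_usd(150):.2f}/month for 150 GB")
```

Real throughput will sit below the nominal link rate, which is why the estimate above is given as a 4-5 hour range rather than a single figure.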

Cost

Architecture

Two-tier write strategy


The .resources.json per note carries the manifest:

{
  "note_guid": "abc123...",
  "resources": [
    {
      "sha1": "deadbeef...",
      "filename": "kitchen-reno-walkthrough.mp4",
      "mime": "video/mp4",
      "bytes": 487234123,
      "r2_url": "https://pa-evernote-substrate.r2.dev/<nb>/<note>/deadbeef.mp4",
      "duration_sec": 187,
      "recognition_text": null,
      "encrypted": false
    },
    ...
  ]
}

Git diff stays tiny + readable; clone size stays manageable; videos remain queryable via the manifest + streamable from R2.
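One way to keep the per-note manifest stable across runs is to build it from a typed record and serialize it with sorted keys. This is a sketch under assumptions: `ResourceEntry` and `render_manifest` are hypothetical names, and the fields simply mirror the example manifest above.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ResourceEntry:
    """One attachment, mirroring the fields in the example manifest."""
    sha1: str
    filename: str
    mime: str
    bytes: int
    r2_url: str
    duration_sec: Optional[int] = None
    recognition_text: Optional[str] = None
    encrypted: bool = False


def render_manifest(note_guid: str, resources: List[ResourceEntry]) -> str:
    """Serialize to JSON with sorted keys so reruns produce minimal git diffs."""
    doc = {"note_guid": note_guid, "resources": [asdict(r) for r in resources]}
    return json.dumps(doc, indent=2, sort_keys=True) + "\n"
```

Sorted keys and a trailing newline are a design choice for readable diffs, not something the manifest format requires.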

Streaming uploads (no local intermediate for big files)

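# Pseudocode: save_to_git, stream_to_r2 and the evernote client are placeholders
import mimetypes

INLINE_LIMIT = 1_000_000                           # <1MB: store inline in git
manifest = []
for resource in note.resources:
    # Derive the extension from the MIME type (assumes resource.mime exists)
    ext = (mimetypes.guess_extension(resource.mime) or ".bin").lstrip(".")
    if resource.size_bytes < INLINE_LIMIT:
        local_path = save_to_git(resource.data)
    else:                                          # large: stream Evernote → R2
        r2_url = stream_to_r2(
            source_stream=evernote.getResourceData(resource.guid),
            bucket="pa-evernote-substrate",
            key=f"{notebook_guid}/{note.guid}/{resource.hash}.{ext}",
        )
        # remaining manifest fields elided, as in the original sketch
        manifest.append({"sha1": resource.hash, "r2_url": r2_url})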

Stream means no 500 MB file ever touches local disk in full — just chunks flowing through. Critical for the bigger videos.
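The chunk-at-a-time idea can be illustrated with plain file-like objects. This is a minimal sketch, not the uploader itself: `stream_copy` is a hypothetical helper, the 8 MiB chunk size is an assumed value, and in a real run the sink would be whatever wraps an S3-compatible multipart upload to R2 (not shown here).

```python
import hashlib
import io

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per read; an assumed value, tune to taste


def stream_copy(source, sink, chunk_size=CHUNK_SIZE):
    """Copy source to sink in fixed-size chunks, hashing as the bytes flow.

    Only one chunk is held in memory at a time. Returns (sha1_hex, total_bytes).
    """
    sha1 = hashlib.sha1()
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        sha1.update(chunk)
        sink.write(chunk)
        total += len(chunk)
    return sha1.hexdigest(), total
```

Hashing during the copy means the manifest's sha1 and byte count come for free, without a second pass over the file.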

Resumability

So a 6-hour run that crashes at hour 4 picks up at hour 4 + 5 minutes on restart.
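The checkpoint mechanism itself is not spelled out in this sketch. One simple way to get the restart behaviour described above, assuming an append-only JSONL checkpoint file, is shown below; `load_done`, `mark_done`, `run_resumable` and the `note_id` field are all hypothetical names.

```python
import json
import os


def load_done(path):
    """Return the set of note ids already recorded in the checkpoint file."""
    done = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                try:
                    done.add(json.loads(line)["note_id"])
                except (json.JSONDecodeError, KeyError):
                    continue  # blank or half-written last line from a crash
    return done


def mark_done(path, note_id):
    """Append one record and force it to disk before moving on."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps({"note_id": note_id}) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def run_resumable(note_ids, path, process):
    """Process each note once; a restart skips everything already recorded."""
    done = load_done(path)
    for note_id in note_ids:
        if note_id in done:
            continue
        process(note_id)
        mark_done(path, note_id)
```

Because a note is only recorded after `process` returns, a crash mid-note means that note is retried on restart, so `process` should be safe to run twice.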

Your 5 questions (locked in from the conversation)

1. How big can it handle?

Tested limits:
- 24K notes / 100+ GB attachments: comfortable with this architecture
- API rate: well under 80K calls/hr budget
- Disk: bounded only by R2 storage (effectively unlimited at $0.015/GB)
- Git repo size: stays under 500 MB even for heavy users since binaries go to R2

2. How quick?

For Dan’s scale (24K notes, heavy video):
- Metadata scan: 5 min
- Full pull: 6-12 hours (bandwidth-bound)
- Run overnight; resumable so a single interrupted run isn’t fatal

3. Will it back up to GitHub?

Yes — markdown + metadata + ENML + manifests all commit to git in pa/_substrate/evernote/. R2 holds the bytes; git holds the index. Same pre-push hook on pa.gf.cx auto-deploys the renderer surfaces.

4. Encrypted blocks (<en-crypt>)

Preserved verbatim. Body markdown shows 🔒 [encrypted block — hint: "<hint>"] as a visual placeholder. Full encrypted blob lands in <note>.encrypted.json for offline batch-decrypt later (you supply the password, never leaves your machine).
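A sketch of how the placeholder swap might work, assuming the note body is ENML text containing `<en-crypt hint="...">ciphertext</en-crypt>` elements. The function name and the blob dict keys are hypothetical; nothing is decrypted here, the ciphertext is only lifted out for the sidecar file.

```python
import re

EN_CRYPT = re.compile(r"<en-crypt\b([^>]*)>(.*?)</en-crypt>", re.DOTALL)
HINT = re.compile(r'hint="([^"]*)"')


def extract_encrypted(enml):
    """Swap each <en-crypt> element for a placeholder; return (body, blobs)."""
    blobs = []

    def _swap(match):
        attrs, payload = match.group(1), match.group(2)
        hint_match = HINT.search(attrs)
        hint = hint_match.group(1) if hint_match else ""
        blobs.append({
            "index": len(blobs),
            "attrs": attrs.strip(),          # cipher, length, hint kept verbatim
            "hint": hint,
            "ciphertext": payload.strip(),   # untouched base64, for later decrypt
        })
        # Same placeholder shape as described above (\u2014 is the dash in it).
        return f'\U0001F512 [encrypted block \u2014 hint: "{hint}"]'

    return EN_CRYPT.sub(_swap, enml), blobs
```

The returned `blobs` list is what would be written to `<note>.encrypted.json`.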

5. Note versions / history

Requires Evernote Personal or Professional tier. If you have it:
- For each note: listNoteVersions(guid) → 5-50 historical revisions (capped by Evernote’s retention window)
- Each version: getNoteVersion(guid, usn) → full snapshot at that time
- Lands as versions/v_<usn>_<timestamp>.md per note
- Resources at each version also pulled; Evernote keeps the binaries for the retention window

Crucial: pulling NOW locks in every historical revision Evernote currently retains. Subsequent edits will produce NEW versions over time, but you’ve captured a complete snapshot of 17 years’ editing history.

If on free tier: only current version is available.

Discovery + opt-in scopes

pa_evernote_export.py --scope sizes              # 5-min metadata scan, prints what we'd pull
pa_evernote_export.py --scope notebooks          # comma-separated list of notebooks
pa_evernote_export.py --scope all                # everything (default)
pa_evernote_export.py --include-trash
pa_evernote_export.py --include-linked           # notebooks shared TO you
pa_evernote_export.py --include-business         # business-account notebooks
pa_evernote_export.py --skip-attachments         # markdown-only mode (~30 min for 24K)
pa_evernote_export.py --skip-versions            # current note only, no history
pa_evernote_export.py --resume                   # default behavior; pick up where left off
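The flag surface above maps naturally onto `argparse`. This is one possible reading, not the script's actual interface: it assumes `--scope` takes either `sizes`, `all`, or a comma-separated list of notebook names, and `build_parser` / `parse_scope` are hypothetical helper names.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="pa_evernote_export.py")
    parser.add_argument(
        "--scope", default="all",
        help="'sizes', 'all', or a comma-separated list of notebooks",
    )
    for flag in ("--include-trash", "--include-linked", "--include-business",
                 "--skip-attachments", "--skip-versions"):
        parser.add_argument(flag, action="store_true")
    # Resuming is the default; the flag is accepted so it can be stated explicitly.
    parser.add_argument("--resume", action="store_true", default=True)
    return parser


def parse_scope(value):
    """Return (mode, notebooks) where mode is 'sizes', 'all', or 'notebooks'."""
    if value in ("sizes", "all"):
        return value, []
    names = [name.strip() for name in value.split(",") if name.strip()]
    return "notebooks", names
```

For example, `build_parser().parse_args(["--scope", "Recipes,Travel", "--skip-versions"])` would select two notebooks and skip history.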

Auth + secrets

Image SEO + discoverability pipeline (critical sub-system)

Raw Evernote images dumped as-is would be unsearchable + unrenderable: IMG_2783.heic, no alt text, 5MB iPhone-resolution, HEIC format browsers don’t natively support. Without cleanup, the substrate is opaque.

The cleanup pipeline runs per-image, during the R2 upload streaming:

  1. Format normalization — HEIC/HEIF/TIFF/PNG-with-alpha → JPEG (or WebP where transparency must be preserved). Screenshots and diagrams stay lossless (PNG kept); photos are re-encoded lossy at quality 85.

  2. Resize — max 2000px on long edge, preserve aspect. Original kept in R2 under <sha1>.original.<ext> for archival; web-optimized version becomes the canonical <sha1>.<ext>. Two URLs per image, manifest carries both.

  3. EXIF handling — strip camera serial + lens metadata (potential PII); preserve DateTimeOriginal, GPSLatitude, GPSLongitude, Orientation. The orientation tag drives correct display per the banked feedback_iphone_photo_orientation_handling.md rule (never auto-rotate).

  4. SEO filename: the sha1 hash is fine for cache integrity but useless for discovery. Generate a human-readable slug:

     IMG_2783.heic → 2018-03-15_kitchen-renovation_marble-countertop-detail.jpg

     Here 2018-03-15 is the date from EXIF, kitchen-renovation is the note slug, and marble-countertop-detail comes from the Haiku vision caption. The sha1 lives in the manifest; the human filename is the primary R2 key.

  5. Alt-text generation via Haiku vision: one call per image, with the prompt:

     "Generate a concise descriptive alt-text (max 12 words) suitable for SEO + accessibility. Plus 3-5 keyword tags. Plus a one-sentence caption."

     Output saved to manifest:

     {
       "alt": "Marble kitchen countertop with veining detail, overhead shot",
       "tags": ["kitchen", "marble", "countertop", "renovation", "interior-design"],
       "caption": "Close-up of the marble countertop installed during the 2018 kitchen renovation.",
       "haiku_vision_cost_usd": 0.0008
     }

     Cost: ~$0.001/image × est. 10K images in 24K notes = ~$10 one-shot. Worth it.

  6. Per-note context injection — when generating alt-text, also feed Haiku the surrounding note body so the description reflects context. “Marble countertop” inside a “Kitchen renovation 2018” note vs inside a “Stone supplier samples” note gets meaningfully different captions.

  7. OCR for text-heavy images — if Haiku-vision detects substantial text in the image (receipts, screenshots of articles, whiteboard photos), trigger a follow-up OCR pass via Haiku’s built-in vision OCR. Text goes into the manifest as recognition_text (Evernote calls these “Resource Recognition Indices” and provides its own OCR — we should preserve that too as evernote_recognition_text for comparison).

  8. Web-renderable references: in the converted markdown note body, replace Evernote’s <en-media hash="..."> with:

     ![Marble kitchen countertop with veining detail, overhead shot](https://pa-evernote-substrate.r2.dev/2018-03-15_kitchen-renovation_marble-countertop-detail.jpg)
     *Close-up of the marble countertop installed during the 2018 kitchen renovation.*

     That renders cleanly on pa.gf.cx, is indexable by Google, and the alt-text is meaningful.
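Steps 4 and 8 of the list above can be sketched together: build the date + note slug + caption slug filename, then swap each `<en-media>` tag for a markdown image. This is an illustration under assumptions: the helper names, the six-word slug cap, and the shape of the `images_by_hash` lookup (keyed by the `hash` attribute, with `alt`, `url` and `caption` entries) are all hypothetical.

```python
import re
import unicodedata


def slugify(text, max_words=6):
    """Lowercase ASCII words joined by hyphens, capped at max_words."""
    ascii_text = (unicodedata.normalize("NFKD", text)
                  .encode("ascii", "ignore").decode("ascii"))
    words = re.findall(r"[a-z0-9]+", ascii_text.lower())
    return "-".join(words[:max_words])


def seo_filename(date_iso, note_title, caption, ext="jpg"):
    """<EXIF date>_<note slug>_<caption slug>.<ext>"""
    return f"{date_iso}_{slugify(note_title)}_{slugify(caption)}.{ext}"


EN_MEDIA = re.compile(
    r'<en-media\b[^>]*\bhash="([0-9a-f]+)"[^>]*?(?:/>|>\s*</en-media>)'
)


def replace_en_media(enml, images_by_hash):
    """Replace known <en-media> tags with a markdown image plus italic caption."""
    def _swap(match):
        info = images_by_hash.get(match.group(1))
        if info is None:
            return match.group(0)  # unknown resource: leave the tag untouched
        return f'![{info["alt"]}]({info["url"]})\n*{info["caption"]}*'

    return EN_MEDIA.sub(_swap, enml)
```

Alt text containing `]` would need escaping before being placed in the markdown image syntax; that is left out of this sketch.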

Videos get a parallel pipeline

Total image-cleanup cost estimate

Assuming ~10K images + ~500 videos in Dan’s 24K-note archive:
- Image vision: 10,000 × $0.001 = ~$10
- Video thumbnails: 500 × $0.001 = ~$0.50
- Video transcription (optional): 500 × avg 3 min × $0.05/hr = ~$1.25
- ffmpeg + resize processing: free, local CPU
- Total Layer-2 image enrichment: ~$12 one-shot
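The estimate above is simple enough to recompute whenever the image or video counts change. The function name is hypothetical; the default values are the unit prices and counts quoted in this section.

```python
def enrichment_cost_usd(images=10_000, videos=500, per_image=0.001,
                        per_thumbnail=0.001, avg_video_min=3,
                        transcription_per_hour=0.05):
    """Break down the one-shot image/video enrichment cost in USD."""
    image_vision = images * per_image
    thumbnails = videos * per_thumbnail
    # Video minutes are converted to hours before applying the hourly rate.
    transcription = videos * avg_video_min / 60 * transcription_per_hour
    return {
        "image_vision": image_vision,
        "thumbnails": thumbnails,
        "transcription": transcription,
        "total": image_vision + thumbnails + transcription,
    }
```

With the defaults this comes to $11.75, which the estimate above rounds to ~$12.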

Stacked on the $5.50 Amazon Layer-2 cost from today: still cheap relative to the time it would take to do this manually for 10K images.

Why this is non-negotiable

Without this cleanup:
- Browser can’t render HEIC → broken images all over the renderer
- IMG_2783.jpg filename gives Google nothing → zero discoverability
- 5MB images × 10K = 50GB egress every time someone browses the archive
- No alt-text → fails accessibility + Google’s image search
- No tags → the Haiku Layer-2 enrichment on the NOTES has nothing to bind images to

With it: every image is web-renderable, SEO-discoverable, AI-described, and connectable to the surrounding note’s semantic enrichment. The substrate becomes genuinely useful instead of just “preserved.”

Finding the 20% diamonds in 80% noise (the value-extraction layer)

24K notes over 17 years means most are scraps: half-typed thoughts, abandoned drafts, single-line URLs, screenshots whose context is gone. The ingester preserves everything, but the renderer must SURFACE the diamonds and DEMOTE the noise — not delete (preservation is the whole point), just rank.

This isn’t a deletion problem. It’s a curation + ranking problem.

Per-note value score (computed during Layer-2 enrichment)

One Haiku call per note returns the standard schema PLUS a value_signal block:

"value_signal": {
  "score": 0.78,                     // 0.0-1.0 composite
  "factors": {
    "depth": 0.85,                   // word count + structural richness (lists, headings)
    "uniqueness": 0.90,              // not a near-duplicate of another note
    "recency": 0.20,                 // last_updated decay (recent = higher)
    "reference_quality": 0.80,       // links to external sources / quotes
    "personal_voice": 0.95,          // Dan's writing vs pasted content vs auto-import
    "actionability": 0.60,           // has todos, reminders, deadlines, decisions
    "evergreen": 0.90                // would this still be useful in 5 years? (recipes/refs HIGH)
  },
  "category_signal": "reference",    // reference | journal | scrap | inbox | meeting | recipe | etc.
  "diamond_reason": "Long-form first-person travel notes with itinerary, costs, and contact details for stays — actionable + irreplaceable + dated."
}

The diamond_reason is the killer field — it’s the LLM’s explanation of WHY this note is valuable, which Dan can eyeball-validate at scale.

Tier classification (derived from score)

| Tier | Score band | Treatment |
|---|---|---|
| 💎 Diamond (top 5%) | 0.85+ | Hero surfacing: featured on landing, included in digests, never archived |
| ⭐ Gold (next 15%) | 0.65-0.85 | Default browse view; full search indexing; surfaced in topic landings |
| 📋 Reference (next 30%) | 0.40-0.65 | Searchable + indexed but not surfaced unless searched |
| 📦 Archive (next 40%) | 0.20-0.40 | Hidden from default views; full-text searchable; available via “show all” |
| 🗑 Noise (bottom 10%) | <0.20 | Preserved in R2 + git but excluded from rendered surfaces by default |

Nothing is deleted. Everything is searchable. But the DEFAULT VIEW shows the diamonds + gold + indexed-by-search. The noise becomes findable when you specifically look for it.
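The score bands map to tiers with a simple threshold lookup. This is a sketch with hypothetical names; it assumes each band's lower bound is inclusive (so exactly 0.85 counts as diamond), which the table leaves unspecified.

```python
# (inclusive lower bound, tier name), highest band first
TIERS = [
    (0.85, "diamond"),
    (0.65, "gold"),
    (0.40, "reference"),
    (0.20, "archive"),
]

DEFAULT_VIEW = {"diamond", "gold"}


def tier_for(score):
    """Map a 0.0-1.0 value score onto a tier name."""
    for floor, name in TIERS:
        if score >= floor:
            return name
    return "noise"


def in_default_view(score):
    """Only diamond and gold appear in browse views; the rest stay searchable."""
    return tier_for(score) in DEFAULT_VIEW
```

The percentages in the tier names are targets for the distribution, so the thresholds may need recalibrating once real scores exist.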

Discovery surfaces (where diamonds get re-found)

  1. /evernote/diamonds/ — landing page of the top 5% across the entire archive. Sorted by value_signal.score descending. Each card shows: thumbnail, title, year, diamond_reason quote, category tag.

  2. /evernote/diamonds/<year>/ — yearly diamond rollup. “The best 50 things you wrote in 2014.” Becomes a retrospective surface.

  3. /evernote/diamonds/by-category/recipes/ — diamonds filtered by category. Recipes you cooked + loved bubble up; one-off “should I try this?” links sink.

  4. /evernote/timeline/ — 17-year chronological scroll, but only diamond + gold tiers shown by default. Toggle reveals lower tiers.

  5. /evernote/_resurface/ — weekly cron picks 10 random notes from the archive tier, scores them against current context, and surfaces them on a “rediscovered” page. Sometimes the noise tier hides a forgotten gem; rotating surfacing catches it.

  6. /evernote/_diagnostics/value-distribution.html — histogram of value scores + per-tier counts + diamond-reasons listed. Lets Dan calibrate: “Wait, why did THIS get diamond and THAT didn’t?” — feeds the next scoring iteration.

Workflow: Dan-validated diamond curation

For diamonds specifically, add a lightweight confirmation UI on each note page:

💎 This note scored 0.87.  Diamond reason:
   "Long-form first-person travel notes with itinerary, costs, contact 
    details for stays — actionable + irreplaceable + dated."

   [✓ Confirm diamond]   [⬇ Demote to gold]   [⬇ Demote to archive]

Confirmations write to pa/_substrate/evernote/_curation.jsonl (a simple append-only log). The renderer respects manual overrides over the LLM score — so Dan’s call always wins.

This becomes training signal: the next Haiku pass (every 6 months as taxonomy shifts) can read _curation.jsonl and adjust scoring weights to match Dan’s revealed preferences.
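A minimal sketch of the append-only curation log and the "manual override wins" rule. The record fields (`note_guid`, `tier`) and the function names are assumptions for illustration; the real log may carry more, such as a timestamp.

```python
import json


def append_decision(path, note_guid, tier):
    """Append one curation decision; earlier lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps({"note_guid": note_guid, "tier": tier}) + "\n")


def load_overrides(path):
    """Replay the log top to bottom; the latest decision per note wins."""
    overrides = {}
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    record = json.loads(line)
                    overrides[record["note_guid"]] = record["tier"]
    except FileNotFoundError:
        pass  # no curation yet: every note keeps its LLM tier
    return overrides


def effective_tier(note_guid, llm_tier, overrides):
    """A manual decision always beats the LLM-derived tier."""
    return overrides.get(note_guid, llm_tier)
```

Keeping the log append-only means a changed mind is just another line, and the full decision history stays available as training signal.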

Cross-substrate boosts (the join that finds hidden value)

Notes don’t live alone — they reference brands, places, people, vehicles, claims. Compute a “cross-substrate density” score:

A note that ties Amazon-spend to vehicle-ownership to a specific date to a photo on the property — that’s a 5-axis crossroads, definitely diamond.

Duplicate / near-duplicate detection

Cluster notes by:
- Title similarity (Jaccard on tokens)
- Body MinHash / SimHash for near-duplicates
- Same attachment sha1 (often a screenshot saved into multiple notes)

Within a cluster, keep the highest-scoring as canonical; mark the rest as dup_of: <canonical_guid>. The duplicates don’t disappear — they’re just collapsed in the default view with a “+3 similar” pill.
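The title-similarity signal and the "highest score is canonical" rule can be sketched as a greedy pass. This covers only the Jaccard signal, not MinHash or shared attachments; the 0.8 threshold, the function names, and the note dict keys (`guid`, `title`, `score`) are assumptions. The pairwise comparison is quadratic, so a 24K-note run would likely want blocking or MinHash first.

```python
import re


def title_tokens(title):
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def mark_duplicates(notes, threshold=0.8):
    """Return {guid: canonical_guid} for notes collapsed into a cluster.

    Notes are visited from highest score to lowest, so the first note seen
    in any cluster is its canonical member.
    """
    ranked = sorted(notes, key=lambda n: n["score"], reverse=True)
    canonicals = []  # list of (guid, token set)
    dup_of = {}
    for note in ranked:
        toks = title_tokens(note["title"])
        for canon_guid, canon_toks in canonicals:
            if jaccard(toks, canon_toks) >= threshold:
                dup_of[note["guid"]] = canon_guid
                break
        else:
            canonicals.append((note["guid"], toks))
    return dup_of
```

Returning 0.0 for empty token sets keeps untitled notes from all collapsing into one cluster.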

Junk-suspect signals (auto-demote)

The 10% noise tier accumulates notes matching ANY of:
- Body is just a single URL (no commentary); these are bookmarks, so pivot to a “Bookmarks” surface separately
- Title is “Untitled” + body is <20 words + no attachments
- Body is a clearly-auto-imported web clip with no edits since clipping
- Note hasn’t been opened/edited in 7+ years AND has no tags AND no reminders AND no attachments

None deleted; just demoted out of default browse.
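The junk-suspect rules can be expressed as a function that returns the reasons a note was flagged, which keeps the demotion auditable. This is a sketch under assumptions: the note dict keys and reason labels are hypothetical, "Untitled" is read as the note title, "no edits since clipping" is approximated as `updated == created`, and "7+ years" is measured from the last update.

```python
import re
from datetime import datetime, timedelta, timezone

URL_ONLY = re.compile(r"^\s*https?://\S+\s*$")
STALE_AFTER = timedelta(days=7 * 365)


def junk_signals(note, now):
    """Return the list of junk-suspect rules this note matches (may be empty)."""
    reasons = []
    body = note["body"]
    no_attachments = not note["attachments"]
    if URL_ONLY.match(body):
        reasons.append("url_only")
    if (note["title"].strip().lower() == "untitled"
            and len(body.split()) < 20 and no_attachments):
        reasons.append("untitled_stub")
    if note["is_web_clip"] and note["updated"] == note["created"]:
        reasons.append("unedited_web_clip")
    if (now - note["updated"] > STALE_AFTER and not note["tags"]
            and not note["has_reminder"] and no_attachments):
        reasons.append("stale_untagged")
    return reasons
```

Any non-empty result demotes the note; an empty list leaves the LLM score in charge.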

Cost of the value-scoring pass

Already absorbed in the per-note Haiku Layer-2 call (the value_signal block adds ~200 tokens to the response). No extra calls needed. So the 24K-note enrichment cost stays in the $25-35 range total (notes Layer-2 + images Layer-2 + videos Layer-2 + thumbs + captioning).

The “1% gold” surface

Within the 5% diamonds, the top 1% (≈240 notes) is the irreducible cultural artifact of 17 years of Dan’s thinking. Worth a hand-curated experience: maybe a printed book, maybe a dedicated /evernote/canon/ page with annotated commentary, maybe an Opus-summarized “what you’ve learned” retrospective. That’s downstream of the substrate landing — but worth noting up front so the substrate captures enough metadata to support it (writing style, recurring themes, evolution over time).

After ingest — downstream surfaces

Once substrate lands, the existing Haiku Layer-2 pattern applies one-shot:

pa_purchases_layer2.py --source evernote --run

Per-note rich-JSON enrichment: notebook taxonomy, content type (receipt | recipe | meeting-note | inspiration | reference | journal | photo-log), entities mentioned, places, dates, people, sentiment, action-items present, attachments referenced. Same auto_tags_l2 field pattern. Same disagreement-driven Layer-1 expansion afterward.

Then renderer surfaces:
- /evernote/: dashboard (notebook tree, count + size + last-touched per notebook)
- /evernote/notebooks/<slug>/: notebook landing
- /evernote/notes/<date>_<slug>/: per-note page (markdown body + resource grid + version timeline)
- /evernote/_diagnostics/notebook-coverage.html: equivalent of address-audit for notebook attribution
- /evernote/tags/<tag>/: tag-listing pages
- /evernote/timeline/: 17-year chronological scroll (since-2009 substrate is gold for retrospectives)
- /evernote/locations/: geo scatter of notes that have lat/lon (mobile-captured)

Build trigger

After the Amazon Layer-2 render surfaces ship. ETA: tomorrow morning if focused; otherwise queued behind the eBay ingest + 2011-2015 audrey eras sweep.

Cross-references

Source: parked_sketch_evernote_one_shot_ingester_2026-05-22.md · Rendered 2026-05-23 00:44