Bulk content cleanup pipeline — sketch parked for Evernote MCP + client engagements
DARE.CO.UK · PARKED SKETCH · 2026-05-18
Mirrored from ~/.claude/.../memory/project_bulk_content_cleanup_pipeline_parked.md. This is a design sketch parked for future build — read for context, not as a current deliverable.
2026-05-18 sketch. Multi-stage pipeline for ingesting messy content corpora (Evernote 24k notes, client directories of HTML/PDF/Word/scans), normalizing into a common shape, applying renaming + SEO-friendly slugification + tag inference + dedup + pruning candidates, and exporting a clean output corpus. Personal trigger: Dan’s 24k stale Evernote records when MCP releases. Real value: identical engine handles “client gives you a folder of legacy content” — the recurring shape of fix-up engagements.
Trigger for unparking: the first qualifying input materialises, i.e. any of:
- Evernote MCP server is publicly released AND Dan wants to pull a 1% test slice (~240 of 24k records), OR
- A client engagement lands with a “here’s the directory of legacy content, normalize it” ask, OR
- A portfolio brand’s own archive needs the same treatment (dare’s WordPress export already passed through this kind of process; dogwood/audrey will).
Goal: turn N records of messy provenance into N’ records (N’ ≤ N after dedup + pruning) with:
- Descriptive, slugified filenames (per feedback_internal_seo.md)
- Normalized metadata (created/updated dates, tags, source URLs, attachments)
- Inferred classification (type, topic, freshness, action-required)
- Audit trail showing every transformation, with uncertain decisions flagged for human review
- Pruning candidate list (obvious stale / empty / duplicate) for review-before-delete
Architecture — 5-stage pipeline
┌────────┐   ┌───────────┐   ┌───────┐   ┌─────────┐   ┌────────┐
│ INGEST │ → │ NORMALIZE │ → │ AUDIT │ → │  APPLY  │ → │ EXPORT │
└────────┘   └───────────┘   └───────┘   └─────────┘   └────────┘
    ↓              ↓             ↓            ↓            ↓
 source-       common       per-record   transformed   output
 specific      record       proposal     records +     corpus
 adapter       shape        + flags      audit trail   (md folder
                                                       + index)
Stage 1: INGEST — source-specific adapters
Plugin layer per source-of-content. Each adapter exposes iter_records() → yields raw records in source-native shape.
FIRST MOVE WHEN UNPARKING (Dan, 2026-05-18 evening): before writing ANY adapter, do a GitHub sweep for existing open-source work:
- evernote mcp / enex parser / enex to markdown: high probability someone else is already maintaining this, especially once the Evernote MCP server is announced. Adopt their work + contribute back any portability fixes; don’t re-derive.
- Specific candidates to look at: evernote2md (well-maintained), enex-dump, any mcp-server-evernote / evernote-mcp-server repos that materialise post-release.
- Same instinct applies to Notion / Dropbox / iCloud adapters when those are added.
The build-vs-borrow calculus (per feedback_evaluate_managed_services_before_build.md): if a maintained OSS project handles the ingest reliably, our adapter shim is ~20 LOC wrapping their CLI output into our normalized record shape. If we’d be the maintainer of source-specific parsing logic, we own a debt — every Evernote format change is our problem. Reuse the format-handling layer; OWN the audit + apply + classification logic, which is where the value lives anyway.
Initial adapters:
- Evernote MCP — when released, expose notes via MCP server. Adapter likely wraps an existing evernote2md / enex-dump style tool’s output, then re-shapes into our normalized record.
- Generic directory — walks a folder of .md / .html / .pdf / .docx / .txt, extracts text via Pandoc / pdfplumber / docx parsers (all existing OSS — same reuse stance).
- Notion export — likely-future; Notion’s export is structured enough for direct parsing; check for notion-to-md style tools.
- Dropbox folder — same shape as generic directory; the Dropbox MCP (when stable) replaces filesystem walk with API.
- iCloud Notes — harder (no good export); deferred.
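A minimal sketch of the adapter layer, assuming each adapter is a class with the iter_records() method described above. The SourceAdapter protocol, the DirectoryAdapter class, and the raw-record keys (path, text, mtime, size) are hypothetical names chosen for illustration. This version only reads .md and .txt; Pandoc or pdfplumber extraction for the other formats would slot in behind the same method.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Iterator, Protocol


class SourceAdapter(Protocol):
    """Anything that can yield raw records in source-native shape."""

    def iter_records(self) -> Iterator[dict[str, Any]]: ...


class DirectoryAdapter:
    """Hypothetical generic-directory adapter (plain-text files only)."""

    TEXT_SUFFIXES = {".md", ".txt"}

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def iter_records(self) -> Iterator[dict[str, Any]]:
        # sorted() keeps the record order stable across re-runs.
        for path in sorted(self.root.rglob("*")):
            if not path.is_file() or path.suffix.lower() not in self.TEXT_SUFFIXES:
                continue
            stat = path.stat()
            yield {
                "path": str(path.relative_to(self.root)),
                "text": path.read_text(encoding="utf-8", errors="replace"),
                "mtime": stat.st_mtime,
                "size": stat.st_size,
            }
```

An Evernote adapter built on a borrowed tool would likely point the same class at that tool's markdown output folder, which is what keeps the shim small.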
Stage 2: NORMALIZE — common record shape
Every record across all adapters lands in one shape:
{
"id": str, # source-native ID (stable for re-runs)
"title": str, # source-native title (may need re-derivation)
"body": str, # canonical markdown body
"created": date, # earliest known date (filename / file mtime / Evernote / content)
"updated": date, # latest known date
"tags": list[str], # source-native tags
"source_url": str | None, # original URL if web-clipped
"attachments": list[dict], # {filename, mime, content_bytes_or_uri}
"raw_meta": dict, # everything else from the source, preserved for traceability
}
This is the contract — every downstream stage works against this shape only. Adapters are the only thing that knows source-native quirks.
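One way to make the contract enforceable is a dataclass that mirrors the shape above. This is a sketch, not a settled design: the class name NormalizedRecord and the two validation rules are assumptions. The date check follows from "earliest known" and "latest known", but the source does not say the pipeline should reject such records rather than flag them.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date
from typing import Any


@dataclass
class NormalizedRecord:
    id: str                       # source-native ID (stable for re-runs)
    title: str                    # source-native title
    body: str                     # canonical markdown body
    created: date                 # earliest known date
    updated: date                 # latest known date
    tags: list[str] = field(default_factory=list)
    source_url: str | None = None
    attachments: list[dict[str, Any]] = field(default_factory=list)
    raw_meta: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumed rules; an adapter that cannot satisfy them has a bug.
        if not self.id:
            raise ValueError("id must be non-empty so re-runs stay stable")
        if self.updated < self.created:
            raise ValueError("updated cannot precede created")


record = NormalizedRecord(
    id="note-1",
    title="Untitled",
    body="",
    created=date(2019, 11, 22),
    updated=date(2020, 1, 5),
)
```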
Stage 3: AUDIT — per-record proposals + flags
For each normalized record, compute proposed transformations WITHOUT applying them:
- Title cleanup: strip CleanShot-style autonames, expand contractions, sentence-case, drop bracketed metadata. Output: proposed_title.
- Slug: lowercase-hyphenated from proposed_title. Output: proposed_slug. Collision-check against other records in this batch.
- Date inference: look at created/updated/filename/body for the canonical date. If multiple sources disagree, flag for review. Output: proposed_date + date_confidence (high/medium/low).
- Tag inference: combine source tags + content-derived tags (topic classification via Claude API on a snippet; type classification as note / clipping / receipt / reference / todo / ephemera). Output: proposed_tags + tag_source (manual/inferred).
- Body cleanup: strip Evernote/WordPress chrome, normalize whitespace, fix smart-quotes, anglicize spellings if dare-bound. Output: proposed_body.
- Classification: evergreen / time-bound / stale / duplicate-suspect / empty-fragment. Output: classification.
- Pruning candidate flag: True only if all of these hold: no body beyond a URL, no tags, no inbound references, last touched > 5 years ago. A single signal is not enough.
- Duplicate cluster: content-hash + near-duplicate detection (Levenshtein on titles, simhash on bodies). Output: duplicate_cluster_id if grouped.
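The slug proposal and its batch collision check can be sketched as follows. The function names are hypothetical, and the numeric-suffix policy (-2, -3, ...) is an assumption: the sketch above only says collisions are checked, so flagging them for review instead would be an equally valid reading.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Lowercase-hyphenated slug built from ASCII letters and digits only."""
    # Fold accented characters to ASCII; anything that cannot fold is dropped.
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    return slug or "untitled"


def assign_slugs(titles: list[str]) -> list[str]:
    """Return one slug per title, suffixing -2, -3, ... on collisions."""
    seen: set[str] = set()
    slugs: list[str] = []
    for title in titles:
        base = slugify(title)
        candidate, n = base, 1
        while candidate in seen:
            n += 1
            candidate = f"{base}-{n}"
        seen.add(candidate)
        slugs.append(candidate)
    return slugs


print(slugify("Keira Knightley: Portrait Recovery!"))
# keira-knightley-portrait-recovery
```

Because suffixes depend on iteration order, the batch would need a stable record order (for example sorted by id) for the proposals to stay identical across re-runs.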
Output of audit stage:
- ~/Downloads/dare_content_cleanup_audit_<DATE>.md — human-readable summary
- ~/Downloads/dare_content_cleanup_proposals_<DATE>.json — machine-readable per-record proposals
The audit ships first; apply waits for human review per the dare_lost_image_audit pattern.
Stage 4: APPLY — --apply rewrites
Reads the proposals JSON, applies transformations, writes the output corpus. Idempotent — re-running with the same input doesn’t change anything if proposals haven’t shifted.
- Renaming happens here (filename = {proposed_date}-{proposed_slug}.md).
- Body rewrites happen here (proposed_body replaces the record's original body).
- Tag updates happen here.
- Pruning candidates are NOT auto-deleted: they get moved to a _pruning/ subdir for human review-before-delete (the lost-cats-stray-cats pattern: never lose anything irreversibly).
- Duplicate clusters get a _duplicates/<cluster_id>/ subdir containing all variants + a MERGE_HERE.md placeholder for human review.
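A minimal sketch of the idempotent write step, assuming each proposal is a dict carrying proposed_date, proposed_slug and proposed_body. The pruning_candidate key and both function names are hypothetical. Tag updates and duplicate clusters are left out to keep the idea visible: a file is only written when its content would actually change.

```python
from pathlib import Path


def target_path(out_dir: Path, proposal: dict) -> Path:
    name = f"{proposal['proposed_date']}-{proposal['proposed_slug']}.md"
    # Pruning candidates are parked for human review, never deleted.
    base = out_dir / "_pruning" if proposal.get("pruning_candidate") else out_dir
    return base / name


def apply_proposals(proposals: list[dict], out_dir: Path) -> list[Path]:
    """Write each proposal to disk and return only the paths that changed."""
    changed: list[Path] = []
    for proposal in proposals:
        path = target_path(out_dir, proposal)
        body = proposal["proposed_body"]
        if path.exists() and path.read_text(encoding="utf-8") == body:
            continue  # identical content already in place, nothing to do
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(body, encoding="utf-8")
        changed.append(path)
    return changed
```

Running apply_proposals twice over the same proposals returns an empty list the second time, which is a cheap way to check the idempotency claim on a sample.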
Stage 5: EXPORT — output corpus shape
Final shape (configurable per use case):
output/
├── 2026-05-18-keira-knightley-portrait-recovery.md
├── 2026-05-17-edge-health-toggle-shipped.md
├── ...
├── _pruning/
│   ├── 2018-03-04-empty-clip-from-medium.md    (3 lines, no body)
│   └── 2019-11-22-amazon-receipt-12345.md
├── _duplicates/
│   ├── cluster_001/
│   │   ├── 2020-06-12-keira-clip-v1.md
│   │   ├── 2020-06-13-keira-clip-v2.md
│   │   └── MERGE_HERE.md
├── _index.md          # tag index + classification breakdown
└── _audit-trail.md    # every transformation log line
Configurable: per-tag subdirectories, hierarchical by year/month, hybrid.
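The layout options could reduce to one function that maps a record's date, slug and primary tag to a relative path. The layout names (flat, by-date, by-tag, hybrid) and the output_path function are hypothetical, and reading "hybrid" as tag then year/month is a guess, since the sketch does not define it.

```python
from __future__ import annotations

from datetime import date
from pathlib import PurePosixPath


def output_path(
    day: date, slug: str, layout: str = "flat", tag: str | None = None
) -> PurePosixPath:
    """Relative path of one record inside the output corpus."""
    filename = f"{day.isoformat()}-{slug}.md"
    year, month = f"{day:%Y}", f"{day:%m}"
    tag_dir = tag or "_untagged"  # assumed fallback for records with no tags
    if layout == "flat":
        return PurePosixPath(filename)
    if layout == "by-date":
        return PurePosixPath(year, month, filename)
    if layout == "by-tag":
        return PurePosixPath(tag_dir, filename)
    if layout == "hybrid":
        return PurePosixPath(tag_dir, year, month, filename)
    raise ValueError(f"unknown layout: {layout!r}")


print(output_path(date(2026, 5, 17), "edge-health-toggle-shipped", "by-date"))
# 2026/05/2026-05-17-edge-health-toggle-shipped.md
```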
Cross-portfolio applicability
The pipeline is the same engine for several recurring needs:
| Use case | Source | Output shape |
|---|---|---|
| Dan’s Evernote 24k cleanup | Evernote MCP | Personal knowledge corpus → flat md folder, tags index |
| Client legacy-content engagement | client-supplied directory | Client deliverable — restructured corpus + audit |
| dare archive maintenance | WordPress export sub-folder | Same as today’s dare_migrate_articles, refactored as pipeline |
| dogwood photo + caption archive | Dropbox folder | Dog journal entries, tagged + dated |
| audrey product-description rewrite | Shopify CSV export | Listings → clean markdown, tagged by collection |
The adapter abstraction is what makes it portable. Each new use case = one new adapter + maybe one classification-rule tweak.
Toolkit naming + conventions
- Script name: dare_content_cleanup.py (dare_ prefix because canonical home is dare’s toolkit; the engine is brand-agnostic but the host is dare’s)
- Per-source CLI: --source evernote|directory|notion|dropbox
- Per-target CLI: --out <path> (defaults to ~/Downloads/dare_content_cleanup_output_<DATE>/)
- Audit-first: --dry-run default (per feedback_audit_first_then_batch.md)
- Apply: explicit --apply
- Sub-modes: --audit-only, --apply-only (read existing proposals JSON)
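The flags above map onto argparse roughly as follows. This sketch makes three assumptions: the four mode flags are mutually exclusive, --source is required, and <DATE> is an ISO date. None of these is settled in the sketch above.

```python
import argparse
from datetime import date
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="dare_content_cleanup.py")
    parser.add_argument(
        "--source",
        required=True,
        choices=["evernote", "directory", "notion", "dropbox"],
    )
    # Assumption: <DATE> in the default output path is an ISO date.
    default_out = (
        Path.home() / "Downloads" / f"dare_content_cleanup_output_{date.today().isoformat()}"
    )
    parser.add_argument("--out", type=Path, default=default_out)

    # All four flags write to the same "mode" attribute; at most one may be given.
    mode = parser.add_mutually_exclusive_group()
    for flag in ("dry-run", "apply", "audit-only", "apply-only"):
        mode.add_argument(f"--{flag}", dest="mode", action="store_const", const=flag)
    parser.set_defaults(mode="dry-run")  # audit-first: nothing is written by default
    return parser


args = build_parser().parse_args(["--source", "directory"])
print(args.mode)
# dry-run
```

Keeping dry-run as the fallback means a forgotten flag can never rewrite a corpus; writing requires typing --apply.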
Open design questions (decide when unparking)
- Classification quality. Topic/type inference via Claude API will probably be only ~80% right (a guess, not a measurement). Worth pre-flighting on a 50-record sample before committing to a model + prompt.
- Dedup threshold. Simhash + Levenshtein cutoffs need calibration. Too aggressive = lose distinct records; too loose = miss real dupes.
- Pruning conservatism. A “delete me” recommendation should require multiple signals (no body + no tags + > 5 years + no inbound). Single signal = too aggressive.
- Body classification of attachments. Receipts, scans, PDFs — text extract or store as-is? Probably text-extract for search, store-as-is for fidelity.
- Reversibility. Output corpus + audit-trail.md should be enough to reconstruct the original state. Test on a 100-record sample before running on 24k.
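The pruning-conservatism question can be made concrete with a signal counter, where the number of required signals is the knob to calibrate. Everything named here (PruneSignals, body_is_empty, is_pruning_candidate, min_signals) is hypothetical, and the 365-day year is an approximation.

```python
import re
from dataclasses import dataclass
from datetime import date


def body_is_empty(body: str) -> bool:
    """True when the body is blank or nothing but a single bare URL."""
    return re.fullmatch(r"\s*(?:https?://\S+)?\s*", body) is not None


@dataclass
class PruneSignals:
    no_body: bool      # nothing beyond a bare URL
    no_tags: bool
    no_inbound: bool   # no other record links to this one
    last_touched: date


def is_pruning_candidate(
    signals: PruneSignals,
    today: date,
    min_age_years: int = 5,
    min_signals: int = 4,
) -> bool:
    """Flag a record only when at least min_signals of the four signals fire."""
    stale = (today - signals.last_touched).days > min_age_years * 365
    hits = sum([signals.no_body, signals.no_tags, signals.no_inbound, stale])
    return hits >= min_signals


clip = PruneSignals(
    no_body=body_is_empty("https://medium.com/some-clip\n"),
    no_tags=True,
    no_inbound=True,
    last_touched=date(2018, 3, 4),
)
print(is_pruning_candidate(clip, today=date(2026, 5, 18)))
# True
```

With the default of 4, every signal must agree before a record lands in _pruning/. Lowering min_signals on the test sample shows how many extra records each relaxation would sweep in.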
Why this is high-value for client work
Every legacy-site / legacy-corpus engagement starts with the same friction: “here’s the messy reality, what do we do with it?” The pipeline turns that into a structured deliverable:
- Week 1 of an engagement → audit. Stakeholders see proposed structure before any change. Builds trust.
- Week 2 → apply on a sample (say, 10% of corpus). Stakeholders confirm direction.
- Week 3+ → apply on full corpus + iterate on classification rules per stakeholder feedback.
The audit trail IS the engagement’s deliverable. Self-documenting work.
Sibling memories
- feedback_internal_seo.md: the naming-discipline foundation this pipeline implements at bulk scale.
- feedback_audit_first_then_batch.md: the audit-then-apply pattern this pipeline structurally embeds.
- feedback_audit_js_dom_coupling_before_canonical_patches.md: see-once-audit, see-twice-build. Dan’s 24k is the first concrete user; the script earns its build cost on first real use.
- project_dare_messaging_service_v1_built.md: same shape as notify-portfolio (thin shim over substrate work, portable per-source).
- user_lost_cats_stray_cats_archival_recovery.md: the never-lose-anything-irreversibly stance; _pruning/ subdir + human review is its application here.
- feedback_evaluate_managed_services_before_build.md: before building, check whether off-the-shelf tools (Logseq import, Obsidian importer, NotePlan converter) handle the use case. Last time I checked, none handle the AUDIT-first step or the cross-source adapter layer; the build earns its keep there.
Resume conditions
- ✅ Evernote MCP released (Dan’s primary trigger).
- ✅ First client engagement that includes “normalize this content directory” in scope.
- ✅ Portfolio brand archive needs a structured cleanup pass (e.g. dogwood’s existing photo + caption directory).
- Earliest qualifying trigger gets the V1 build; subsequent triggers exercise the adapter-portability claim.