Bulk content cleanup pipeline — sketch parked for Evernote MCP + client engagements

DARE.CO.UK · PARKED SKETCH · 2026-05-18

Mirrored from ~/.claude/.../memory/project_bulk_content_cleanup_pipeline_parked.md. This is a design sketch parked for future build — read for context, not as a current deliverable.

2026-05-18 sketch. Multi-stage pipeline for ingesting messy content corpora (Evernote 24k notes, client directories of HTML/PDF/Word/scans), normalizing into a common shape, applying renaming + SEO-friendly slugification + tag inference + dedup + pruning candidates, and exporting a clean output corpus. Personal trigger: Dan’s 24k stale Evernote records when MCP releases. Real value: identical engine handles “client gives you a folder of legacy content” — the recurring shape of fix-up engagements.

Trigger for unparking (first qualifying input materialises):
- Evernote MCP server is publicly released AND Dan wants to pull a 1% test slice (~240 of 24k records), OR
- A client engagement lands with a “here’s the directory of legacy content, normalize it” ask, OR
- A portfolio brand’s own archive needs the same treatment (dare’s WordPress export already passed through this kind of process; dogwood/audrey will).

Goal: turn N records of messy provenance into N’ records (N’ ≤ N after dedup + pruning) with:
- Descriptive, slugified filenames (per feedback_internal_seo.md)
- Normalized metadata (created/updated dates, tags, source URLs, attachments)
- Inferred classification (type, topic, freshness, action-required)
- Audit trail showing every transformation, with uncertain decisions flagged for human review
- Pruning candidate list (obvious stale / empty / duplicate) for review-before-delete

Architecture — 5-stage pipeline

┌────────┐   ┌───────────┐   ┌───────┐   ┌─────────┐   ┌────────┐
│ INGEST │ → │ NORMALIZE │ → │ AUDIT │ → │  APPLY  │ → │ EXPORT │
└────────┘   └───────────┘   └───────┘   └─────────┘   └────────┘
   ↓             ↓               ↓            ↓             ↓
 source-       common         per-record   transformed    output
 specific      record         proposal     records +      corpus
 adapter       shape          + flags      audit trail    (md folder
                                                          + index)

Stage 1: INGEST — source-specific adapters

Plugin layer per source-of-content. Each adapter exposes iter_records() → yields raw records in source-native shape.

FIRST MOVE WHEN UNPARKING (Dan, 2026-05-18 evening): before writing ANY adapter, do a GitHub sweep for existing open-source work:

evernote mcp / enex parser / enex to markdown — high probability someone else is already maintaining this, especially once the Evernote MCP server is announced. Adopt their work + contribute back any portability fixes; don’t re-derive.
Specific candidates to look at: evernote2md (well-maintained), enex-dump, any mcp-server-evernote / evernote-mcp-server repos that materialise post-release.
Same instinct applies to Notion / Dropbox / iCloud adapters when those are added.

The build-vs-borrow calculus (per feedback_evaluate_managed_services_before_build.md): if a maintained OSS project handles the ingest reliably, our adapter shim is ~20 LOC wrapping their CLI output into our normalized record shape. If we’d be the maintainer of source-specific parsing logic, we own a debt — every Evernote format change is our problem. Reuse the format-handling layer; OWN the audit + apply + classification logic, which is where the value lives anyway.

Initial adapters:
- Evernote MCP: when released, expose notes via MCP server. Adapter likely wraps an existing evernote2md / enex-dump style tool’s output, then re-shapes into our normalized record.
- Generic directory: walks a folder of .md / .html / .pdf / .docx / .txt, extracts text via Pandoc / pdfplumber / docx parsers (all existing OSS, same reuse stance).
- Notion export: likely-future; Notion’s export is structured enough for direct parsing; check for notion-to-md style tools.
- Dropbox folder: same shape as generic directory; the Dropbox MCP (when stable) replaces filesystem walk with API.
- iCloud Notes: harder (no good export); deferred.
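A minimal sketch of the adapter contract, assuming a Protocol-based plugin layer. Only iter_records() comes from the sketch above; SourceAdapter, DirectoryAdapter, TEXT_SUFFIXES and the raw-record keys (path, text, mtime) are hypothetical names. The directory adapter here reads plain-text files only; .html / .pdf / .docx extraction via Pandoc or pdfplumber is deliberately left out.

```python
from pathlib import Path
from typing import Any, Iterator, Protocol


class SourceAdapter(Protocol):
    """Anything that can yield raw records in source-native shape."""

    def iter_records(self) -> Iterator[dict[str, Any]]: ...


class DirectoryAdapter:
    """Hypothetical generic-directory adapter: one raw record per text file."""

    # Assumption: only plain-text suffixes are handled in this sketch.
    TEXT_SUFFIXES = {".md", ".txt"}

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def iter_records(self) -> Iterator[dict[str, Any]]:
        # sorted() keeps the walk order stable across re-runs.
        for path in sorted(self.root.rglob("*")):
            if not path.is_file() or path.suffix.lower() not in self.TEXT_SUFFIXES:
                continue
            yield {
                "path": str(path.relative_to(self.root)),
                "text": path.read_text(encoding="utf-8", errors="replace"),
                "mtime": path.stat().st_mtime,
            }


def collect(adapter: SourceAdapter) -> list[dict[str, Any]]:
    """Downstream code depends only on the protocol, never on a concrete adapter."""
    return list(adapter.iter_records())
```

An Evernote or Notion adapter would satisfy the same protocol by wrapping an existing tool's output, so nothing downstream needs to change when a source is added.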

Stage 2: NORMALIZE — common record shape

Every record across all adapters lands in one shape:

{
    "id": str,            # source-native ID (stable for re-runs)
    "title": str,         # source-native title (may need re-derivation)
    "body": str,          # canonical markdown body
    "created": date,      # earliest known date (filename / file mtime / Evernote / content)
    "updated": date,      # latest known date
    "tags": list[str],    # source-native tags
    "source_url": str | None,  # original URL if web-clipped
    "attachments": list[dict], # {filename, mime, content_bytes_or_uri}
    "raw_meta": dict,     # everything else from the source, preserved for traceability
}

This is the contract — every downstream stage works against this shape only. Adapters are the only thing that knows source-native quirks.
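One way to make that contract enforceable is a dataclass mirroring the fields listed above. This is a sketch, not a settled design: the class name NormalizedRecord and the created/updated ordering check are assumptions (the check follows from "earliest known" and "latest known", but the sketch above does not mandate it).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date
from typing import Any


@dataclass
class NormalizedRecord:
    id: str                      # source-native ID (stable for re-runs)
    title: str                   # source-native title
    body: str                    # canonical markdown body
    created: date                # earliest known date
    updated: date                # latest known date
    tags: list[str] = field(default_factory=list)
    source_url: str | None = None
    attachments: list[dict[str, Any]] = field(default_factory=list)
    raw_meta: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumption: an adapter that emits updated < created has a bug.
        if self.updated < self.created:
            raise ValueError(f"record {self.id!r}: updated precedes created")
```

Failing loudly at construction keeps source-native quirks inside the adapter that produced them.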

Stage 3: AUDIT — per-record proposals + flags

For each normalized record, compute proposed transformations WITHOUT applying them:

Title cleanup: strip CleanShot-style autonames, expand contractions, sentence-case, drop bracketed metadata. Output: proposed_title.
Slug: lowercase-hyphenated from proposed_title. Output: proposed_slug. Collision-check against other records in this batch.
Date inference: Look at created/updated/filename/body for the canonical date. If multiple sources disagree, flag for review. Output: proposed_date + date_confidence (high/medium/low).
Tag inference: Combine source tags + content-derived tags (topic classification via Claude API on snippet, type classification — note / clipping / receipt / reference / todo / ephemera). Output: proposed_tags + tag_source (manual/inferred).
Body cleanup: strip Evernote/WordPress chrome, normalize whitespace, fix smart-quotes, anglicize spellings if dare-bound. Output: proposed_body.
Classification: evergreen / time-bound / stale / duplicate-suspect / empty-fragment. Output: classification.
Pruning candidate flag: True only if all of these hold: no body beyond a URL, no tags, no inbound references, and last touched more than 5 years ago.
Duplicate cluster: content-hash + near-duplicate detection (Levenshtein on titles, simhash on bodies). Output: duplicate_cluster_id if grouped.
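The slug step and its batch collision check could be sketched as below. The function names and the slug_collision key are hypothetical; folding accented characters to ASCII and falling back to "untitled" for empty slugs are assumptions rather than decisions recorded in this sketch.

```python
import re
import unicodedata
from collections import Counter


def slugify(title: str) -> str:
    """Lowercase-hyphenated slug; non-ASCII characters are folded or dropped."""
    folded = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-")
    return slug or "untitled"


def propose_slugs(titles: dict[str, str]) -> dict[str, dict]:
    """Map record id -> proposed slug plus a collision flag within this batch."""
    slugs = {rid: slugify(title) for rid, title in titles.items()}
    counts = Counter(slugs.values())
    return {
        rid: {"proposed_slug": slug, "slug_collision": counts[slug] > 1}
        for rid, slug in slugs.items()
    }
```

Nothing is renamed here: a collision is only flagged, which keeps the audit stage free of side effects.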

Output of audit stage:
- ~/Downloads/dare_content_cleanup_audit_<DATE>.md (human-readable summary)
- ~/Downloads/dare_content_cleanup_proposals_<DATE>.json (machine-readable per-record proposals)
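A sketch of writing the two audit artefacts, assuming <DATE> is an ISO 8601 date and that each proposal is a plain dict. The function name, the pruning_candidate key and the summary layout are hypothetical; only the filename patterns come from the text above.

```python
import json
from datetime import date
from pathlib import Path


def write_audit_outputs(proposals: list[dict], out_dir: Path, run_date: date) -> tuple[Path, Path]:
    """Write the human-readable summary and the machine-readable proposals."""
    stamp = run_date.isoformat()  # assumption: <DATE> means YYYY-MM-DD
    out_dir.mkdir(parents=True, exist_ok=True)
    md_path = out_dir / f"dare_content_cleanup_audit_{stamp}.md"
    json_path = out_dir / f"dare_content_cleanup_proposals_{stamp}.json"

    # sort_keys keeps the JSON stable so re-runs diff cleanly.
    json_path.write_text(
        json.dumps(proposals, indent=2, sort_keys=True, ensure_ascii=False),
        encoding="utf-8",
    )

    flagged = [p for p in proposals if p.get("pruning_candidate")]
    lines = [
        f"Content cleanup audit {stamp}",
        "",
        f"Records audited: {len(proposals)}",
        f"Pruning candidates: {len(flagged)}",
    ]
    lines += [f"- {p['id']}: {p.get('proposed_title', '')}" for p in flagged]
    md_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return md_path, json_path
```

Passing out_dir explicitly, rather than hard-coding ~/Downloads, makes the writer testable against a temporary directory.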

The audit ships first; apply waits for human review per the dare_lost_image_audit pattern.

Stage 4: APPLY — `--apply` rewrites

Reads the proposals JSON, applies transformations, writes the output corpus. Idempotent — re-running with the same input doesn’t change anything if proposals haven’t shifted.

Renaming happens here (filename = {proposed_date}-{proposed_slug}.md).
Body rewrites happen here (proposed_body replaces the record’s original body).
Tag updates happen here.
Pruning candidates are NOT auto-deleted — they get moved to a _pruning/ subdir for human review-before-delete (the lost-cats-stray-cats pattern: never lose anything irreversibly).
Duplicate clusters get a _duplicates/<cluster_id>/ subdir containing all variants + a MERGE_HERE.md placeholder for human review.
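The routing and idempotency rules above might look like this in outline. It assumes each proposal dict carries proposed_date, proposed_slug and proposed_body, plus optional duplicate_cluster_id and pruning_candidate keys; the function names are hypothetical, and giving duplicate clusters precedence over pruning is an assumption. Directory names follow the export layout (_pruning/, _duplicates/).

```python
from pathlib import Path


def target_path(out_root: Path, proposal: dict) -> Path:
    """Decide where a record lands; nothing is ever deleted."""
    name = f"{proposal['proposed_date']}-{proposal['proposed_slug']}.md"
    if proposal.get("duplicate_cluster_id"):
        return out_root / "_duplicates" / proposal["duplicate_cluster_id"] / name
    if proposal.get("pruning_candidate"):
        return out_root / "_pruning" / name
    return out_root / name


def apply_proposal(out_root: Path, proposal: dict) -> bool:
    """Write one record; return True only if the file on disk changed."""
    path = target_path(out_root, proposal)
    content = proposal["proposed_body"]
    path.parent.mkdir(parents=True, exist_ok=True)
    if proposal.get("duplicate_cluster_id"):
        placeholder = path.parent / "MERGE_HERE.md"
        if not placeholder.exists():
            placeholder.write_text("", encoding="utf-8")
    # Idempotency: skip the write when the content is already in place.
    if path.exists() and path.read_text(encoding="utf-8") == content:
        return False
    path.write_text(content, encoding="utf-8")
    return True
```

Returning a changed/unchanged flag gives the audit trail a cheap way to show that a re-run was a no-op.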

Stage 5: EXPORT — output corpus shape

Final shape (configurable per use case):

output/
├── 2026-05-18-keira-knightley-portrait-recovery.md
├── 2026-05-17-edge-health-toggle-shipped.md
├── ...
├── _pruning/
│   ├── 2018-03-04-empty-clip-from-medium.md  (3 lines, no body)
│   └── 2019-11-22-amazon-receipt-12345.md
├── _duplicates/
│   ├── cluster_001/
│   │   ├── 2020-06-12-keira-clip-v1.md
│   │   ├── 2020-06-13-keira-clip-v2.md
│   │   └── MERGE_HERE.md
├── _index.md                  # tag index + classification breakdown
└── _audit-trail.md            # every transformation log line

Configurable: per-tag subdirectories, hierarchical by year/month, hybrid.

Cross-portfolio applicability

The pipeline is the same engine for several recurring needs:

Use case	Source	Output shape
Dan’s Evernote 24k cleanup	Evernote MCP	Personal knowledge corpus → flat md folder, tags index
Client legacy-content engagement	client-supplied directory	Client deliverable — restructured corpus + audit
dare archive maintenance	WordPress export sub-folder	Same as today’s dare_migrate_articles, refactored as pipeline
dogwood photo + caption archive	Dropbox folder	Dog journal entries, tagged + dated
audrey product-description rewrite	Shopify CSV export	Listings → clean markdown, tagged by collection

The adapter abstraction is what makes it portable. Each new use case = one new adapter + maybe one classification-rule tweak.

Toolkit naming + conventions

Script name: dare_content_cleanup.py (dare_ prefix because canonical home is dare’s toolkit; the engine is brand-agnostic but the host is dare’s)
Per-source CLI: --source evernote|directory|notion|dropbox
Per-target CLI: --out <path> (defaults ~/Downloads/dare_content_cleanup_output_<DATE>/)
Audit-first: --dry-run default (per feedback_audit_first_then_batch.md)
Apply: explicit --apply
Sub-modes: --audit-only, --apply-only (read existing proposals JSON)
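The flag conventions above could be wired with argparse roughly as follows. The flag names come from the list; the mutual-exclusion grouping, the ISO date stamp in the default output path, and the function names are assumptions. How --apply-only should interact with the dry-run default, and where it finds the proposals JSON, is not settled by this sketch.

```python
import argparse
from datetime import date
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="dare_content_cleanup.py")
    parser.add_argument(
        "--source", required=True,
        choices=["evernote", "directory", "notion", "dropbox"],
    )
    parser.add_argument("--out", type=Path, default=None)

    # Dry-run is the default; --apply must be asked for explicitly.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--dry-run", dest="apply", action="store_false", default=False)
    mode.add_argument("--apply", dest="apply", action="store_true", default=False)

    # Assumption: the two sub-modes cannot be combined.
    stage = parser.add_mutually_exclusive_group()
    stage.add_argument("--audit-only", action="store_true")
    stage.add_argument("--apply-only", action="store_true")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if args.out is None:
        stamp = date.today().isoformat()
        args.out = Path.home() / "Downloads" / f"dare_content_cleanup_output_{stamp}"
    return args
```

Sharing one dest between --dry-run and --apply means the rest of the script checks a single boolean, args.apply.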

Open design questions (decide when unparking)

Classification quality. Topic/type inference via Claude API is expected to be roughly 80% right (an unmeasured estimate). Worth pre-flighting on a 50-record sample before committing to a model + prompt.
Dedup threshold. Simhash + Levenshtein cutoffs need calibration. Too aggressive = lose distinct records; too loose = miss real dupes.
Pruning conservatism. A “delete me” recommendation should require multiple signals (no body + no tags + > 5 years + no inbound). Single signal = too aggressive.
Body classification of attachments. Receipts, scans, PDFs — text extract or store as-is? Probably text-extract for search, store-as-is for fidelity.
Reversibility. Output corpus + _audit-trail.md should be enough to reconstruct the original state. Test on a 100-record sample before running on 24k.
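For the dedup-threshold question, a small simhash sketch makes the calibration knob concrete. This is a generic 64-bit simhash over word tokens, not a chosen implementation: the blake2b token hash, the unweighted tokens and the max_distance default of 3 are all placeholder assumptions to be tuned on a sample.

```python
import hashlib
import re

BITS = 64


def simhash(text: str) -> int:
    """64-bit simhash: similar token sets yield fingerprints that differ in few bits."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=BITS // 8).digest()
        value = int.from_bytes(digest, "big")
        for i in range(BITS):
            weights[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i, weight in enumerate(weights) if weight > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    # max_distance is the cutoff to calibrate: raising it merges more
    # aggressively, lowering it misses more real duplicates.
    return hamming(simhash(a), simhash(b)) <= max_distance
```

Running this over the sample at several max_distance values, and hand-checking the resulting clusters, is one way to pick the cutoff before touching the full corpus.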

Why this is high-value for client work

Every legacy-site / legacy-corpus engagement starts with the same friction: “here’s the messy reality, what do we do with it?” The pipeline turns that into a structured deliverable:
- Week 1 of an engagement → audit. Stakeholders see proposed structure before any change. Builds trust.
- Week 2 → apply on a sample (say, 10% of corpus). Stakeholders confirm direction.
- Week 3+ → apply on full corpus + iterate on classification rules per stakeholder feedback.

The audit trail IS the engagement’s deliverable. Self-documenting work.

Sibling memories

feedback_internal_seo.md — the naming-discipline foundation this pipeline implements at bulk scale.
feedback_audit_first_then_batch.md — the audit-then-apply pattern this pipeline structurally embeds.
feedback_audit_js_dom_coupling_before_canonical_patches.md — see-once-audit, see-twice-build. Dan’s 24k is the first concrete user; the script earns its build cost on first real use.
project_dare_messaging_service_v1_built.md — same shape as notify-portfolio: thin shim over substrate work, portable per-source.
user_lost_cats_stray_cats_archival_recovery.md — the never-lose-anything-irreversibly stance; _pruning/ subdir + human review is its application here.
feedback_evaluate_managed_services_before_build.md — before building, check whether off-the-shelf tools (Logseq import, Obsidian importer, NotePlan converter) handle the use case. Last time I checked, none handle the AUDIT-first step or the cross-source adapter layer; the build earns its keep there.

Resume conditions

✅ Evernote MCP released (Dan’s primary trigger).
✅ First client engagement that includes “normalize this content directory” in scope.
✅ Portfolio brand archive needs a structured cleanup pass (e.g. dogwood’s existing photo + caption directory).
Earliest qualifying trigger gets the V1 build; subsequent triggers exercise the adapter-portability claim.

Source: parked_sketch_bulk_content_cleanup_pipeline_2026-05-18.md · Rendered 2026-05-18 12:53