Bulk content cleanup pipeline — sketch parked for Evernote MCP + client engagements

DARE.CO.UK · PARKED SKETCH · 2026-05-18

Mirrored from ~/.claude/.../memory/project_bulk_content_cleanup_pipeline_parked.md. This is a design sketch parked for future build — read for context, not as a current deliverable.

2026-05-18 sketch. Multi-stage pipeline for ingesting messy content corpora (Evernote 24k notes, client directories of HTML/PDF/Word/scans), normalizing into a common shape, applying renaming + SEO-friendly slugification + tag inference + dedup + pruning candidates, and exporting a clean output corpus. Personal trigger: Dan’s 24k stale Evernote records when MCP releases. Real value: identical engine handles “client gives you a folder of legacy content” — the recurring shape of fix-up engagements.


Trigger for unparking (first qualifying input materialises):
- Evernote MCP server is publicly released AND Dan wants to pull a 1% test slice (~240 of 24k records), OR
- A client engagement lands with a “here’s the directory of legacy content, normalize it” ask, OR
- A portfolio brand’s own archive needs the same treatment (dare’s WordPress export already passed through this kind of process; dogwood/audrey will).

Goal: turn N records of messy provenance into N’ records (N’ ≤ N after dedup + pruning) with:
- Descriptive, slugified filenames (per feedback_internal_seo.md)
- Normalized metadata (created/updated dates, tags, source URLs, attachments)
- Inferred classification (type, topic, freshness, action-required)
- Audit trail showing every transformation, with uncertain decisions flagged for human review
- Pruning candidate list (obvious stale / empty / duplicate) for review-before-delete

Architecture — 5-stage pipeline

┌────────┐   ┌───────────┐   ┌───────┐   ┌─────────┐   ┌────────┐
│ INGEST │ → │ NORMALIZE │ → │ AUDIT │ → │  APPLY  │ → │ EXPORT │
└────────┘   └───────────┘   └───────┘   └─────────┘   └────────┘
   ↓             ↓               ↓            ↓             ↓
 source-       common         per-record   transformed    output
 specific      record         proposal     records +      corpus
 adapter       shape          + flags      audit trail    (md folder
                                                          + index)

Stage 1: INGEST — source-specific adapters

Plugin layer per source-of-content. Each adapter exposes iter_records() → yields raw records in source-native shape.

FIRST MOVE WHEN UNPARKING (Dan, 2026-05-18 evening): before writing ANY adapter, do a GitHub sweep for existing open-source work:

The build-vs-borrow calculus (per feedback_evaluate_managed_services_before_build.md): if a maintained OSS project handles the ingest reliably, our adapter shim is ~20 LOC wrapping their CLI output into our normalized record shape. If we’d be the maintainer of source-specific parsing logic, we own a debt — every Evernote format change is our problem. Reuse the format-handling layer; OWN the audit + apply + classification logic, which is where the value lives anyway.

Initial adapters:
- Evernote MCP: when released, expose notes via MCP server. Adapter likely wraps an existing evernote2md / enex-dump style tool’s output, then re-shapes into our normalized record.
- Generic directory: walks a folder of .md / .html / .pdf / .docx / .txt, extracts text via Pandoc / pdfplumber / docx parsers (all existing OSS, same reuse stance).
- Notion export: likely-future; Notion’s export is structured enough for direct parsing; check for notion-to-md style tools.
- Dropbox folder: same shape as generic directory; the Dropbox MCP (when stable) replaces filesystem walk with API.
- iCloud Notes: harder (no good export); deferred.
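A minimal sketch of the adapter contract and the generic directory adapter, assuming Python and the standard library only. The names `Adapter` and `DirectoryAdapter` and the raw-record keys (`path`, `text`, `mtime`, `size`) are illustrative, not part of the sketch above; only `iter_records()` comes from it. Non-text formats are deliberately skipped here, since the stance is to delegate those to existing tools.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Iterator, Protocol


class Adapter(Protocol):
    """Hypothetical adapter contract: yield raw records in source-native shape."""

    def iter_records(self) -> Iterator[dict[str, Any]]: ...


class DirectoryAdapter:
    """Walks a folder and yields one raw record per plain-text file.

    Only .md and .txt are read here. Handling .html / .pdf / .docx would be
    delegated to existing tools (Pandoc, pdfplumber, a docx parser), which
    this sketch leaves out.
    """

    TEXT_SUFFIXES = {".md", ".txt"}

    def __init__(self, root: str | Path) -> None:
        self.root = Path(root)

    def iter_records(self) -> Iterator[dict[str, Any]]:
        # Sorted walk so re-runs see records in the same order.
        for path in sorted(self.root.rglob("*")):
            if not path.is_file() or path.suffix.lower() not in self.TEXT_SUFFIXES:
                continue
            stat = path.stat()
            yield {
                "path": str(path.relative_to(self.root)),  # stable ID candidate
                "text": path.read_text(encoding="utf-8", errors="replace"),
                "mtime": stat.st_mtime,
                "size": stat.st_size,
            }
```

Any other source (Evernote, Notion, Dropbox) would be another class with the same `iter_records()` method, so downstream stages never need to know which one produced a record.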

Stage 2: NORMALIZE — common record shape

Every record across all adapters lands in one shape:

{
    "id": str,            # source-native ID (stable for re-runs)
    "title": str,         # source-native title (may need re-derivation)
    "body": str,          # canonical markdown body
    "created": date,      # earliest known date (filename / file mtime / Evernote / content)
    "updated": date,      # latest known date
    "tags": list[str],    # source-native tags
    "source_url": str | None,  # original URL if web-clipped
    "attachments": list[dict], # {filename, mime, content_bytes_or_uri}
    "raw_meta": dict,     # everything else from the source, preserved for traceability
}

This is the contract — every downstream stage works against this shape only. Adapters are the only thing that knows source-native quirks.
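One way the contract could be pinned down in code, as a sketch: a dataclass mirroring the fields above, plus a normalizer for a hypothetical raw file record of the form `{path, text, mtime}`. The `Record` class name, the `normalize_file_record` helper, and the "first non-empty line is the title" rule are assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date, datetime, timezone
from typing import Any


@dataclass
class Record:
    """The normalized record shape from the sketch, expressed as a dataclass."""

    id: str
    title: str
    body: str
    created: date
    updated: date
    tags: list[str] = field(default_factory=list)
    source_url: str | None = None
    attachments: list[dict[str, Any]] = field(default_factory=list)
    raw_meta: dict[str, Any] = field(default_factory=dict)


def normalize_file_record(raw: dict[str, Any]) -> Record:
    """Map a hypothetical raw file record ({path, text, mtime}) onto the contract."""
    modified = datetime.fromtimestamp(raw["mtime"], tz=timezone.utc).date()
    # Assumption: the first non-empty line is the best available title.
    first_line = next((ln.strip() for ln in raw["text"].splitlines() if ln.strip()), "")
    title = first_line.lstrip("# ").strip() or raw["path"]
    return Record(
        id=raw["path"],      # the relative path is stable across re-runs
        title=title,
        body=raw["text"],
        created=modified,    # a bare file only exposes one date
        updated=modified,
        # Everything except the body is preserved for traceability.
        raw_meta={k: v for k, v in raw.items() if k != "text"},
    )
```

Each adapter would get its own small normalizer like this one; nothing after this point should read `raw_meta` for anything other than traceability.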

Stage 3: AUDIT — per-record proposals + flags

For each normalized record, compute proposed transformations WITHOUT applying them:

Output of audit stage:
- ~/Downloads/dare_content_cleanup_audit_<DATE>.md (human-readable summary)
- ~/Downloads/dare_content_cleanup_proposals_<DATE>.json (machine-readable per-record proposals)

The audit ships first; apply waits for human review per the dare_lost_image_audit pattern.
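A sketch of what one per-record proposal might look like, assuming records arrive as dicts in the normalized shape. The `slugify` and `propose` helpers, the flag names, and the proposal keys are illustrative; the only grounded parts are the date-prefixed slug filename pattern and the rule that the audit proposes without applying.

```python
from __future__ import annotations

import json
import re
import unicodedata
from datetime import date


def slugify(title: str, max_words: int = 8) -> str:
    """Lowercase ASCII slug, words joined by hyphens, capped at max_words."""
    ascii_text = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode()
    words = re.findall(r"[a-z0-9]+", ascii_text.lower())
    return "-".join(words[:max_words]) or "untitled"


def propose(record: dict) -> dict:
    """Build a proposal for one normalized record without modifying it."""
    flags = []
    if not record["body"].strip():
        flags.append("empty-body")
    if not record["tags"]:
        flags.append("no-tags")
    return {
        "id": record["id"],
        "proposed_filename": f"{record['created'].isoformat()}-{slugify(record['title'])}.md",
        "flags": flags,
        "needs_review": bool(flags),  # anything flagged goes to a human first
    }


example = {
    "id": "note-001",
    "title": "Keira Knightley: Portrait Recovery",
    "body": "Recovered the portrait from the old export.",
    "created": date(2026, 5, 18),
    "tags": [],
}
print(json.dumps(propose(example), indent=2))
```

The proposals file would then be a JSON list of these dicts, one per record, which is all the apply stage needs to read.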

Stage 4: APPLY — --apply rewrites

Reads the proposals JSON, applies transformations, writes the output corpus. Idempotent — re-running with the same input doesn’t change anything if proposals haven’t shifted.
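The idempotency claim can be made concrete with a small sketch: write a file only when its content would actually change, and report what changed. The `apply_proposals` function, the `bodies` mapping, and the `proposed_filename` key are hypothetical; a real apply step would also handle renames, tags, and the audit trail.

```python
from __future__ import annotations

import json
from pathlib import Path


def apply_proposals(proposals_path: Path, bodies: dict[str, str], out_dir: Path) -> list[str]:
    """Write each proposed file; return the filenames that actually changed.

    `bodies` maps record id -> markdown body. It is a stand-in here; a real
    run would read bodies from the normalized corpus.
    """
    proposals = json.loads(proposals_path.read_text(encoding="utf-8"))
    out_dir.mkdir(parents=True, exist_ok=True)
    changed = []
    for proposal in proposals:
        target = out_dir / proposal["proposed_filename"]
        body = bodies[proposal["id"]]
        # Skip the write when the file already holds exactly this content,
        # so a second run with unchanged proposals touches nothing.
        if target.exists() and target.read_text(encoding="utf-8") == body:
            continue
        target.write_text(body, encoding="utf-8")
        changed.append(target.name)
    return changed
```

Running it twice in a row should return a non-empty list the first time and an empty list the second, which doubles as a cheap regression check for the apply stage.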

Stage 5: EXPORT — output corpus shape

Final shape (configurable per use case):

output/
├── 2026-05-18-keira-knightley-portrait-recovery.md
├── 2026-05-17-edge-health-toggle-shipped.md
├── ...
├── _pruning/
│   ├── 2018-03-04-empty-clip-from-medium.md  (3 lines, no body)
│   └── 2019-11-22-amazon-receipt-12345.md
├── _duplicates/
│   ├── cluster_001/
│   │   ├── 2020-06-12-keira-clip-v1.md
│   │   ├── 2020-06-13-keira-clip-v2.md
│   │   └── MERGE_HERE.md
├── _index.md                  # tag index + classification breakdown
└── _audit-trail.md            # every transformation log line

Configurable: per-tag subdirectories, hierarchical by year/month, hybrid.
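The layout options could reduce to a single path-mapping function, sketched below. The layout names (`flat`, `by-tag`, `by-date`, `hybrid`), the `_untagged` folder, and the choice of "tag, then year" for the hybrid case are all assumptions; the sketch above only names the three styles.

```python
from __future__ import annotations

from datetime import date
from pathlib import PurePosixPath


def output_path(
    filename: str,
    created: date,
    tags: list[str],
    layout: str = "flat",
) -> PurePosixPath:
    """Return the path of one record relative to the output root."""
    # Assumption: the first tag is the primary one; untagged records get a bucket.
    primary_tag = tags[0] if tags else "_untagged"
    if layout == "flat":
        return PurePosixPath(filename)
    if layout == "by-tag":
        return PurePosixPath(primary_tag, filename)
    if layout == "by-date":
        return PurePosixPath(f"{created.year:04d}", f"{created.month:02d}", filename)
    if layout == "hybrid":
        # One possible hybrid: tag folder first, then year.
        return PurePosixPath(primary_tag, f"{created.year:04d}", filename)
    raise ValueError(f"unknown layout: {layout!r}")
```

Keeping the layout decision in one pure function means the apply stage can be re-run under a different layout without touching any other logic.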

Cross-portfolio applicability

The pipeline is the same engine for several recurring needs:

| Use case | Source | Output shape |
|---|---|---|
| Dan’s Evernote 24k cleanup | Evernote MCP | Personal knowledge corpus → flat md folder, tags index |
| Client legacy-content engagement | client-supplied directory | Client deliverable: restructured corpus + audit |
| dare archive maintenance | WordPress export sub-folder | Same as today’s dare_migrate_articles, refactored as pipeline |
| dogwood photo + caption archive | Dropbox folder | Dog journal entries, tagged + dated |
| audrey product-description rewrite | Shopify CSV export | Listings → clean markdown, tagged by collection |

The adapter abstraction is what makes it portable. Each new use case = one new adapter + maybe one classification-rule tweak.

Toolkit naming + conventions

Open design questions (decide when unparking)

  1. Classification quality. Topic/type inference via the Claude API is expected to be roughly 80% right; that figure is an estimate, not a measurement. Worth pre-flighting on a 50-record sample before committing to a model + prompt.
  2. Dedup threshold. Simhash + Levenshtein cutoffs need calibration. Too aggressive = lose distinct records; too loose = miss real dupes.
  3. Pruning conservatism. A “delete me” recommendation should require multiple signals together (no body + no tags + older than 5 years + no inbound links). A single signal is too aggressive.
  4. Body classification of attachments. Receipts, scans, PDFs — text extract or store as-is? Probably text-extract for search, store-as-is for fidelity.
  5. Reversibility. Output corpus + _audit-trail.md should be enough to reconstruct the original state. Test on a 100-record sample before running on 24k.
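Questions 2 and 3 above can be prototyped with the standard library before any calibration run. The sketch below assumes a 64-bit word-level Simhash built on SHA-256, a plain dynamic-programming Levenshtein distance, and a pruning rule that requires all four signals by default. Every threshold and function name here is a placeholder to be tuned on a sample, not a recommendation.

```python
from __future__ import annotations

import hashlib
import re
from datetime import date

BITS = 64  # fingerprint width; matches the 8 digest bytes used below


def simhash(text: str) -> int:
    """Word-level Simhash: similar texts tend to differ in few bits."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (digest >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def levenshtein(a: str, b: str) -> int:
    """Edit distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def likely_duplicates(a: dict, b: dict, max_bits: int = 3, max_title_edits: int = 5) -> bool:
    """Placeholder cutoffs: both body fingerprint and title must be close."""
    close_body = hamming(simhash(a["body"]), simhash(b["body"])) <= max_bits
    close_title = levenshtein(a["title"], b["title"]) <= max_title_edits
    return close_body and close_title


def prune_signals(record: dict, today: date, inbound_links: int) -> list[str]:
    """Collect the independent staleness signals for one record."""
    signals = []
    if not record["body"].strip():
        signals.append("no-body")
    if not record["tags"]:
        signals.append("no-tags")
    if (today - record["updated"]).days > 5 * 365:
        signals.append("older-than-5-years")
    if inbound_links == 0:
        signals.append("no-inbound")
    return signals


def is_prune_candidate(signals: list[str], min_signals: int = 4) -> bool:
    """Assumption: all four signals must agree before proposing deletion."""
    return len(signals) >= min_signals
```

Calibration would then mean running `likely_duplicates` over a labelled sample and adjusting `max_bits` and `max_title_edits` until both false merges and missed duplicates are acceptable.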

Why this is high-value for client work

Every legacy-site / legacy-corpus engagement starts with the same friction: “here’s the messy reality, what do we do with it?” The pipeline turns that into a structured deliverable:
- Week 1 of an engagement → audit. Stakeholders see proposed structure before any change. Builds trust.
- Week 2 → apply on a sample (say, 10% of corpus). Stakeholders confirm direction.
- Week 3+ → apply on full corpus + iterate on classification rules per stakeholder feedback.

The audit trail IS the engagement’s deliverable. Self-documenting work.

Sibling memories

Resume conditions

Source: parked_sketch_bulk_content_cleanup_pipeline_2026-05-18.md · Rendered 2026-05-18 12:53