Audrey’s 4TB photo library — migrate from Google/iCloud-dupe-mess to R2 (parked 2026-05-24)
DARE.CO.UK · PARKED SKETCH · 2026-05-31
Mirrored from ~/.claude/.../memory/parked_sketch_audrey_4tb_photo_library_to_r2_2026-05-24.md. This is a design sketch parked for future build — read for context, not as a current deliverable.
Audrey’s photo library is a duplicates-of-duplicates mess scattered across Google Photos + iCloud + multiple Drive locations. Three architectural options sketched (R2-native + Worker · Docker Immich + cloudflared · flat files + rclone sync + read-only viewer). Initial recommendation: Option C (flat files + rclone), which matches the portfolio’s “flat files in git, hosting swappable” promise and graduates to B when face-search / on-this-day become missed features; the 2026-05-24 revision further down supersedes this and goes with B (Immich) directly. First reversible move: create R2 bucket + one rclone sync. ~$5-10/mo for moderate library, no hardware required for C.
Dan 2026-05-24: “Audrey has duplicate problem, copies-of-copies-of-copies in multiple google locations, plus icloud, it’s a troublesome mess. so I’m keen to go into sketch-mode.”
The problem
| Where photos live today | Status |
|---|---|
| Google Drive (multiple folders) | Duplicates-of-duplicates, no canonical version |
| Google Photos | Auto-uploaded copies overlapping with Drive |
| iCloud | Yet another copy plane |
| Local Mac | Where actual edits + imports happen |
The duplicate sprawl is the real pain — same photo lives in 3-5 places, no single source of truth, no clear “what’s been backed up” status, no programmatic dedupe path.
Three architectural options (from Claude-on-desktop transcript)
Option A — R2-native, no always-on hardware
- R2 bucket holds originals
- Tiny Cloudflare Worker provides API (upload, list, signed-URL fetch, thumbnail variants via Cloudflare Images)
- Pages app at `photos.gf.cx` is the UI
- CF Access policy on hostname keyed to Dan’s email; Worker validates JWT
- Bucket stays private; Worker is only reader
Trade-offs:
- ✓ Zero hardware, zero always-on
- ✓ ~$0.015/GB/month, zero egress · 500GB ≈ $7.50/mo
- ✗ No Immich-like app (no face recognition, no on-this-day, no search-by-content)
- ✗ Fine as working archive, weak as daily-driver Google Photos replacement
Option B — Docker box at home running Immich, behind cloudflared
- Mac mini / Pi 5 / corner of dev box
- Immich (dominant self-hosted Google Photos clone in 2026): face detection, iOS/Android apps with auto-upload, smart search
- Docker compose, very actively developed
- cloudflared exposes `immich.gf.cx`; CF Access in front
- Immich machine binds to localhost only; the only path in is the outbound tunnel
Trade-offs:
- ✓ Full Google-Photos-equivalent UX
- ✓ Mobile auto-upload, face search, smart albums
- ✗ Machine has to be on (electricity + maintenance)
- ✗ Immich keeps its own database mapping files → metadata; on-disk layout isn’t quite “just files in dated folders” (exportable, not transparent)
Option C — Flat files locally, rclone to R2, tiny read-only Pages viewer ⭐ RECOMMENDED
- Photos live in `~/Photos/2026/05/...` on Mac: just HEIC/JPG in dated folders
- `rclone sync ~/Photos r2:photos-gf-cx` runs hourly via launchd or Hazel rule
- Pages site at `photos.gf.cx` reads R2 (via signed-URL Worker), behind CF Access, browse-anywhere
- Claude Code does filesystem operations (`mv`, `cp`, `find`) directly because files are right there
Trade-offs:
- ✓ Matches the portfolio’s “flat files in git, hosting swappable” promise (architectural consistency)
- ✓ Mac is write surface (already is: that’s where import + edit happen)
- ✓ R2 is the durable copy
- ✓ Viewer is near-static read of what’s in R2
- ✓ Vendor-independent (rclone speaks B2, S3, Azure, local disk)
- ✓ Claude Code is best at filesystem ops
- ✓ ~$5-10/mo for moderate library, no hardware
- ✗ No face search, no on-this-day (graduate to B if those become missed features)
- ✗ Read-only from mobile (graduate to A when upload-from-mobile matters)
Recommendation — REVISED 2026-05-24 (after second pass on classification)
Go with B (Immich) directly. The previous “start with C, graduate later” recommendation is superseded.
The flip is driven by the classification problem that wasn’t fully accounted for in pass 1:
- Audrey’s 4TB library mixes client work, personal life, baby photos (10,000+ for first two years)
- Without face clustering + semantic search, the library is “vendor-portable” but unfindable
- The directory tree is movable; the index that makes it usable is not
- Once index volume crosses ~tens of thousands of photos, the flat-file promise quietly breaks
The compromise that preserves the portfolio’s “files-not-platform” promise:
| Layer | What |
|---|---|
| Working | Immich on a Mac mini at home (Docker) — face clusters, CLIP search, mobile auto-upload, dated/EXIF-correct flat files underneath |
| Durable mirror | rclone sync of Immich’s storage folder → R2 (tier 3) — same files Immich operates on, mirrored for vendor independence |
| Immutable origin | Takeout dump kept frozen in R2 Archive tier — the T=0 ground truth, never deleted |
If Immich disappeared tomorrow: you lose face-cluster + CLIP index, but the underlying organised flat files survive. Same trade as everything else in the portfolio — underlying record durable, index on top replaceable.
The Mac mini becomes the photographic equivalent of the dare.co.uk Worker: small, single-purpose, behind a CF Access tunnel, nothing public, files on disk underneath. Same architectural shape as the rest of the gf.cx stack.
Why C alone falls short
| Workflow | C (flat files) | B (Immich) |
|---|---|---|
| “All photos of Audrey + baby with grandma” | ✗ manual scroll | ✓ face cluster intersection |
| “Whiteboards from client meetings” | ✗ rgrep filenames + hope | ✓ CLIP semantic search |
| “Receipts from work travel 2024” | ✗ manual triage | ✓ CLIP “receipts” + date filter |
| “Photos of the broken sprayer for insurance” | ✗ remember when it broke | ✓ CLIP “sprayer” or face/place |
| Mobile auto-upload | ✗ doesn’t exist | ✓ standard Immich feature |
| Vendor independence | ✓ R2-mirrored flat files | ✓ same — rclone mirror of Immich storage |
C is still a fine FIRST step (the immutable origin lands in R2 regardless), but the working layer should be Immich from day one.
Maps directly onto the gf.cx tier diagram
The photo work is tier 3 + tier 4 territory in the payload.gf.cx framing:
| Tier | Role in photos work |
|---|---|
| Tier 1 (git) | Not used — too large for git |
| Tier 2 (payload.gf.cx public R2) | Not used — photos are private |
| Tier 3 (CF-Access-gated R2) | The destination: photos.gf.cx bucket, Worker-fronted, Access policy keyed to Dan’s email. Audrey + Dan have a browse-anywhere copy. No one else does. |
| Tier 4 (off-cloud) | The source of truth — Mac local ~/Photos/... + Time Machine + encrypted external. Stays authoritative even if R2 disappears. |
All 3 options (A / B / C) keep tier 4 intact. They differ in WHAT lives in tier 3:
- A: R2 bucket + minimal Worker API (no app, just access)
- B: Immich app running on Docker box, R2 as Immich’s backing store
- C: flat-file R2 mirror via rclone + read-only viewer
The “first reversible move” is purely a tier-3 setup — create the bucket, sync once, you have tier-3 today.
The common architecture across all three
“the CF Access tunnel pattern is doing the same trick in all three options: it lets the actual service run with zero internal auth, because the network path itself is the auth”
Cloudflare Access on the hostname = the auth. The photos app, the worker, the Immich instance — none of them know who Dan is. Access knows, that’s enough. This is the de-facto homelab shape in 2026.
The “first reversible move” (no-commitment first step)
- Create R2 bucket `photos-gf-cx`
- Generate R2 API token
- `rclone sync ~/Photos r2:photos-gf-cx` of whatever’s currently in the library
- Now there’s a vendor-independent backup TODAY, while you decide whether C is enough or you want Immich on top
Cost: essentially $0 for the first sync (R2 PUTs free under 1M/month). Storage cost ramps with size:
- 100 GB ≈ $1.50/mo
- 500 GB ≈ $7.50/mo
- 1 TB ≈ $15/mo
- 4 TB (Audrey’s full library scope) ≈ $60/mo
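The storage figures are straight multiplication at the ~$0.015/GB/month rate quoted in this sketch. A minimal Python sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB) and ignoring any free tier or per-operation charges; the rate should be checked against current pricing before relying on it:

```python
# Rate quoted in this sketch (assumption: still current; verify before budgeting).
R2_STANDARD_USD_PER_GB_MONTH = 0.015


def monthly_storage_cost(size_gb: float, rate: float = R2_STANDARD_USD_PER_GB_MONTH) -> float:
    """Monthly storage cost in USD; ignores free tier and operation charges."""
    return round(size_gb * rate, 2)


for label, size_gb in [("100 GB", 100), ("500 GB", 500), ("1 TB", 1000), ("4 TB", 4000)]:
    print(f"{label}: ${monthly_storage_cost(size_gb):.2f}/mo")
```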
Dedupe — the actual pain point (3-pass strategy)
| Pass | What | Auto-resolve? |
|---|---|---|
| 1. Exact byte-hash (SHA256) | Catches album-duplication + trivial copies. Run at ingest. | ✓ safe to auto-dedupe |
| 2. Perceptual hash (pHash / dHash) | Catches edited-vs-original, crops, same shot at different resolutions | ✗ surface for review — never auto-delete |
| 3. EXIF heuristics | Same camera + timestamp ± 2s + same dimensions = almost certainly same shot in different formats | ✗ surface for review |
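Pass 1 is the only pass that is safe to automate, and it is small enough to sketch. A minimal Python version, assuming the library is a plain directory tree; the function names are illustrative and this is not how Immich or rclone implement their own checks. It only reports groups and deletes nothing:

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def exact_duplicate_groups(root: Path) -> list[list[Path]]:
    """Return groups of files under root whose bytes are identical."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Each returned group is one photo with several copies; which copy to keep (and which album folders to turn into tags) is a separate decision.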
Tools:
- Immich does pass 1 + 2 natively at ingest + ongoing maintenance task
- czkawka (free, Rust, GUI + CLI) — standalone perceptual dedup; best-in-class image dedup including near-dupes
- rclone dedupe — built-in for the rclone-sync path; file-level hash only
- fdupes / jdupes — classic CLI hash dedup
- exiftool — manual surgery when automated tools don’t quite get it right
Pattern: hash-dedupe AT INGEST so we don’t import duplicates in the first place; perceptual-dedupe as a maintenance task once library is loaded; manual review of perceptual matches when in the mood.
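The EXIF heuristic (pass 3) is the kind of check that should only surface candidates for review. A minimal Python sketch, assuming the camera, timestamp, and dimensions have already been read into plain records (for example via exiftool); the `Shot` type and function name are hypothetical. Note the comparison is against the previous shot in a run, so a burst can chain beyond 2s overall:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Shot:
    path: str
    camera: str
    taken_at: datetime
    width: int
    height: int


def review_candidates(shots, tolerance=timedelta(seconds=2)):
    """Group shots with the same camera and dimensions taken within `tolerance` of each other."""
    ordered = sorted(shots, key=lambda s: (s.camera, s.width, s.height, s.taken_at))
    groups, current = [], []
    for shot in ordered:
        if current:
            prev = current[-1]
            same_key = (prev.camera, prev.width, prev.height) == (shot.camera, shot.width, shot.height)
            if same_key and shot.taken_at - prev.taken_at <= tolerance:
                current.append(shot)
                continue
            if len(current) > 1:
                groups.append(current)  # close the finished run before starting a new one
        current = [shot]
    if len(current) > 1:
        groups.append(current)
    return groups
```

The output is a review queue, not a delete list, which matches the "never auto-delete" rule in the table.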
Google Takeout — the five known sharp edges
Critical to know BEFORE running migration:
- Metadata isn’t in the photos. Each photo exports with a `*.json` sidecar carrying real date, GPS, description, album memberships, favorites flag. The EXIF inside the image is often stripped or wrong because Google modified it server-side. Tools must reconcile sidecar → image and write metadata back. Naive ingestion loses dates entirely.
- Filenames get truncated. Google chops filenames around 46 chars, and the JSON sidecar may use a different truncation. Pairing `long_filename_2023.jpg` to `long_filename_20.json` is non-trivial. Tools handle this with varying quality.
- Album duplication. A photo in 3 albums → exported 3 times in 3 folders. Without smart ingestion you triple your library. You usually want to preserve album structure but as tags, not duplicates.
- Live Photos split into .HEIC + .MOV (sometimes recombined, sometimes not). Edited versions exported alongside originals. Both inconsistent.
- Downloads are chunked. A 200GB library is ~4 × 50GB zips; Audrey’s 4TB would be ~80. Scripting the assembly is required. Takeout itself can take days to generate for a large account (Audrey’s 4TB → several days minimum).
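To make the truncation problem concrete, here is a deliberately naive pairing heuristic in Python: treat each sidecar name (minus `.json`) as a prefix and pick the longest one that matches the media filename. This is an illustration only, not how immich-go or google-photos-takeout-helper work, and it will mis-pair when two long filenames share the same truncated prefix:

```python
def pair_sidecars(media_names, sidecar_names):
    """Map each media filename to its best-guess sidecar, or None if nothing matches."""
    # Sidecar stem -> full sidecar filename, e.g. "IMG_0001.jpg" -> "IMG_0001.jpg.json".
    stems = {name[: -len(".json")]: name for name in sidecar_names if name.endswith(".json")}
    pairs = {}
    for media in media_names:
        candidates = [stem for stem in stems if stem and media.startswith(stem)]
        best = max(candidates, key=len, default=None)  # longest prefix wins
        pairs[media] = stems[best] if best is not None else None
    return pairs
```

Anything that comes back as `None`, or any sidecar claimed by more than one image, belongs in a manual-review pile rather than being guessed at.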
Migration tools — two clean paths
| Tool | When to use |
|---|---|
| immich-go | Community-built importer for Immich (not an official Immich-team tool) with dedicated Takeout handling: sidecar reconciliation + album mapping + hash dedupe + edited-version pairing. Cleanest path if Immich is the destination. |
| google-photos-takeout-helper (TheLastGimbus / GitHub) | Flat-file path: reconciles JSON back into EXIF, organises by date, dedupes within albums, outputs “real” files. Use if going flat-file route OR as a pre-step before Immich import. |
Both are well-maintained, both have edge-case bugs you’ll discover if the library has anything weird in it.
The 90-day guardrail
Don’t delete from Google for 90 days minimum after migration.
- Once a week, spot-check date ranges (oldest month, newest, a known-volume holiday) and verify counts
- If something’s missing, pull from Google before source goes away
- Google won’t delete on you, just keep charging — the value of being able to query the old source during verification is enormous
- People have stories about discovering missing batches three weeks in
Detecting what’s missing
| Check | How |
|---|---|
| Total counts | Google Storage panel = total media count. Takeout file count should match minus JSON sidecars. Discrepancy = export failure |
| Sample by date | Pick months from 2018 / 2022 / last year, count + compare to Google’s date browser |
| Sample by album | “Italy 2019” had 312 photos in Google? Count yours |
| Exhaustive verification | NOT easily possible — Google’s API isn’t designed for it. 90-day rule is the practical mitigation |
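The sampling checks reduce to comparing two sets of counts. A minimal Python sketch, assuming the Google-side numbers are read off by hand and the migrated-side numbers come from whatever the new library reports; the bucket labels and most counts in the usage example are made up:

```python
def count_discrepancies(expected, actual):
    """Return {bucket: (expected, actual)} for every bucket whose counts differ."""
    buckets = sorted(set(expected) | set(actual))
    return {
        bucket: (expected.get(bucket, 0), actual.get(bucket, 0))
        for bucket in buckets
        if expected.get(bucket, 0) != actual.get(bucket, 0)
    }


google_counts = {"2018-07": 412, "2022-12": 958, "Italy 2019": 312}
migrated_counts = {"2018-07": 412, "2022-12": 940, "Italy 2019": 312}
print(count_discrepancies(google_counts, migrated_counts))
```

An empty result for the sampled buckets is evidence, not proof; the 90-day window still applies.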
Source of truth — 3 states, not 1
The architectural clarification that matters most:
| State | What | Where |
|---|---|---|
| T=0 Takeout dump (frozen) | Ground truth on the day of export. Never touched. The “if I screwed up the import, I can re-derive everything” backstop. Keep for at least a year, ideally forever. | R2 Archive tier (~$0.0036/GB/mo) + external SSD in a drawer (belt-and-braces) |
| T=0+1 working library (going forward) | Source of truth from here on. Where new photos land, where edits + tags accrue. | Immich instance (in option B); the Mac filesystem (in option C) |
| R2 durable mirror | NOT the truth itself — a copy of working state. Continuous sync. | R2 standard storage; rclone or Immich → S3 backend |
Don’t conflate “I have a copy on R2” with “I have a working archive.” R2 backs up state; the working library is state.
For 4TB Audrey-scale:
- T=0 frozen in R2 Archive: ~$14/mo forever
- Working library + R2 mirror at standard: ~$60/mo
Classification — why Immich wins (the killer features)
| Feature | Why it earns its keep |
|---|---|
| Face clustering | Tag a cluster ONCE (“Wife” / “Client X” / “Baby”) → 8,000 photos instantly searchable. For baby photos this is enormous — first two years = 10,000+ photos, manual tagging impossible |
| CLIP semantic search | “Whiteboards” finds every client meeting whiteboard → instant client-work segregation. “Beach” finds vacations. “Receipts” finds receipts. Free text query, no manual tagging required |
| Folder-based at ingest | Google albums “Client — Acme” / “Personal — 2024” become Immich albums automatically via immich-go |
| Geographic | Photos at client addresses during business hours = client. Photos at home = personal. Crude but useful for first-pass triage |
Realistic workflow: face-cluster first (one evening → ~70% of human-photo classification). Album-preserve at ingest (another ~15%). Manual tagging for the rest, ambient as you encounter photos during normal browsing. Don’t try to do it all at once.
Can it run on RAID inside Docker? (Dan’s question)
Yes — standard pattern. Immich + Docker Compose + RAID is the dominant Immich-at-home shape.
- Immich runs as a Docker Compose stack (web + ML + database + storage)
- The `UPLOAD_LOCATION` env var points at any host path you mount in: a RAID array, NAS over NFS/SMB, or just plain disk
- RAID gives you intra-tier-4 redundancy on the working set: if one drive in the mirror fails, no data loss + no downtime
- Time Machine on the Mac hosting Docker gives a separate snapshot layer
- R2 mirror gives off-site copy
Typical home setup:
- Mac mini (or Synology NAS, or Linux box) with 2× large drives in RAID 1 mirror, mounted as /srv/photos
- Docker Compose stack with Immich pointing UPLOAD_LOCATION=/srv/photos
- cloudflared tunneling immich.gf.cx → localhost:2283 (Immich’s port)
- CF Access policy on the hostname
If the Mac mini host dies: the RAID drives can be physically moved to a replacement host, Docker stack comes back up reading from same paths, no data loss. If a single drive fails: hot-swap, RAID rebuilds.
Dan’s layer model (2026-05-24, mid-sketch refinement)
“Layer 0 could involve a lot of batching and building independent library that can live on photos.gf.cx, and we serve behind CF Access”
| Layer | What | Where | Auth |
|---|---|---|---|
| Layer 0: infrastructure | RAID array + Docker host + cloudflared tunnel | Mac mini @ home | physical access |
| Layer 1: application + library | Immich app + working library (the indexed substrate, where face clusters / CLIP / tags / albums live) | Container reaching the RAID + photos.gf.cx hostname served via tunnel | CF Access keyed to Dan + Audrey |
| Layer 2: durable mirror | rclone sync to R2 standard bucket | R2 region us-east | bucket private; Worker-fronted if browseable |
| Layer 3: immutable origin | T=0 Takeout dump frozen | R2 Archive tier | private bucket, manual retrieval |
The CF Access boundary at Layer 1 means photos.gf.cx is the only public hostname: the Docker host’s internal network, the RAID mounts, the Postgres metadata DB are all unreachable except via the tunnel. Same architectural promise as claim.gf.cx (signed-content / PII-bearing surfaces) and ask-opus.gf.cx.
It IS a massive undertaking — Dan’s framing is correct. The smallest first useful step that doesn’t commit to the whole stack:
- Today / this week: Trigger Google Takeout export (takes Google days to generate). No infrastructure decisions yet.
- Within the 90-day window: Set up Mac mini + RAID + Docker + Immich locally (one weekend). Ingest just one year of photos via `immich-go` as a feel-test.
- After feel-test passes: Bind `photos.gf.cx` via the same recipe we used for `amazon-evidence.gf.cx` + CF Access policy.
- After photos.gf.cx is live + happy: Backfill remaining ~3.9 TB. Set up rclone → R2 sync (Layer 2).
- After Layer 2 is verified: Begin deletion from Google (one folder at a time, never the whole library at once).
Each step is reversible. Each step delivers a working capability. The full migration spans probably 2-3 months of evenings + weekends, not a single sprint.
Multi-axis addressability — sub-subdomains as smart views (Dan 2026-05-24)
“This is where audrey.photos.gf.cx — and dan.photos.gf.cx comes into play, as you can say, work.photos.gf.cx/2010, or school.photos.gf.cx/1995, which are tailored and smart”
The photo library isn’t a single surface — it’s a substrate with N addressable views, each pre-filtered to a persona, category, or era:
| URL shape | Filter applied at edge | Use |
|---|---|---|
| `photos.gf.cx/` | unfiltered (default Immich view) | full browse |
| `audrey.photos.gf.cx/` | face-cluster: Audrey | “everything of Audrey” |
| `dan.photos.gf.cx/` | face-cluster: Dan | “everything of Dan” |
| `baby.photos.gf.cx/` | face-cluster: baby | the 10K-photo cohort |
| `work.photos.gf.cx/2010` | album-tag: work + year: 2010 | client-work photos from a specific year |
| `school.photos.gf.cx/1995` | album-tag: school + year: 1995 | era-specific recall |
| `wedding.photos.gf.cx/` | album: wedding | guest-share-friendly |
| `claim-evidence.photos.gf.cx/` | tag: insurance-claim | tier-3-gated subset for claims work |
Each URL is a shareable, bookmarkable, tailored view of the same substrate. Audrey doesn’t navigate to “the photos site, click filters” — she goes to audrey.photos.gf.cx and is already there. Dan texts a tenant “check bathroom.photos.gf.cx” for a specific damage view.
This tips the ACM ($10/mo) decision
Earlier in the session (feedback_cf_pages_subdomain_setup_recipe.md) we noted: ACM becomes worth it when you have 8+ sub-subdomain candidates. The sketch above lists 8 immediately and implies many more (bathroom, kitchen, 2008, vacation, food, etc.). Photos library is the use case that justifies enabling ACM for the gf.cx zone.
Two implementation paths
Path A — one Immich, filter at the Worker edge (RECOMMENDED):
- One Immich instance running on the Mac mini + RAID + Docker
- All photos in one library with face clusters + albums + tags
- Each `<persona>.photos.gf.cx` is a tiny Cloudflare Worker that hits Immich’s REST API with a pre-baked filter (face cluster ID / album ID / tag)
- Worker URL pattern: `<persona>.photos.gf.cx/*` → Worker translates request → fetches from Immich behind cloudflared tunnel → returns pre-filtered HTML/JSON
- Cheap to spin up new views: deploy another Worker route in ~5 min
- Cross-references work naturally: a photo of Audrey + Dan together shows on both `audrey.` AND `dan.` subdomains
Path B — multiple Immich instances:
- One Immich per persona/category
- Heavier (multiple Postgres instances, multiple ML model copies, multiple Docker stacks)
- Watertight isolation if persona-level privacy ever mattered
- Probably overkill for a married couple sharing a library; needed only if guest-share without leaking adjacent content becomes important
Recommendation: Path A. Same architectural promise as the existing tier-3 surfaces — one substrate, many views, Workers filter at the edge. Single source of truth, many smart entry points.
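The core of Path A is a lookup from hostname (plus optional year in the path) to a pre-baked filter. A minimal sketch of just that lookup, written in Python for readability; a real Worker would be JavaScript/TypeScript, and the translation of the filter into Immich API calls is deliberately left out because those endpoints are not pinned down in this sketch. The `VIEWS` keys mirror the URL table; the filter field names are illustrative:

```python
BASE_HOST = "photos.gf.cx"

# Sub-subdomain label -> pre-baked filter (field names are hypothetical).
VIEWS = {
    "audrey": {"face": "Audrey"},
    "dan": {"face": "Dan"},
    "baby": {"face": "baby"},
    "work": {"album_tag": "work"},
    "school": {"album_tag": "school"},
    "wedding": {"album": "wedding"},
    "claim-evidence": {"tag": "insurance-claim"},
}


def resolve_filter(host: str, path: str = "/"):
    """Return the filter dict for a request, or None for an unknown hostname."""
    host = host.lower()
    if host == BASE_HOST:
        filt = {}  # unfiltered default view
    elif host.endswith("." + BASE_HOST):
        label = host[: -len("." + BASE_HOST)]
        if label not in VIEWS:
            return None
        filt = dict(VIEWS[label])  # copy so callers cannot mutate the table
    else:
        return None
    first_segment = path.strip("/").split("/")[0]
    if first_segment.isdigit() and len(first_segment) == 4:
        filt["year"] = int(first_segment)  # e.g. /2010 narrows to one year
    return filt
```

For example, `resolve_filter("work.photos.gf.cx", "/2010")` yields the work tag plus `year: 2010`, and adding a new view is one more entry in `VIEWS`.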
Public OR private per hostname — same architecture, different policy
Dan 2026-05-24: “I can managed these sets, at it will always load super-fast, edge-cache, protect behind CF access, or publically available for some images.”
The architecture supports both CF-Access-gated (private) AND fully-public hostnames, identically, with one config flag:
| Hostname | Filter | Access policy | Use |
|---|---|---|---|
| `audrey.photos.gf.cx` | face: Audrey | CF Access (Dan + Audrey emails) | private daily view |
| `dan.photos.gf.cx` | face: Dan | CF Access (Dan + Audrey emails) | private daily view |
| `baby.photos.gf.cx` | face: baby | CF Access (Dan + Audrey + immediate family) | semi-private |
| `wedding.photos.gf.cx` | album: wedding | CF Access (guest list email allowlist) | broader semi-private |
| `claim-evidence.photos.gf.cx` | tag: insurance | CF Access (Dan + temp adjuster grant) | tightly-scoped private |
| `portfolio.photos.gf.cx` | tag: portfolio-public | NONE (fully public) | audreyinc photography portfolio |
| `press.photos.gf.cx` | album: press-kit | NONE (fully public) | brand press kit for Audrey’s work |
| `landscape-art.photos.gf.cx` | tag: showcase-landscape | NONE (fully public) | exhibition showcase |
Adding a public-facing photo surface is the same 5-minute deploy as adding a private one: just skip the CF Access policy add. The Worker reads the same R2 mirror; the filter logic is the same shape.
This means the architecture supports BOTH halves of a creative person’s photo life:
- Private: family, baby, household inventory, claim evidence
- Public: portfolio work, press, brand assets, exhibition showcases
…from ONE library, ONE source of truth, ONE substrate. The public/private cut is a deploy-time decision per hostname, not a storage-time decision per file.
CF Access policies can be per-subdomain (private side detail)
The ACM cert covers the whole second-level zone, but each sub-subdomain still gets its own CF Access policy:
- audrey.photos.gf.cx — gates to Audrey’s email + Dan’s
- dan.photos.gf.cx — gates to Dan’s email + Audrey’s
- baby.photos.gf.cx — gates to Dan + Audrey + immediate-family allowlist
- wedding.photos.gf.cx — gates to a broader guest list (or maybe a service token + share-link)
- claim-evidence.photos.gf.cx — gates to Dan + the insurance adjuster’s email (temp)
The CF Access boundary becomes the share-control plane, sub-subdomain by sub-subdomain.
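For reasoning about who can see what, the per-hostname policies can be modelled as plain data. A minimal Python sketch, with the caveat that in the real setup Cloudflare Access enforces this at the edge and no application code makes the decision; the email addresses are placeholders and only a few hostnames are shown:

```python
from typing import Optional

# Hostname -> allowlisted emails; None means no Access policy (fully public).
POLICIES: dict[str, Optional[set[str]]] = {
    "audrey.photos.gf.cx": {"dan@example.com", "audrey@example.com"},
    "dan.photos.gf.cx": {"dan@example.com", "audrey@example.com"},
    "claim-evidence.photos.gf.cx": {"dan@example.com", "adjuster@example.com"},
    "portfolio.photos.gf.cx": None,
}


def may_view(host: str, email: Optional[str]) -> bool:
    """Model of the share-control plane: unknown hostnames fail closed."""
    if host not in POLICIES:
        return False
    allowlist = POLICIES[host]
    if allowlist is None:
        return True  # public hostname, no identity required
    return email is not None and email.lower() in allowlist
```

Revoking the adjuster’s temporary grant is then a one-entry change to a single hostname, with every other view untouched.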
Adds to the layer model
Updated Layer 1 row in the table above:
| Layer | What |
|---|---|
| Layer 1: application + library | Immich (single instance) on Mac mini + RAID + Docker; many small Worker filters per sub-subdomain at *.photos.gf.cx (Path A); CF Access policy per hostname; ACM enabled for the gf.cx zone to cover the sub-subdomain SSL |