Project write-up · 2026

ShelfSync: collapsing a balkanized book wishlist into one source of truth

My want-to-read list was scattered across Goodreads and Bookshop, with nothing at all on ThriftBooks. I wanted one canonical list, my Goodreads to-read shelf, and everything else to follow it automatically. This is how I built the mirror that does it, and how I taught it to stop pinging me about nothing.

Python 3.12 · Browser automation · Web scraping · Entity resolution · Docker + systemd · Idempotent sync
~2.3k lines of Python · 3 runtime deps · 28 tests · 2 sinks · 275 works cached · 33 titles mirrored

29 commits over seven days. Numbers from the live deployment.

01 The problem: three wishlists, no truth

Like a lot of readers, I'd let my "want to read" list sprawl. Some titles lived on Goodreads, where I actually track reading. A different, overlapping set sat on a Bookshop.org wishlist, because that's where I buy new from independent stores. And ThriftBooks — where I buy used — had nothing at all, even though it's often the cheapest way to get a book I want.

The result was the worst kind of state-management problem: three partial copies of the same intent, none authoritative, all drifting. I'd add a book in one place and forget it everywhere else. I'd buy something and leave its ghost on two other lists. There was no single answer to the question "what do I want to read?"

I wanted to level the landscape: pick one source of truth, my Goodreads to-read shelf, and have the others mirror it automatically. Add a book on Goodreads, and it appears on Bookshop and ThriftBooks. Finish or abandon one, and it quietly disappears from both. I never touch the downstream lists by hand again.

The deceptively hard part: mirroring isn't copying. The same book is a different object on every site: different IDs, different ISBNs (hardcover vs. paperback vs. ebook), sometimes a different title string. To know that "the book on Goodreads" and "the listing on ThriftBooks" are the same work, you need a shared notion of identity that none of these services agree on.

02 The solution: a one-way mirror with a shared identity layer

ShelfSync is a small Python service that runs once a day per destination. Each run reads the canonical shelf, reads the destination list, resolves every book on both sides to a shared identity, diffs the two sets, and then makes the destination match the source — adding what's missing, removing what's orphaned. Then it reports what it did.

The design rests on three deliberate choices:

[Diagram: the Goodreads to-read shelf feeds the ShelfSync core over RSS. The core (resolve → diff → reconcile; matcher, caps, guards) resolves and caches through the Open Library work-key resolver with its disk cache, reads and writes the ledgers (added, removed, not-found), drives the Bookshop sink and the ThriftBooks sink (search, add, remove) against bookshop.org (Cloudflare) and thriftbooks.com (login-gated), and sends change-only reports to Discord.]
The whole system on one page. The core never knows which site it's talking to — that's the sink's job — and never trusts an ISBN to decide whether two books are the same.

03 A day in the life of a sync run

Every morning, two systemd timers fire twenty minutes apart — one per destination, staggered so two headed-Chromium runs don't fight over the machine. A single run looks like this:

  1. 07:30 local · Timer fires. A systemd timer triggers the sync for one sink. Persistent=true means a missed run (laptop asleep, VPS rebooted) catches up on next boot instead of silently skipping a day.
  2. 🐳 +0s · Container starts. A Docker container boots under a virtual X display. The browser loads a persistent profile, keeping the hard-won Cloudflare clearance and login session warm between days, so most runs sail straight through.
  3. 📥 read · Read both shelves. Goodreads comes from its public RSS feed (no API, no credentials). The destination wishlist is scraped from the live site. The result is two lists of raw, messy book records.
  4. 🧬 resolve · Resolve to identity. Every book on both sides is resolved to an Open Library work key: by ISBN if we have one, else by a title+author lookup. Results are cached to disk, so the 275-work catalog mostly answers from the cache.
  5. ↔︎ diff · Diff the sets. Compare work-key sets: what's on Goodreads but missing downstream (to add), what's downstream but no longer on Goodreads (to remove), and what's already in sync.
  6. add · Add phase. For each missing book, search the destination site, match the right listing (the genuinely hard step; see §6), and add up to a per-run cap. Everything added is written to a ledger.
  7. remove · Remove phase. Drop orphaned entries, but only after two guards veto the known false positives. Removes run under a smaller cap because deletion is the destructive direction.
  8. 🔔 notify · Decide whether to speak. If the run changed something, hit an error, or found a newly-unavailable book, it posts a Discord report. If nothing happened, it writes the report to disk and stays silent. (This is §8, the part this whole post builds toward.)
  9. 🔁 07:50 · Do it again for the next sink. Twenty minutes later, the second timer runs the identical pipeline for the other destination, with its own isolated state volume.
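Once both sides are keyed by work, the diff in step 5 is plain set arithmetic. A minimal sketch, assuming each shelf has already been resolved to a mapping of work key to title; the names `SyncPlan` and `diff_shelves` are illustrative and not taken from ShelfSync itself:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    to_add: list[str]     # on Goodreads, missing downstream
    to_remove: list[str]  # downstream, no longer on Goodreads
    in_sync: list[str]    # present on both sides


def diff_shelves(source: dict[str, str], dest: dict[str, str]) -> SyncPlan:
    """Compare two shelves keyed by Open Library work key."""
    src, dst = set(source), set(dest)
    return SyncPlan(
        to_add=sorted(src - dst),
        to_remove=sorted(dst - src),
        in_sync=sorted(src & dst),
    )


# Hypothetical work keys, for illustration only.
goodreads = {"/works/OL1W": "Piranesi", "/works/OL2W": "The Master and Margarita"}
bookshop = {"/works/OL2W": "The Master and Margarita", "/works/OL3W": "A finished book"}
plan = diff_shelves(goodreads, bookshop)
```

Sorting the results keeps runs deterministic, which matters once per-run caps decide which books are handled today and which wait for tomorrow.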

04 Why ISBNs can't be the key

The naïve mirror matches books by ISBN. It breaks immediately. The hardcover I shelved on Goodreads, the paperback Bookshop stocks, and the used copy on ThriftBooks are three editions with three different ISBNs of one work. Match on ISBN and the same book looks like three different books — so the mirror endlessly "adds" a copy that's already there in a different binding, and "removes" the one it just added.

Open Library models this correctly: editions roll up to a work, identified by a key like /works/OL12345W. ShelfSync resolves every book to its work key and does all matching there. ISBNs are demoted to mere hints for finding the work; once found, they're irrelevant to whether two records are "the same book."
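The lookup order can be sketched as follows. This is an illustration under assumptions, not ShelfSync's actual resolver: the function name, the cache layout, and the injected `fetch_json` callable are hypothetical. The two URLs are Open Library's public edition-by-ISBN and search endpoints.

```python
import json
from pathlib import Path
from typing import Callable, Optional
from urllib.parse import urlencode

OL = "https://openlibrary.org"


def resolve_work_key(
    book: dict,
    fetch_json: Callable[[str], Optional[dict]],
    cache_path: Path,
) -> Optional[str]:
    """Return a key like /works/OL12345W, or None if nothing resolves."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    cache_id = book.get("isbn") or f"{book['title']}|{book['author']}".lower()
    if cache_id in cache:
        return cache[cache_id]

    key = None
    if book.get("isbn"):
        # The ISBN is only a hint: it finds an edition, which points at its work.
        edition = fetch_json(f"{OL}/isbn/{book['isbn']}.json") or {}
        works = edition.get("works") or []
        key = works[0]["key"] if works else None
    if key is None:
        # Fall back to a title+author search; each search doc carries a work key.
        query = urlencode({"title": book["title"], "author": book["author"], "limit": 1})
        docs = (fetch_json(f"{OL}/search.json?{query}") or {}).get("docs") or []
        key = docs[0]["key"] if docs else None

    if key is not None:
        cache[cache_id] = key
        cache_path.write_text(json.dumps(cache, indent=2))
    return key
```

Passing the fetcher in keeps the sketch testable without network access. In a real run it could wrap something like `httpx.get(url, follow_redirects=True).json()`, which is again an assumption about wiring rather than a description of the project.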

This one decision quietly fixes a whole class of bugs and creates a subtler one. Open Library's resolver isn't perfectly stable: the same title can occasionally resolve to a slightly different work key over time ("work-key drift"). A book the tool added yesterday can look both missing and orphaned today. The fix for that is the ledgers (§7), the system's memory of what it has already done.

05 Reading sites that don't want to be read

Neither destination offers a usable API for this, so the sinks drive a real browser. Each site fights back differently, and each needed a different answer:

Bookshop: behind Cloudflare

Bookshop sits behind Cloudflare's bot defenses, which detect and block ordinary automated browsers. The sink uses a hardened build of the automation framework that closes the specific fingerprinting leak Cloudflare keys on, paired with a persistent browser profile so the clearance cookie survives between runs. Most mornings the run never sees a challenge at all — it reuses yesterday's trust.

ThriftBooks: reading is open, writing is gated

ThriftBooks lets you browse and search freely, but acting on a wishlist requires an authenticated session. The sink keeps that session alive in its own persistent profile and reuses it day to day, isolated from the Bookshop profile so the two destinations never share state or step on each other.

A note on responsibility: this is a personal-scale tool (one user, a few dozen books, two runs a day), well within what a human browsing the same sites would do. It keeps secrets out of the container, runs read-only by default, and exists to automate my own account, not to scrape anyone at scale.

06 The matcher: the part that earns its tests

Resolving a Goodreads book to a work key is one thing; finding the right listing for that work on a specific store is another. A store search for "Johnson" returns dozens of books by different Johnsons. Picking wrong is worse than picking nothing (it means buying the wrong book), so the matcher is deliberately conservative and is where most of the test suite lives.

It tokenizes title and author, anchors on the author's surname, and refuses a match on weak signals. Over development it grew a set of explicit guards against real failures I watched it make:

A concrete win: one pass over the live ThriftBooks search recovered 5 of 8 books the matcher had been wrongly reporting as "no product found." They were stocked all along; the search-result matching was just too strict in some cases and too loose in others. Each fix landed with a regression test so it stays fixed.
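To make the conservative stance concrete, here is a small sketch of a surname-anchored matcher. It assumes each search result is a dict with `title` and `author` fields. The stopword list, the Jaccard score, and the 0.8 threshold are illustrative choices, not the project's actual guards.

```python
import re
import unicodedata
from typing import Optional

STOPWORDS = {"a", "an", "the", "of", "and"}


def tokens(text: str) -> set[str]:
    """Lowercase, strip accents and punctuation, drop stopwords."""
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return set(re.findall(r"[a-z0-9]+", folded.lower())) - STOPWORDS


def pick_listing(title: str, author: str, listings: list[dict],
                 min_score: float = 0.8) -> Optional[dict]:
    """Return the best confident match, or None when the signal is weak."""
    want = tokens(title)
    name_parts = author.split()
    surname = tokens(name_parts[-1]) if name_parts else set()
    if not want or not surname:
        return None  # nothing to anchor on, so refuse to guess

    best, best_score = None, 0.0
    for item in listings:
        # Author anchor: a listing without the surname is never a match.
        if not surname <= tokens(item["author"]):
            continue
        have = tokens(item["title"])
        # Jaccard similarity penalizes both missing and extra title words.
        score = len(want & have) / len(want | have)
        if score >= min_score and score > best_score:
            best, best_score = item, score
    return best
```

Returning None is the safe default here: an unmatched book can be retried on a later run, while a wrong match lands the wrong book on the wishlist.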

07 Writing carefully: dry-runs, caps, and ledgers

Adding to a wishlist is forgiving. Removing is not. So writes are wrapped in layers of caution:

The system's memory is three small JSON ledgers, each keyed by work key:

added.json
Every book the tool added (or found already there). Stops work-key drift from re-adding a different edition every run, and protects those books from being removed.
removed.json
Every entry the tool deleted, for traceability and undo.
notfound.json
Books that resolve but have no stocked edition — so a book that can't be bought is reported once, not every single day. (§8)

Removes get two extra hard guards on top of the ledger: a book whose ISBN the tool added itself is never removed (work-key drift makes it look orphaned), and a downstream entry whose title strongly overlaps a Goodreads book that failed to resolve is kept — because the orphan is an artifact of the resolver, not a real deletion. Both guards fire on the same handful of false positives every run, so the report collapses them to a single count line instead of relisting them.
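The two remove guards and the cap can be sketched like this. The entry shape, the word-overlap measure, and the default cap of 3 are assumptions made for illustration; the write-up does not state the real thresholds.

```python
import re


def _words(title: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def plan_removals(
    orphans: list[dict],
    added_isbns: set[str],
    unresolved_titles: list[str],
    cap: int = 3,
    min_overlap: float = 0.8,
) -> tuple[list[dict], int]:
    """Return (entries safe to remove, count of entries kept by a guard)."""
    removable, guarded = [], 0
    for entry in orphans:
        # Guard 1: the tool added this ISBN itself, so the orphan is work-key drift.
        if entry.get("isbn") in added_isbns:
            guarded += 1
            continue
        # Guard 2: the title strongly overlaps a Goodreads book that failed to resolve.
        words = _words(entry["title"])
        overlaps = any(
            len(words & _words(t)) / len(words) >= min_overlap
            for t in unresolved_titles
        ) if words else False
        if overlaps:
            guarded += 1
            continue
        removable.append(entry)
    # Deletion is the destructive direction, so at most `cap` removes per run.
    return removable[:cap], guarded
```

The second return value is what lets a report show a single "kept by guard" count line instead of relisting the same false positives every day.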

08 Teaching it to stop talking when there's nothing to say

Here's the arc that prompted writing this all down. A daily mirror that's working correctly does, most days, nothing — the lists are already in sync. But every morning it dutifully posted a report anyway, two messages per destination:

▸ BEFORE — four empty reports a day

📚
ShelfSync APP 07:30
📚➕ Bookshop wishlist add
✅ Added — 0
(none)
📚
ShelfSync APP 07:30
📚➖ Bookshop wishlist remove
🗑️ Removed — 0
(none)
… and the same two again at 07:50 for ThriftBooks. Every. Single. Day.

▸ AFTER — silence, unless there's news

— a synced morning: nothing posted. The full report is written to disk for the record, but Discord stays quiet. —
📚
ShelfSync APP · on a day that mattered
📚➕ Bookshop wishlist add
✅ Added — 2
• Piranesi — Susanna Clarke
• The Master and Margarita — Bulgakov
🆕 New — no product found — 1
• A wanted book that isn't stocked yet

The fix has two parts. The first is a notify-on-change-only policy: a run posts to Discord only if it actually added or removed something, or hit an error. No-op runs still write their report to disk — the audit trail is untouched — they just don't ping me. Failure alerts always post; going quiet must never mean going dark on a broken run.
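A minimal sketch of that policy, assuming the run report is a dict with `added`, `removed`, and `errors` lists and that `post` is whatever callable delivers a message to Discord. Both are hypothetical shapes chosen for the example.

```python
import json
from pathlib import Path
from typing import Callable


def finish_run(report: dict, report_path: Path, post: Callable[[str], None]) -> bool:
    """Always persist the report; post only when there is news. Returns True if posted."""
    # The audit trail is written unconditionally, even for no-op runs.
    report_path.write_text(json.dumps(report, indent=2))

    changed = bool(report.get("added") or report.get("removed"))
    failed = bool(report.get("errors"))
    if not (changed or failed):
        return False  # synced morning: stay quiet

    summary = (
        f"Added: {len(report.get('added', []))}, "
        f"removed: {len(report.get('removed', []))}, "
        f"errors: {len(report.get('errors', []))}"
    )
    post(summary)
    return True
```

Writing to disk before deciding whether to post means a failure in the Discord step can never cost the on-disk record.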

The second part is subtler, and it's the detail I'm proudest of. Some books resolve cleanly but simply aren't stocked anywhere on the store. The diff re-offers them every single run, forever. Suppressing them entirely would hide a real signal ("a book you want is now unavailable"); reporting them daily is exactly the noise I was trying to kill.

So the notfound.json ledger turns it into a true day-over-day diff: a newly-unavailable book is reported once, the morning it first appears. After that it's silent. And if it later comes back into stock and gets added — or I drop it from Goodreads — it falls out of the ledger, so it would alert again if it ever went unavailable in the future. Dry-runs preview this without persisting, so a manual test can't silently consume the one real alert the scheduled run owes me.
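The day-over-day behaviour can be sketched as a pure function over two sets of work keys. This is a simplification under assumptions: the real ledger is a JSON file, and the function name and signature here are invented for the example.

```python
def diff_notfound(
    ledger: set[str],
    unavailable_now: set[str],
    dry_run: bool = False,
) -> tuple[list[str], set[str]]:
    """Return (newly unavailable keys to report, ledger to persist)."""
    # Only books that were not already in the ledger count as news.
    newly_unavailable = sorted(unavailable_now - ledger)
    if dry_run:
        # Preview only: leave the ledger untouched so the scheduled run still alerts.
        return newly_unavailable, set(ledger)
    # Books that were added or dropped from Goodreads are no longer in
    # `unavailable_now`, so they fall out and can alert again later.
    return newly_unavailable, set(unavailable_now)


ledger: set[str] = set()
day1_news, ledger = diff_notfound(ledger, {"/works/OL7W"})  # first sighting
day2_news, ledger = diff_notfound(ledger, {"/works/OL7W"})  # same state, no news
```

On the first day `day1_news` holds the one key; on the second it is empty, which is exactly the "report once, then stay silent" behaviour described above.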

The principle: a good automated system reports changes, not state. The moment it starts narrating its own uneventful heartbeat, you stop reading it, and then you miss the one message that mattered. Silence is a feature.

09 Deployment: boring on purpose

The whole thing ships as one Docker image and runs from systemd timers — no orchestration, no always-on process, nothing to babysit. Each destination is a templated service instance with its own data volume, so their browser profiles, caches, and ledgers never mix. Two safety nets guarantee a scheduled run never fails silently:

Runs are also self-housekeeping: old per-run logs and exports are pruned automatically, while working state (the catalog cache, browser profiles, ledgers) is left alone.

10 The stack, deliberately small

Python 3.12
The whole service. ~2,300 lines across 16 modules, standard-library-first.
Hardened Playwright build
Headed Chromium that survives Cloudflare's bot detection, for the sites with no API.
httpx
Lightweight HTTP for Goodreads RSS, Open Library lookups, and Discord webhooks.
Open Library
The shared identity layer — works and editions — that makes cross-site matching possible.
Docker + systemd
One image, templated per-sink services, daily timers with catch-up. No daemon.
JSON ledgers
Three flat files are the entire persistence layer. No database; none needed.

Three runtime dependencies. The constraint was the point: a tool I run unattended every day should have as little surface area to rot as I can manage.

11 What I'd take to the next thing