When we started Bugspot, we wrote our own job queue. The reasoning seemed sound at the time: we already had Postgres, we didn’t want to run Redis, and the surface area of “store a row, lock it, run it, mark it done” looked like a weekend’s work.
It was. The weekend’s work shipped, ran fine for months, and then started costing us real time.
Job execution was never the problem. What broke first were the things we’d quietly skipped:
- Retry semantics. Our retry was “try again in N minutes.” Real retries need exponential backoff with jitter, plus a way to tell “transient” from “permanent” failures. We accreted that, badly.
- Observability. Watching a job get stuck meant writing an ad-hoc SQL query. There was no inbox, no failed-jobs view, no histogram of run times. We kept meaning to build it.
- Leader election. Periodic jobs — clean up sessions, expire trial accounts — need exactly one runner. Our answer was a hand-rolled advisory lock that worked until the day a rolling deploy left two workers thinking they were both leader.
- Backpressure. A single misbehaving job type could fill the queue and starve everything else. Per-queue concurrency limits exist precisely to prevent this. We never built them.
We swapped to River last month. River is Postgres-native — same database, no new infra — and it ships all of the above. The migration was a couple of days of work, mostly translating job definitions and re-pointing the worker process.
What we got back: a UI for failed jobs, exponential backoff that we didn’t write, leader election that just works, and a couple thousand lines of homegrown queue code deleted. The worker runbook went from three pages of “if X happens, run this query” to half a page of “open River’s dashboard.”
The lesson isn’t that custom queues are wrong. It’s that “we already have Postgres” doesn’t mean the queue is free — the queue is the easy part, and the surrounding infrastructure is most of the cost.