
Production win

Anti-Overbooking Consistency Layer

Double-booking: 10/week → 1/week

B2B SaaS · Platform Performance / Reliability

Architecture diagram: Booking consistency layer

The booking API serializes overlapping time-window writes through Redis locks before committing durable booking truth to PostgreSQL.

  • Booking user > Booking API: request a slot of arbitrary duration.
  • Booking API > Redis lock: run overlap detection and validation, then acquire the overlap lock.
  • Redis lock > PostgreSQL: coordinate the conflict space and commit only if the window is clear.
  • PostgreSQL > Idempotency: persist the durable booking truth and return a versioned result.
  • Idempotency > Confirmed booking: apply safe-retry and version checks, then return a safe response.

Result: double bookings reduced from 10/week to 1/week.

  • Locks were scoped to overlapping windows, not just fixed slot IDs.
  • Optimistic checks protected the database commit path.
  • The API favored consistency over accepting every concurrent request.
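The second bullet says optimistic checks protected the database commit path. The write-up does not say how versions were stored, so what follows is a minimal sketch under an assumption. Each bookable resource row carries a hypothetical `version` counter. A booking commits only if that counter still matches the value the writer read before it validated the window. `sqlite3` stands in for PostgreSQL so the snippet runs anywhere. `make_db`, `commit_booking` and the table layout are illustrative, not the original schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # Hypothetical schema: one versioned row per resource, plus the bookings.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, version INTEGER NOT NULL)")
    conn.execute(
        "CREATE TABLE bookings (id INTEGER PRIMARY KEY, resource_id INTEGER, "
        "starts_at INTEGER, ends_at INTEGER)"
    )
    conn.execute("INSERT INTO resources (id, version) VALUES (1, 0)")
    conn.commit()
    return conn


def commit_booking(conn, resource_id, starts_at, ends_at, expected_version) -> bool:
    """Insert the booking only if nobody bumped the resource version since we read it."""
    with conn:  # one transaction: commit on normal exit, roll back on exception
        cur = conn.execute(
            "UPDATE resources SET version = version + 1 WHERE id = ? AND version = ?",
            (resource_id, expected_version),
        )
        if cur.rowcount == 0:
            return False  # stale read: another writer committed first
        conn.execute(
            "INSERT INTO bookings (resource_id, starts_at, ends_at) VALUES (?, ?, ?)",
            (resource_id, starts_at, ends_at),
        )
    return True
```

Suppose the conditional UPDATE touches zero rows. That means another writer committed between this writer's read and its commit, so any overlap check the caller ran against that read is stale and the booking is refused. In Rails, ActiveRecord's `lock_version` column gives comparable behaviour by raising `ActiveRecord::StaleObjectError`. The write-up does not say whether this system used it.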

Booking gets tricky very quickly when you are not dealing with neat 30-minute slots and static inventory. Bookings could span arbitrary durations, so the overlap logic got messy. Double bookings were happening too often, and users would see a slot as available only to lose it under contention. At that point the issue is not just backend correctness; it is trust. If people think the platform cannot hold a booking under load, they start retrying, hesitating, or abandoning the flow.
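With arbitrary durations, comparing slot IDs misses real collisions. A 9:00 to 10:30 booking and a 10:00 to 10:15 booking have different "slots" but share a window. The sketch below shows the standard interval test. It assumes bookings are stored as half-open windows `[start, end)`, so back-to-back bookings do not count as overlapping. `overlaps` is an illustrative name, not taken from the original code.

```python
from datetime import datetime, timedelta


def overlaps(a_start: datetime, a_end: datetime, b_start: datetime, b_end: datetime) -> bool:
    """Half-open windows [start, end) collide iff each one starts before the other ends."""
    return a_start < b_end and b_start < a_end


nine = datetime(2024, 1, 1, 9, 0)
# 9:00 to 10:30 vs 10:00 to 10:15: different "slots", same window
print(overlaps(nine, nine + timedelta(minutes=90), nine + timedelta(minutes=60), nine + timedelta(minutes=75)))  # True
# Back-to-back 9:00 to 10:00 and 10:00 to 11:00: no collision
print(overlaps(nine, nine + timedelta(hours=1), nine + timedelta(hours=1), nine + timedelta(hours=2)))  # False
```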

I should be honest about where I started: the obvious fix, locking at the database, was off the table from day one. The system was built for high availability, and a DB-level lock did not fit that architecture. So I did not try it and watch it fail; I ruled it out up front and designed around availability, using a Redis-based distributed lock with PostgreSQL as the durable source of booking truth. The important shift was to lock around the overlapping time-range conflict space, not just a fixed slot ID. For arbitrary-duration bookings, the collision is "same window," not "same slot."
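The write-up does not say how the conflict space was mapped to lock keys. One common technique is to split each resource's timeline into fixed-size buckets and lock every bucket a booking touches. Any two overlapping windows on the same resource must share at least one bucket, so they serialize against each other. The 15-minute `BUCKET` size, the `booking-lock:` key format and `lock_keys` are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

BUCKET = timedelta(minutes=15)  # assumed lock granularity, not from the original system
EPOCH = datetime(1970, 1, 1)


def lock_keys(resource_id: str, start: datetime, end: datetime) -> list[str]:
    """Return one lock key per fixed-size bucket touched by the half-open window [start, end).

    Two overlapping windows on the same resource always share a bucket, so locking
    every bucket serializes exactly the writes that could collide (plus some
    near-misses that only share a partial bucket, which is the conservative side).
    """
    if end <= start:
        raise ValueError("end must be after start")
    first = (start - EPOCH) // BUCKET
    last = (end - EPOCH - timedelta(microseconds=1)) // BUCKET  # end is exclusive
    # Ascending bucket order: every request acquires in the same order, avoiding lock cycles.
    return [f"booking-lock:{resource_id}:{b}" for b in range(first, last + 1)]
```

Each key could then be taken with a standard Redis `SET key token NX PX ttl`. In redis-rb that is `redis.set(key, token, nx: true, px: ttl)`. If any key is already held, the keys taken so far are released.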

It did not get clean overnight, and I would rather tell that part than pretend otherwise. Overbookings still slipped through for a while. They were not catastrophic, but they happened, and when they did, support took the angry customer calls. I did not hide behind the team for those; I got on calls with clients directly, explained what happened, and rebuilt trust. Every incident got a postmortem and was fixed on the first attempt, but "sometimes" is not "never."

So I added a reconciliation safety net: on every booking completion, and hourly, the system checked whether any overlapping bookings existed. It fired often, and that is the underrated part. The reconciliation did not just catch overlaps; it gave me the data to understand the real race before I had a proper fix in production. The safety net doubled as a diagnostic.
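The reconciliation check itself is a sweep over existing bookings. Below is a minimal in-memory sketch, assuming a hypothetical `Booking` record with half-open `[start, end)` windows. In production the same check would more likely run as a query against PostgreSQL, for example with range types and the `&&` overlap operator, but the write-up does not say how it was implemented. Sorting by start time means each booking is compared only with the later bookings that begin before it ends.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import groupby


@dataclass(frozen=True)
class Booking:
    id: int
    resource_id: str
    start: datetime
    end: datetime  # exclusive


def find_overlaps(bookings: list[Booking]) -> list[tuple[int, int]]:
    """Return (id, id) pairs of bookings on the same resource whose windows collide."""
    pairs = []
    ordered = sorted(bookings, key=lambda b: (b.resource_id, b.start))
    for _, group in groupby(ordered, key=lambda b: b.resource_id):
        rows = list(group)
        for i, a in enumerate(rows):
            # Rows are sorted by start: once one starts at or after a.end,
            # no later row on this resource can overlap a either.
            for b in rows[i + 1:]:
                if b.start >= a.end:
                    break
                pairs.append((a.id, b.id))
    return pairs
```

Each returned pair is a concrete double booking. It comes with timestamps that can be lined up against request logs, which is what lets a safety net like this double as a diagnostic.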

The shipped design was layered: Redis locks with automatic expiry to prevent deadlocks, optimistic version checks at the database as a belt-and-suspenders layer, and a fully idempotent booking API, so retries and double-clicks return the original booking instead of creating a second one. After it went in, double bookings dropped from around ten a week to roughly one every few weeks. About a year later, once the surrounding code was rewritten cleanly and the trailing tickets were closed, the count reached zero. At that point I removed the reconciliation entirely. It had done its job: it caught the overlaps, taught me the bug, and earned its own retirement.
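The sketch below covers two of the layers: locks that expire on their own, and an idempotent entry point. It uses an in-memory `FakeRedis` that mimics two Redis behaviours. The first is `SET key value NX PX ttl`. The second is deleting a key only if it still holds our token, which in real Redis is usually done with a small Lua script. `FakeRedis`, `book`, the dict-based idempotency `store` and the `create` callback are hypothetical. Rejecting immediately on a busy lock is an assumption; a real API might retry briefly first.

```python
import time
import uuid


class FakeRedis:
    """In-memory stand-in for SET NX PX and compare-and-delete (hypothetical, for illustration)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.data = {}  # key -> (value, expires_at)

    def set_nx_px(self, key, value, ttl_ms):
        current = self.data.get(key)
        if current and current[1] > self.clock():
            return False  # someone else holds an unexpired lock
        self.data[key] = (value, self.clock() + ttl_ms / 1000)
        return True

    def delete_if_equals(self, key, value):
        current = self.data.get(key)
        if current and current[0] == value:
            del self.data[key]
            return True
        return False


def book(redis, store, idempotency_key, lock_keys, create, ttl_ms=5000):
    """Hypothetical flow: replay on retry, lock every conflict key, then commit."""
    if idempotency_key in store:
        return store[idempotency_key]  # retry or double-click: same booking back
    token = str(uuid.uuid4())
    held = []
    try:
        for key in lock_keys:
            if not redis.set_nx_px(key, token, ttl_ms):
                return None  # conflict space busy: refuse rather than over-accept
            held.append(key)
        if idempotency_key in store:  # re-check under the lock
            return store[idempotency_key]
        booking = create()  # e.g. the optimistic, versioned database commit
        store[idempotency_key] = booking
        return booking
    finally:
        for key in held:
            redis.delete_if_equals(key, token)  # never release another holder's lock
```

The TTL is what keeps a crashed holder from blocking a window forever. The token check on release stops a slow holder whose lock has already expired from deleting a lock that a newer request now owns.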

What I gave up was some concurrency and some simplicity. Stronger coordination always costs something, and under contention the system is deliberately conservative because it is protecting consistency. That is a trade I take every time in a booking flow. The principle underneath all of it: address a problem at the root, not down the road. If I say the slot is yours, it should really be yours.

Tech stack

Redis, PostgreSQL, Rails