Syncing inventory and orders across platforms without losing events

The ticket said “we sold something we didn’t have.” Again. The storefront showed three units available, the management system said zero, and a customer had just paid for one of the ones that didn’t exist. The operations team already had a ritual for this: export both spreadsheets on Monday morning, cross-check them by hand, and fix the differences before someone else bought air.

Stock lived in one system, orders in another, and between them sat a cron that copied the full state from one side to the other every fifteen minutes. It worked 99% of the time. The problem is that an eCommerce store with real traffic doesn’t break in the 99%: it breaks in the 1% that lands right at peak sales, and that 1% is exactly where the cron fails.

This is the architecture that got me out of cross-checking spreadsheets by hand: stop syncing state, and start syncing facts.

The cron that lies to you

The periodic-cron pattern is tempting because it’s simple: every fifteen minutes, read all of the source’s stock and write it to the destination. If both sides agree, nothing happens. If they don’t, the last writer wins.

And that’s the trap, in three shapes:

The photo is born stale. Between the cron reading source and finishing the write to destination, seconds or minutes pass in which sales happen. Those sales weren’t in the photo, so destination ends up with a number that’s no longer true.
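The last writer overwrites the previous one. If a sale and a restock land in the same window, the final number depends on when the cron read and when it wrote, not on what actually happened. A snapshot read before the restock and written after it erases the restock. A snapshot read before the sale and written after it puts the sold unit back on the shelf, and you just gave away stock.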
A deploy mid-run eats an entire pass. The cron isn’t transactional with the rest of the world. If the process dies halfway, there’s no record of what it copied and what it didn’t. The next run starts from scratch and prays.

None of these is fixed by running the cron more often. Running it every minute instead of every fifteen only shrinks the error window; it doesn’t remove it, and it multiplies the load on two systems already busy selling.
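A minimal sketch of the first shape, the stale photo. It assumes two in-memory dicts stand in for the two systems; `take_snapshot` and `write_snapshot` are hypothetical names for the cron’s read and write steps, not code from the original setup.

```python
def take_snapshot(source):
    """Copy the full state, as the cron's read step would."""
    return dict(source)


def write_snapshot(snapshot, destination):
    """Overwrite the destination with whatever the snapshot said."""
    destination.update(snapshot)


source = {"ACME-001": 3}       # system that owns the stock
destination = {"ACME-001": 3}  # what the storefront shows

snapshot = take_snapshot(source)       # the cron reads 3
source["ACME-001"] -= 3                # three sales land while the pass is running
write_snapshot(snapshot, destination)  # the cron writes the stale 3

print(source["ACME-001"], destination["ACME-001"])  # prints: 0 3
```

The destination now advertises three units that no longer exist. Running the pass more often makes the gap between read and write smaller, but the gap is still there.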

Stop syncing state; sync facts

The fundamental shift is to stop asking “how much stock is there now?” and start recording “what happened?”. A sale isn’t a new stock number: it’s the fact that one unit of SKU X was sold in order Y, at a precise moment. A restock is another fact. Current stock is simply the result of applying every fact in order.

A fact has three properties a snapshot doesn’t: it’s immutable (it happened, you don’t rewrite it), it’s ordered (you know which came first), and it’s idempotent to apply if you design it well (applying it twice gives the same result as once). Those three properties are exactly what the cron was missing.
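To make those three properties concrete, here is a small sketch in Python. The `Fact` shape and the `replay` function are illustrative assumptions, and the sketch assumes `seq` is unique per fact so it can double as the duplicate check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # immutable: a fact cannot be edited after it is recorded
class Fact:
    seq: int     # ordered: position of the fact in the stream
    sku: str
    delta: int   # +N for a restock, -N for a sale
    reason: str


def replay(facts):
    """Fold facts into current stock per SKU, in seq order, applying each seq once."""
    stock = {}
    seen = set()
    for fact in sorted(facts, key=lambda f: f.seq):
        if fact.seq in seen:  # idempotent: a repeated fact changes nothing
            continue
        seen.add(fact.seq)
        stock[fact.sku] = stock.get(fact.sku, 0) + fact.delta
    return stock


facts = [
    Fact(1, "ACME-001", +5, "restock:po-1"),
    Fact(2, "ACME-001", -1, "order:9087"),
    Fact(2, "ACME-001", -1, "order:9087"),  # the same fact delivered twice
]
print(replay(facts))  # prints: {'ACME-001': 4}
```

Current stock is never stored as the source of truth here; it is derived, and replaying the same facts always yields the same number.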

Each change is born as a fact in the outbox table, inside the same transaction that caused it. The relay publishes it to the bus in order, and the consumer applies its effect exactly once at the destination, even when the bus delivers it more than once.

From here, the cron’s three problems turn into three concrete design decisions.

Hole 1: the event lost between your database and the bus

The first instinct when moving to events is: when I process a sale, I write to my database and then publish the event to the bus. Two operations, two systems. And that’s where the subtlest bug of all lives, the dual write: if the database commits but the publish to the bus fails (a timeout, a deploy, the bus down for a second), the sale exists but the event doesn’t. Nobody outside finds out. Stock ends up wrong with no trace of why.

You can’t make the two operations atomic if they’re two different systems. But you can write the change and its event in the same database transaction, into an outbox table:

-- The business change and its event are written in the SAME transaction.
-- If the transaction commits, both exist. If not, neither. Never one without the other.
BEGIN;

-- The event is inserted FROM the rows the UPDATE actually changed:
-- if the SKU is out of stock, the UPDATE touches nothing and no event is written.
WITH sold AS (
  UPDATE inventory
     SET qty = qty - 1
   WHERE sku = 'ACME-001' AND qty >= 1
  RETURNING sku
)
INSERT INTO outbox (id, aggregate, aggregate_id, seq, type, payload, created_at)
SELECT
  gen_random_uuid(),
  'inventory', sku,
  nextval('inventory_seq'),
  'stock.decremented',
  '{"sku":"ACME-001","delta":-1,"reason":"order:9087"}',
  now()
FROM sold;

COMMIT;
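The both-or-neither guarantee is easy to check locally. The sketch below uses Python’s built-in `sqlite3` as a stand-in for the real database, so the schema is an assumption: `seq` comes from an autoincrement column instead of a sequence, and `sell_one`, `fail_before_commit` and `state` are hypothetical names that exist only to simulate a crash.

```python
import json
import sqlite3
import uuid


def make_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INTEGER NOT NULL);
        CREATE TABLE outbox (
            seq          INTEGER PRIMARY KEY AUTOINCREMENT,
            id           TEXT NOT NULL UNIQUE,
            aggregate_id TEXT NOT NULL,
            type         TEXT NOT NULL,
            payload      TEXT NOT NULL,
            sent         INTEGER NOT NULL DEFAULT 0
        );
    """)
    return db


def sell_one(db, sku, order_id, fail_before_commit=False):
    """Decrement stock and record the event in one transaction. True if sold."""
    try:
        with db:  # commits on success, rolls back if the block raises
            cur = db.execute(
                "UPDATE inventory SET qty = qty - 1 WHERE sku = ? AND qty >= 1",
                (sku,),
            )
            if cur.rowcount == 0:
                return False  # nothing was decremented, so no event is written
            payload = json.dumps(
                {"sku": sku, "delta": -1, "reason": f"order:{order_id}"}
            )
            db.execute(
                "INSERT INTO outbox (id, aggregate_id, type, payload) "
                "VALUES (?, ?, ?, ?)",
                (str(uuid.uuid4()), sku, "stock.decremented", payload),
            )
            if fail_before_commit:
                raise RuntimeError("simulated crash before COMMIT")
    except RuntimeError:
        return False
    return True


def state(db):
    """Return (qty of ACME-001, number of outbox rows)."""
    qty = db.execute("SELECT qty FROM inventory WHERE sku = 'ACME-001'").fetchone()[0]
    events = db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
    return qty, events


db = make_db()
with db:
    db.execute("INSERT INTO inventory (sku, qty) VALUES ('ACME-001', 1)")

sell_one(db, "ACME-001", 9087, fail_before_commit=True)
after_crash = state(db)     # (1, 0): stock and outbox both untouched
sell_one(db, "ACME-001", 9087)
after_sale = state(db)      # (0, 1): one unit gone, one event recorded
sell_one(db, "ACME-001", 9088)
after_oversell = state(db)  # (0, 1): no stock, so no decrement and no event
```

At no point does the database hold a decrement without its event, or an event without its decrement.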

A separate process, the relay, reads the outbox and publishes to the bus. The golden rule is that it marks an event as sent only after the bus confirms it received it:

// Acme/Sync/Relay.php: reads the outbox in seq order and publishes.
// Assumes publish() throws on failure. The exception leaves the loop before markSent(),
// so the event stays unsent and is retried, and no later event overtakes it.
foreach ($this->outbox->unsent(batch: 100) as $event) {
    $this->bus->publish(
        topic: $event->type,
        payload: $event->payload,
        partitionKey: $event->aggregateId, // same SKU -> same partition (see hole 2)
    );
    // Only reached once the bus has confirmed the publish.
    $this->outbox->markSent($event->id);
}

The worst case now isn’t losing an event: it’s sending it twice (if the relay dies right between publish and markSent). And we solve that in hole 3, on purpose. We trade “can be lost” for “can be duplicated,” because the latter is fixable and the former isn’t.
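That trade can be simulated in a few lines. This is a sketch with in-memory stand-ins: `FakeBus`, `FakeOutbox` and the `die_after_publishing` switch are hypothetical and exist only to kill the relay at the worst possible moment.

```python
class FakeBus:
    def __init__(self):
        self.delivered = []  # event ids in the order the bus received them

    def publish(self, topic, payload, partition_key):
        self.delivered.append(payload["event_id"])


class FakeOutbox:
    def __init__(self, events):
        self.events = list(events)  # already in seq order
        self.sent = set()

    def unsent(self, batch):
        pending = [e for e in self.events if e["event_id"] not in self.sent]
        return pending[:batch]

    def mark_sent(self, event_id):
        self.sent.add(event_id)


def relay_pass(outbox, bus, die_after_publishing=None):
    """Publish unsent events in order; mark each as sent only after publishing."""
    for event in outbox.unsent(batch=100):
        bus.publish(event["type"], event, event["sku"])
        if event["event_id"] == die_after_publishing:
            raise RuntimeError("relay died between publish and mark_sent")
        outbox.mark_sent(event["event_id"])


outbox = FakeOutbox([
    {"event_id": "e1", "type": "stock.decremented", "sku": "ACME-001", "delta": -1},
    {"event_id": "e2", "type": "stock.decremented", "sku": "ACME-001", "delta": -1},
])
bus = FakeBus()

try:
    relay_pass(outbox, bus, die_after_publishing="e2")  # first pass crashes on e2
except RuntimeError:
    pass
relay_pass(outbox, bus)  # the next pass picks up where the outbox says it left off

print(bus.delivered)  # prints: ['e1', 'e2', 'e2']
```

Every event reached the bus, in order. The only damage is that `e2` arrived twice, which is the failure the consumer is designed to absorb.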

Hole 2: events that arrive out of order

An event bus doesn’t guarantee the consumer receives things in the order they happened, unless you ask for it explicitly. And for inventory, order matters. The deltas themselves commute, but the rules around them don’t: a +5 restock followed by a -1 sale is fine, while the same sale arriving first hits a SKU that has no stock yet and either gets rejected or drives the count negative.

Two pieces fix this together:

Partition by the aggregate. Every event for the same SKU has to go to the same bus partition, using the aggregate_id as the key (the relay’s partitionKey above). That way, within a SKU, order is preserved. Across different SKUs it doesn’t matter, and you gain parallelism for free.
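Number each event per aggregate. That seq in the outbox only has to increase within each SKU; it doesn’t have to be gap-free. The nextval('inventory_seq') above draws from one shared sequence, which satisfies that as long as it is called after the UPDATE has locked the row for the SKU. The consumer uses it to discard latecomers: if it already applied event number 7, a number 5 showing up afterwards is a straggler and gets ignored. Because events carry deltas, that is only safe when the straggler is a redelivery of something already applied, which is what per-SKU partitioning and an in-order relay give you. A number 5 that was never applied and gets dropped is a lost delta, and that is the kind of drift the reconciliation pass described later exists to catch.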
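A sketch of the partitioning half, assuming a plain hash-and-modulo scheme. Real buses ship their own partitioners, so `partition_for` and `route` are illustrative, not any bus’s actual API.

```python
import hashlib
from collections import defaultdict


def partition_for(aggregate_id, partitions):
    """Map an aggregate id to a partition, stable across processes and restarts.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used here instead.
    """
    digest = hashlib.sha256(aggregate_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def route(events, partitions):
    """Append each event to the queue of its SKU's partition, keeping arrival order."""
    queues = defaultdict(list)
    for event in events:
        queues[partition_for(event["sku"], partitions)].append(event)
    return dict(queues)


events = [
    {"sku": "ACME-001", "seq": 1, "delta": +5},
    {"sku": "ACME-002", "seq": 2, "delta": +3},
    {"sku": "ACME-001", "seq": 3, "delta": -1},
    {"sku": "ACME-002", "seq": 4, "delta": -1},
]
queues = route(events, partitions=4)

for partition, queue in sorted(queues.items()):
    print(partition, [(e["sku"], e["seq"]) for e in queue])
```

Whatever partition a SKU lands on, all of its events land there, in the order the relay published them. Two SKUs may share a partition or not; either way their relative order doesn’t matter.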

Hole 3: the event processed twice

From hole 1 we inherit a consumer that can receive the same event more than once, and the bus adds its own reality: almost every bus delivers “at least once.” So the consumer has to be idempotent by design: processing the same event twice can’t change the result.

The most solid way is to record which events you’ve already seen and apply the change in the same transaction, leaning on the seq from hole 2 to also ignore stragglers:

-- The consumer applies the event and records that it saw it, atomically.
BEGIN;

-- 1) Have I processed this event? event_id is unique; if it's already there,
--    nothing inserts and RETURNING yields no row.
WITH first_seen AS (
  INSERT INTO processed_events (event_id) VALUES ('e1f9...')
  ON CONFLICT (event_id) DO NOTHING
  RETURNING event_id
)
-- 2) Apply the change only if the event is new AND newer than the last one seen for the SKU.
UPDATE remote_inventory
   SET qty = qty + :delta,
       last_seq = :seq
 WHERE sku = :sku
   AND last_seq < :seq                     -- a straggler (lower seq) touches nothing
   AND EXISTS (SELECT 1 FROM first_seen);  -- a duplicate touches nothing

COMMIT;
-- Either way, ack to the bus only after the COMMIT.

With this, the consumer is immune to relay duplicates, bus retries, and late-arriving events. The price is a processed_events table you have to prune (a job that deletes anything older than the bus’s retention window, which means the table also needs a timestamp column to prune by), but it’s a cheap price for sleeping at night.
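Here is the same consumer logic as a runnable sketch, again with Python’s `sqlite3` standing in for the destination database. `apply_event`, its return labels, and the `processed_at` column are assumptions added for illustration; `INSERT OR IGNORE` is SQLite’s spelling of the conflict-skipping insert.

```python
import sqlite3


def make_consumer_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE processed_events (
            event_id     TEXT PRIMARY KEY,
            processed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP  -- lets a prune job delete by age
        );
        CREATE TABLE remote_inventory (
            sku      TEXT PRIMARY KEY,
            qty      INTEGER NOT NULL,
            last_seq INTEGER NOT NULL DEFAULT 0
        );
    """)
    return db


def apply_event(db, event_id, sku, seq, delta):
    """Apply one event atomically. Returns 'applied', 'duplicate' or 'stale'."""
    with db:  # one transaction: the dedupe record and the stock change commit together
        cur = db.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return "duplicate"  # seen before: change nothing, just ack
        cur = db.execute(
            "UPDATE remote_inventory SET qty = qty + ?, last_seq = ? "
            "WHERE sku = ? AND last_seq < ?",
            (delta, seq, sku, seq),
        )
        # In this sketch 'stale' also covers a SKU the destination doesn't know.
        return "applied" if cur.rowcount == 1 else "stale"


db = make_consumer_db()
with db:
    db.execute("INSERT INTO remote_inventory (sku, qty) VALUES ('ACME-001', 5)")

outcomes = [
    apply_event(db, "evt-7", "ACME-001", seq=7, delta=-1),  # first delivery
    apply_event(db, "evt-7", "ACME-001", seq=7, delta=-1),  # bus or relay retry
    apply_event(db, "evt-5", "ACME-001", seq=5, delta=-1),  # lower seq arriving after 7
]
qty = db.execute("SELECT qty FROM remote_inventory WHERE sku = 'ACME-001'").fetchone()[0]
print(outcomes, qty)  # prints: ['applied', 'duplicate', 'stale'] 4
```

Returning a label instead of silently succeeding is a deliberate choice in this sketch: counting 'duplicate' and 'stale' outcomes is a cheap way to notice when the upstream starts misbehaving.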

In distributed systems you don’t choose between “can fail” and “can’t fail.” You choose which kind of failure you’d rather have, and good designs choose the failure you can actually repair.

The safety net: reconciliation, not hope

So far the hot path is correct. But “correct in theory” and “correct for two years in production” are different things, and the difference is assuming that something, at some point, will desync anyway: a bug in a new consumer, a malformed event, a half-finished migration.

That’s why the design doesn’t end at the event flow. It ends with a reconciliation process that, every so often, compares the state of the two systems and corrects the differences. It sounds like the cron from the start, and that confusion is dangerous, so the difference matters:

The cron from the start was the sync mechanism: if it failed, there was nothing else.
Reconciliation is a safety net over a flow that’s already correct. It doesn’t move the bulk of the work; it only looks for drift that shouldn’t exist, reports it, and corrects it. If it finds a lot, that’s the alarm that something in the event flow is broken.

Reconciliation fixes the symptom; the metrics, below, tell you to go fix the cause.
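The core of a reconciliation pass fits in one function. This is a sketch under loud assumptions: both systems are reduced to dicts of SKU to quantity, the source is treated as the truth, a missing SKU counts as zero, and `alarm_threshold` is an arbitrary number. A real pass would also have to account for events still in flight, which this ignores.

```python
def reconcile(source, destination, alarm_threshold=5):
    """Find drift, correct the destination in place, and say whether to raise the alarm.

    Returns (drift, alarm) where drift maps sku -> (source_qty, destination_qty)
    as they were before the correction.
    """
    drift = {}
    for sku in sorted(set(source) | set(destination)):
        expected = source.get(sku, 0)
        actual = destination.get(sku, 0)
        if expected != actual:
            drift[sku] = (expected, actual)  # report it
            destination[sku] = expected      # correct it
    alarm = len(drift) >= alarm_threshold    # a lot of drift means the event flow is broken
    return drift, alarm


source = {"ACME-001": 3, "ACME-002": 0, "ACME-003": 7}
destination = {"ACME-001": 3, "ACME-002": 2}

first_drift, first_alarm = reconcile(source, destination)
second_drift, second_alarm = reconcile(source, destination)

print(first_drift)   # prints: {'ACME-002': (0, 2), 'ACME-003': (7, 0)}
print(second_drift)  # prints: {}
```

The number that matters is the size of `first_drift` over time. A pass that keeps finding nothing is the design working; a pass that suddenly finds a lot is a bug report about the event flow.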

Catch the drift before the customer does

The end goal isn’t to never fail. It’s to find out before the customer does. Three signals are worth more than any pretty dashboard:

Consumer lag: how many events sit in the bus unprocessed. If it climbs and doesn’t come down, the destination is falling behind and you’re on your way to selling air.
Differences per reconciliation: how many SKUs the last pass corrected. In a healthy regime it should be zero or near it. A jump is the early sign that a consumer broke.
Age of the oldest unsent event in the outbox: if the relay stalled, this grows. It’s the first thing I check when “stock isn’t updating.”

When these three are on a panel with alerts, the “we sold something we didn’t have” ticket stops arriving from the customer side and starts arriving from monitoring, which is where it should arrive.
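As a rough sketch, the three signals reduce to three numbers and three thresholds. Everything here is an assumption: the inputs would come from your bus, your reconciliation job and your outbox table, the threshold values are placeholders, and a static limit on lag is a simplification of “climbs and doesn’t come down.”

```python
from datetime import datetime, timedelta, timezone


def sync_health(latest_offset, committed_offset, reconcile_diffs, oldest_unsent_at, now,
                max_lag=1000, max_diffs=0, max_outbox_age=timedelta(minutes=1)):
    """Compute the three signals and list which ones crossed their threshold."""
    lag = latest_offset - committed_offset  # events in the bus not yet processed
    # An empty outbox (oldest_unsent_at is None) means nothing is waiting.
    outbox_age = (now - oldest_unsent_at) if oldest_unsent_at else timedelta(0)

    alerts = []
    if lag > max_lag:
        alerts.append("consumer_lag")
    if reconcile_diffs > max_diffs:
        alerts.append("reconcile_diffs")
    if outbox_age > max_outbox_age:
        alerts.append("outbox_age")

    return {
        "consumer_lag": lag,
        "reconcile_diffs": reconcile_diffs,
        "outbox_age": outbox_age,
        "alerts": alerts,
    }


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

healthy = sync_health(5000, 4990, 0, None, now)
stalled = sync_health(9000, 4990, 3, now - timedelta(minutes=10), now)

print(healthy["alerts"])  # prints: []
print(stalled["alerts"])  # prints: ['consumer_lag', 'reconcile_diffs', 'outbox_age']
```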

Checklist: inventory sync without losing events

Model changes as immutable events (stock.decremented, stock.replenished), not as state that gets copied.
Write the change and its event in the same transaction, into an outbox table. Never publish to the bus directly from business logic.
A separate relay reads the outbox and publishes; it marks as sent only after the bus confirms.
Partition by aggregate_id (the SKU) to guarantee order within each aggregate, and number events with a per-aggregate seq.
Make the consumer idempotent: record processed event_ids and apply only if the seq is newer than the last one seen.
Add a periodic reconciliation as a safety net, not as the main mechanism.
Measure consumer lag, differences per reconciliation, and age of the oldest outbox event. Alert on all three.

Closing

The fifteen-minute cron wasn’t badly written. It was solving the wrong problem: it treated a continuous, ordered stream of facts as if it were a photo you could just take again. Once the mental model went from “copy the state” to “carry facts, in order and exactly once,” inventory stopped being a source of tickets and the Monday spreadsheets disappeared.

The thesis I take away, and that comes back every time two systems have to agree: don’t ask how to keep them identical, ask how to tell one, without losing or reordering, what happened in the other.

If you have two systems that insist on disagreeing and you’re already cross-checking spreadsheets by hand, this is the kind of problem I work on. You can reach me.