Event-Driven Intelligence: Moving Retail Decisioning from Batch to Real-Time
I spent years watching smart people make decisions on stale data. Not because they wanted to, but because the infrastructure gave them no choice. A merchandiser would pull a markdown report at 9am, and by the time they'd worked through the spreadsheet, the shelf conditions had already changed. The report described yesterday. The decision needed to land today.
That gap between when something happened and when someone could act on it was the actual problem. Not the analytics. Not the models. The plumbing.
The batch trap
Most retail data platforms I'd seen ran overnight ETL jobs. Warehouse data landed by 6am if nothing broke, which it frequently did. Reports materialised by 8 or 9. Analysts built their views. Merchants made calls.
The cycle time from event to action was 12 to 24 hours. For a business operating across 130 countries with thousands of SKUs moving through supply chains continuously, that meant every decision carried built-in lag. Markdown timing, stock rebalancing, promotional response -- all operating on information that was already hours old by the time a human touched it.
Nobody questioned this because it was how things had always worked. The overnight batch was so deeply embedded in the operating culture that people planned their days around its output schedule. Meetings were booked for 9:30 because the data wouldn't be ready before then.
Why Kafka, and what we actually built
When we started designing the platform, the first decision was Kafka as the backbone. We needed a durable, ordered event log that 15+ business domains could subscribe to independently without coupling to each other. Kafka gave us that, plus the ability to replay history when things went wrong -- which they did, regularly, in the early months.
The architecture had a few layers worth explaining:
Event sourcing. Every business event -- a price change, a stock movement, a customer interaction -- got written as an immutable event to Kafka topics. No more "current state only" tables that lost the history of how you got there. When the commercial team wanted to understand why a markdown sequence had played out badly, we could replay the full chain of events rather than trying to reconstruct it from snapshots.
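The replay idea is easier to see in code. Here is a minimal in-memory sketch -- the ordered list stands in for a Kafka partition, and `PriceEvent` is an illustrative event shape, not our actual schema:

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Illustrative event shape -- field names are made up for this sketch.
@dataclass(frozen=True)
class PriceEvent:
    sku: str
    price: float
    ts: int  # event timestamp

def replay_price(events: Iterable[PriceEvent], sku: str, as_of: int) -> Optional[float]:
    """Reconstruct a SKU's price at any point in time by replaying the
    immutable event log -- something a mutable 'current state only'
    table cannot do, because it has thrown the history away."""
    price = None
    for e in events:  # events arrive in order, as within a Kafka partition
        if e.sku == sku and e.ts <= as_of:
            price = e.price
    return price

log = [
    PriceEvent("SKU-1", 40.0, 100),
    PriceEvent("SKU-1", 30.0, 200),  # first markdown
    PriceEvent("SKU-1", 20.0, 300),  # second markdown
]
```

Asking "what was the price when the second markdown landed, and what had happened before it?" becomes a replay over the log rather than an archaeology exercise over snapshots.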
Schema registry. With dozens of teams producing and consuming events, schema drift would have killed us within weeks. Confluent Schema Registry gave us contract enforcement between producers and consumers. A team couldn't push a breaking change to an event format without the registry rejecting it. This sounds mundane. It was probably the single most important thing we did for long-term maintainability. Without it, we'd have spent all our time chasing deserialisation failures instead of building anything useful.
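The check the registry performs is worth making concrete. Under Confluent's default BACKWARD mode, consumers on the new schema must still be able to read events written with the old one: fields may be deleted, and any added field must carry a default. A toy version of that rule (real Avro compatibility also covers type promotion, aliases, and unions, which this sketch ignores):

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Toy BACKWARD-compatibility check. Each dict maps a field name to
    its spec, e.g. {"type": "string", "default": "GBP"}. A new required
    field with no default would break consumers replaying old events,
    so the registry rejects it."""
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False  # added field with no default: breaking change
    return True  # deletions and defaulted additions are fine
```

A producer trying to sneak in a new mandatory field gets rejected at registration time, not discovered as a deserialisation failure in fifteen consumers a week later.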
Stream processing. Raw events are noise. The value sits in computed state -- aggregations, joins across streams, windowed calculations. We built stream processing layers that turned raw signals into things a human or a model could act on. "This product is selling 3x faster than forecast in the last 90 minutes" is a useful signal. A raw stream of individual transaction events is not, or at least not to a merchandiser staring at 400 SKUs.
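That "3x faster than forecast in the last 90 minutes" signal is just a sliding-window aggregation. A minimal sketch of the shape of that computation -- in production this ran in the stream processor, not in application code, and the class name is mine, not a real API:

```python
from collections import deque

class VelocityWindow:
    """Sliding-window sales counter: collapses a raw transaction
    stream into a single number a merchandiser can act on."""

    def __init__(self, window_secs: int = 90 * 60):
        self.window_secs = window_secs
        self.events = deque()  # (timestamp, units) pairs inside the window

    def add(self, ts: int, units: int) -> None:
        self.events.append((ts, units))
        # Evict anything that has fallen out of the 90-minute window.
        while self.events and self.events[0][0] <= ts - self.window_secs:
            self.events.popleft()

    def units_in_window(self) -> int:
        return sum(u for _, u in self.events)

    def vs_forecast(self, forecast_units: float) -> float:
        """Ratio of actual to forecast units for this window;
        3.0 means 'selling 3x faster than forecast'."""
        return self.units_in_window() / forecast_units
```

The point is the shape, not the code: the consumer sees one number per SKU, not 400 raw transaction streams.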
ML inference at the edge of the stream. This is where the investment started paying back in ways the business could feel. We deployed ML models directly on the stream, scoring events as they arrived rather than running batch predictions overnight. A demand signal hitting Kafka could trigger a pricing model, produce a recommendation, and surface it in a dashboard within seconds. The old batch process took hours to produce the same output, and by the time it arrived, the input data was already stale.
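Structurally, on-stream inference is simple: attach a model to the consumer loop and score each event on arrival instead of batching predictions overnight. A hedged sketch -- `model` here is any callable returning a markdown propensity in [0, 1], a stand-in for the real pricing model, and the output shape is illustrative:

```python
def score_stream(events, model, threshold=0.8):
    """Score each event as it arrives and emit a recommendation
    immediately. In production the input was a Kafka consumer and the
    output went to a topic feeding the dashboard; here both are plain
    Python iterables."""
    for event in events:
        p = model(event)
        if p >= threshold:
            yield {"sku": event["sku"], "action": "review_markdown", "score": p}
```

The batch version of this is the same loop run once a night over millions of rows. Moving it onto the stream changes nothing about the model and everything about when its output arrives.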
What "hours to milliseconds" actually means
Latency numbers are easy to throw around. What mattered was the decision quality on the other end.
Take markdown decisions. In fashion retail, the timing of a markdown can swing margin by several percentage points on a line. Mark down too early and you leave money on the table. Too late and you're stuck with stock that won't move at any price. The old process had merchandisers working from yesterday's sell-through rates, applying judgment to data that had already drifted.
With event data feeding ML models in real time, the system could flag a product falling behind its sell-through curve within hours of the trend starting -- not the next morning. The merchandiser still made the call. We didn't automate the decision away. But the signal arrived while there was still time to act on it, which is a different thing entirely from receiving a report that confirms what you suspected yesterday.
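The flag itself can be as simple as comparing cumulative sell-through against the planned curve for that point in the season. A deliberately tiny sketch -- the tolerance value is illustrative, not a recommendation, and the real models were considerably richer than a single threshold:

```python
def behind_curve(actual_sell_through: float,
                 expected_sell_through: float,
                 tolerance: float = 0.15) -> bool:
    """Flag a line whose cumulative sell-through (fraction of initial
    buy sold) has fallen more than `tolerance` below the planned curve.
    Both inputs are fractions in [0, 1]."""
    return (expected_sell_through - actual_sell_through) > tolerance
```

The difference the platform made was not this arithmetic -- it was that the inputs were hours old instead of a day old when the comparison ran.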
Across the business, operational efficiency improved by about 45%. Most of that came from eliminating the lag between signal and response, not from any single algorithm. Faster data, same humans, better outcomes.
Data mesh, because one team can't own everything
Fifteen business domains can't all funnel through a central data team. We adopted a data mesh approach: each domain owned its own events, schemas, and stream processing. My team owned the infrastructure -- Kafka clusters, schema registry, monitoring, the shared tooling that made it possible to stand up a new domain in days rather than months.
The alternative, a centralised team trying to model every domain's events, doesn't scale past about three domains before it becomes a bottleneck. I've watched that play out at multiple organisations. The central team becomes a ticket queue. The business goes back to spreadsheets because they're faster than waiting six weeks for a data engineering sprint slot.
The agent infrastructure question
There is plenty of talk about agentic AI right now. Autonomous systems that monitor, decide, and act. Most of the conversation I hear skips straight past the question of where the agent gets its data.
If your agent reads from a warehouse that refreshes overnight, you have an expensive batch processor with a chat interface. It can't respond to something that happened 20 minutes ago because it doesn't know about it yet. It reasons over yesterday's state of the world, which is exactly the problem we were trying to solve for human decision makers five years ago.
Event infrastructure -- Kafka, schema registry, stream processing, inference on the stream -- is what makes agents genuinely useful. An agent that subscribes to a topic, receives events as they happen, runs inference, and acts within seconds is doing something really different from one that polls a database every hour. The first one can catch a demand spike while it's still building. The second one writes you a summary of the spike the next morning.
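The skeleton of the first kind of agent fits in a dozen lines. A sketch under loud assumptions: `events` stands in for a Kafka consumer, and `infer` and `act` are hypothetical callables, not a real agent framework's API:

```python
def run_agent(events, infer, act):
    """Event-driven agent loop: consume each event as it arrives, run
    inference, and act in the same iteration. No polling interval, no
    overnight batch -- the latency budget is the loop body itself.
    Returns the number of events handled."""
    handled = 0
    for event in events:
        decision = infer(event)   # e.g. the on-stream pricing model
        if decision is not None:
            act(decision)         # e.g. raise an alert, write to a topic
        handled += 1
    return handled
```

The polling version wraps the same `infer` and `act` in a timer and a database query, and inherits the staleness of whatever that database last loaded. Same model, different plumbing, very different usefulness.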
Organisations that invested in real time data platforms years ago now have a head start on agentic systems. The ones that didn't are finding that the distance between "we have an AI strategy" and "we have the infrastructure to execute it" is measured in years, not months. You can't skip the plumbing.
One thing I'd do differently
I'd have pushed harder on schema governance from day one. We got the registry in early, but the cultural work of getting teams to treat event schemas as public APIs -- versioned, documented, owned -- took longer than any of the technical work. The technology was the easy part. Changing how 15 teams thought about data ownership was the hard part, and I underestimated it.
If you're building this kind of platform now, start there. The Kafka cluster will be fine. The schema arguments at 4pm on a Thursday are what will actually determine whether the thing works.