MLOps at Scale: Model Deployment in Days, Not Months
Two years ago, getting a model from a data scientist's notebook into production took us somewhere between six weeks and four months. The process involved a Confluence page nobody read, a series of meetings where people argued about test coverage thresholds, a handoff to a platform team who had to reverse-engineer the training code, and eventually a deployment that may or may not have matched what was validated in staging.
We had over 50 models in production across a global platform. Every one of them got there painfully. And the worst part wasn't the slowness. It was that by the time a model finally landed in production, the data it was trained on was already stale.
Why it took months
Nobody had built the plumbing.
Data scientists trained models on their laptops or in shared notebooks. Features were computed ad hoc, sometimes differently between training and serving. There was no feature store, so the same business logic got reimplemented three or four times, each version slightly different. When predictions in production didn't match what the data scientist saw in their notebook, nobody could tell if it was a code bug, a data skew issue, or both.
Validation was manual. Someone would pull a sample, eyeball the distributions, compare a few metrics against the previous model, and write "looks good" in a Jira ticket. No automated checks. No regression suite. No record of what "good" meant for that particular model.
Monitoring was worse. We'd find out a model had degraded when a product manager noticed a dashboard moving in the wrong direction. By then, drift might have been accumulating for weeks. One pricing model ran for three months on stale features before anyone caught it. That was an expensive quarter.
Then there was the ownership question, which was really a political question. Who is responsible when a model's predictions go wrong? The data scientist who trained it? The platform engineer who deployed it? The product team who defined the success metric? Nobody wanted that accountability, so everyone built their own safety buffers. Safety buffers, in practice, meant more meetings and more delay.
What we built
The MLOps platform wasn't one project. It was a series of infrastructure investments over about eighteen months, none of which made for a good slide deck.
Feature store
We started here because it solved the most embarrassing problem: training/serving skew. A centralised feature store meant features were computed once, versioned, and served consistently to both training pipelines and production inference. Data scientists stopped copy-pasting SQL into notebooks, which alone probably prevented a dozen bugs a quarter.
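The core idea fits in a few lines: one definition per feature, one read path shared by training and serving. This is a minimal in-memory sketch, not our actual implementation; all names here are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureStore:
    # (entity_id, feature_name, version) -> value
    _values: dict = field(default_factory=dict)

    def write(self, entity_id, name, version, value):
        self._values[(entity_id, name, version)] = value

    def read(self, entity_id, name, version):
        # Both the training pipeline and the inference service call this
        # same method, so the feature logic can't silently diverge.
        return self._values[(entity_id, name, version)]

store = FeatureStore()
store.write("user_42", "orders_30d", "v2", 7)
training_value = store.read("user_42", "orders_30d", "v2")
serving_value = store.read("user_42", "orders_30d", "v2")
```

The point is the shared read path: once training and serving consume the same versioned value, training/serving skew becomes structurally impossible rather than something you hope to catch in review.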
Model registry and versioning
Every model got a lineage record. Which dataset version trained it, which features it consumed, which hyperparameters were used, what the evaluation metrics looked like at training time. When something went wrong in production, we could trace backwards instead of guessing. Tracing backwards sounds obvious. Before the registry, the actual debugging process was closer to archaeology.
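A lineage record is just a structured answer to "what produced this model?". A hypothetical sketch of the shape we stored, with illustrative field names:

```python
from dataclasses import dataclass, field

@dataclass
class ModelLineage:
    model_name: str
    model_version: str
    dataset_version: str
    feature_versions: tuple          # feature-store versions consumed
    hyperparameters: dict = field(default_factory=dict)
    eval_metrics: dict = field(default_factory=dict)  # metrics at training time

registry = {}

def register(lineage):
    registry[(lineage.model_name, lineage.model_version)] = lineage

def trace(model_name, model_version):
    # "Tracing backwards": given a production model, recover exactly
    # what trained it instead of guessing.
    return registry[(model_name, model_version)]

register(ModelLineage(
    model_name="pricing",
    model_version="3.1",
    dataset_version="2023-q2",
    feature_versions=("orders_30d:v2", "region:v1"),
    hyperparameters={"max_depth": 8},
    eval_metrics={"auc": 0.91},
))
```

When a pricing prediction looks wrong, `trace("pricing", "3.1")` gives you the dataset, features, and training metrics in one lookup.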
Automated validation gates
This was the most politically difficult piece. We defined per-model validation criteria: accuracy thresholds, fairness checks, latency budgets, data quality checks on input features. A model couldn't reach production without passing every gate. No exceptions, no "let's just push it and monitor."
Data scientists hated this at first. It felt like bureaucracy. Within about three months, it became the thing they valued most, because it meant they stopped getting paged at 2am for problems that should have been caught before deployment.
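The gate logic itself is simple; the politics were the hard part. A sketch of the "every gate must pass, no exceptions" check, with made-up metric names and thresholds:

```python
def passes_gates(candidate_metrics, gates):
    """Return (ok, failures). Every gate must pass -- no exceptions."""
    failures = []
    for gate_name, (metric, threshold, direction) in gates.items():
        value = candidate_metrics[metric]
        ok = value >= threshold if direction == "min" else value <= threshold
        if not ok:
            failures.append(gate_name)
    return (not failures, failures)

# Per-model configuration: accuracy, fairness, latency, data quality.
gates = {
    "accuracy":     ("auc",        0.85, "min"),
    "fairness":     ("parity_gap", 0.05, "max"),
    "latency":      ("p99_ms",     50.0, "max"),
    "data_quality": ("null_rate",  0.01, "max"),
}
candidate = {"auc": 0.91, "parity_gap": 0.03, "p99_ms": 42.0, "null_rate": 0.002}
ok, failures = passes_gates(candidate, gates)
```

Because the gates are data, not code, each model gets its own thresholds without forking the pipeline, which matters later in this story.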
Canary deployments for models
We borrowed the canary pattern from software engineering. New model versions served a small percentage of traffic alongside the incumbent. We compared prediction distributions, latency, and downstream business metrics. If the canary looked wrong, it rolled back automatically. No human in the loop for the rollback decision.
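The automated rollback decision can be sketched as a pure function over the canary's observations. The thresholds and signals here are illustrative (our real comparison also covered downstream business metrics), but the shape is the point: the decision is deterministic and nobody has to be woken up to make it.

```python
import statistics

def canary_decision(incumbent_preds, canary_preds,
                    canary_p99_ms, latency_budget_ms,
                    max_mean_shift=0.05):
    """Promote or roll back a canary model -- no human in the loop."""
    # Blow the latency budget? Roll back.
    if canary_p99_ms > latency_budget_ms:
        return "rollback"
    # Prediction distribution shifted too far from the incumbent? Roll back.
    mean_shift = abs(statistics.mean(canary_preds)
                     - statistics.mean(incumbent_preds))
    if mean_shift > max_mean_shift:
        return "rollback"
    return "promote"
```

A real version would compare full distributions rather than means, but even this crude check catches the "new model quietly predicts twice the old price" class of failure.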
Shadow scoring came before canary. Before a model went live at all, it ran in shadow mode: receiving real production traffic, generating predictions, but not serving them to users. We compared shadow predictions against the live model for a few days. This caught problems that offline evaluation missed, particularly edge cases in production data that didn't exist in test sets. A model could look perfect on held-out data and still behave strangely when it hit real user traffic at scale.
Drift detection and automated retraining
Models decay. User behaviour shifts, upstream data sources change, and a model trained on last quarter's data makes worse predictions this quarter. We monitored both input feature distributions and prediction output distributions. When drift exceeded a threshold, automated retraining kicked in. The new model went through the same validation gates and canary process, and if it passed, it promoted itself.
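One common way to quantify distribution drift is the Population Stability Index (PSI), which compares a baseline sample against a recent one; a widely used rule of thumb treats PSI above 0.2 as significant drift. The source doesn't specify which statistic we used, so treat this as a representative sketch:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample ("expected")
    and a recent production sample ("actual")."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def bucket_fracs(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # small floor so empty buckets don't blow up the log
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_fracs(expected), bucket_fracs(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def should_retrain(baseline, recent, threshold=0.2):
    # When this fires, the retrain pipeline kicks off automatically;
    # the new model still has to pass every validation gate and the canary.
    return psi(baseline, recent) > threshold
```

The same check works on input feature distributions and on prediction output distributions, which is why we monitored both: input drift warns you early, output drift confirms the model is actually behaving differently.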
About 90% of model lifecycle management ended up automated. The remaining 10% were models where the business context had shifted enough that retraining on recent data wasn't sufficient. Those needed a human to rethink the approach entirely. No amount of automation fixes a model whose premise is wrong.
The cultural fight
The hardest part had nothing to do with infrastructure.
Data scientists, at the time, did not think of themselves as software engineers. Their work lived in notebooks. Version control was optional. Tests were something other people wrote. The idea that their model code would go through CI/CD -- linting, unit tests, integration tests -- felt like an insult to their expertise.
I get why. They'd spent years developing statistical intuition, and here was a platform team telling them their code had to meet the same bar as a microservice. It felt reductive.
But models in production are software. They take inputs, produce outputs, and when they break, users suffer. The conversation that actually shifted things wasn't about standards or process. It was about on-call. Once data scientists understood that proper CI/CD and automated validation meant fewer midnight pages and fewer "the model is broken" escalations from product teams, adoption followed. Not enthusiastically. But steadily.
The ownership question resolved itself once tooling existed. With a model registry, validation gates, and monitoring, accountability became traceable rather than political. The data scientist owned model quality. The platform team owned infrastructure reliability. The product team owned the decision of which model to use and when. Clear lines, backed by tooling rather than org charts.
What "days not months" actually means
I want to be precise about this, because the claim is easy to misread.
A data scientist with a trained model that uses existing features from the feature store can get it into production -- fully validated, shadow scored, canaried, promoted -- in two to three days. Previously, six to eight weeks minimum.
A new model that needs new features built in the feature store: one to two weeks. Previously, three to four months.
An automated retrain triggered by drift detection goes from training completion to production in about four hours, with no human involvement. Previously, retraining was manual. Someone had to notice the degradation first, then repeat the entire deployment cycle. Two to three weeks if you were lucky.
Model reliability improved by roughly 85%, measured as a reduction in production incidents: rollbacks, emergency fixes, prediction errors that reached customers. That number is mostly about validation gates and monitoring rather than speed, but speed helped too. When you can deploy a fix in hours instead of weeks, problems stay small.
What I'd do differently
Build the feature store first. We started with the model registry, which was useful but didn't fix training/serving skew, the thing causing the most production incidents. Six months of pain for a sequencing mistake.
I'd also make the validation gates configurable per model from day one. Our first version was one-size-fits-all. Thresholds were too loose for high-stakes models and too strict for experimental ones. We spent another quarter fixing that, and in the meantime we had both false confidence and unnecessary friction running in parallel.
None of this work demos well. Nobody gets promoted for building a feature store. But if you're running models in production at any real scale, it's the difference between a team that ships and a team that firefights. We stopped firefighting.