Building an Enterprise LLM Platform That Actually Ships

Most enterprise LLM projects die in pilot. I've watched it happen from the inside, and I've been the one picking through the wreckage afterwards. A team spins up a proof of concept with OpenAI's API, someone demos it to the board, everyone gets excited, and then nothing ships. Six months later the budget is gone and the Slack channel is archived.

I spent the better part of two years building an LLM platform at a global company: fine-tuning pipelines, RAG infrastructure, prompt engineering frameworks, safety controls, the lot. We deployed 15 production applications. We got 90% accuracy on domain-specific tasks and cut content creation time by 70%. But none of that happened because we were clever. It happened because we were boring in the right places.

Why pilots die

The first thing that goes wrong is nobody builds an evaluation framework. Teams get a demo working, the output looks plausible, and they move on. Then it goes to production and starts confidently generating nonsense about products that don't exist. Without evaluation, you genuinely cannot tell the difference between a model that works and one that merely sounds like it does.

Cost kills the rest. GPT-4 class models are expensive at volume. I've seen teams burn through five figures in a month on a single application because nobody modelled the token economics beforehand. Finance notices. The project gets paused for "review." It never comes back.

Then there's governance. Who owns the model outputs? What happens when the system generates something defamatory? Who approved the training data? In a regulated environment these aren't theoretical. They're the questions legal and compliance will ask before anything goes live. No answers, no launch.

Eval first, everything else second

We built the evaluation framework before we built the platform. This felt wrong at the time: the team wanted to start building applications and I was insisting on test harnesses.

We defined accuracy metrics per domain. We built automated evaluation pipelines that ran against golden datasets. We tracked hallucination rates weekly. Early on, our hallucination rate on product descriptions sat around 23%. Roughly one in four outputs contained fabricated attributes. That number had to come down below 5% before we'd let anything near a customer-facing surface.
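The hallucination check above can be sketched as a set comparison against a golden record: an output "hallucinates" when it asserts an attribute the golden dataset doesn't contain. The function names and the attribute-set representation here are illustrative, not the platform's actual schema.

```python
def hallucinated_attributes(output_attrs: set, golden_attrs: set) -> set:
    """Attributes the model asserted that the golden record doesn't contain."""
    return output_attrs - golden_attrs


def hallucination_rate(results: list) -> float:
    """Fraction of outputs containing at least one fabricated attribute.

    `results` is a list of (output_attrs, golden_attrs) pairs, one per
    evaluated generation.
    """
    if not results:
        return 0.0
    bad = sum(1 for out, gold in results if hallucinated_attributes(out, gold))
    return bad / len(results)
```

Run weekly against the golden dataset, a number like this is what lets you say "we were at 23%, we're now under 5%" instead of arguing from vibes.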

The eval framework also gave us something political: evidence. When a stakeholder asked why we couldn't just use ChatGPT directly, we had data to point at instead of opinions.

Pydantic saved us more than once

Structured outputs were non-negotiable. Every LLM call in our platform returned data validated against Pydantic schemas. This sounds like over-engineering until your model returns a JSON blob with a missing field and your downstream service falls over at 2am on a Sunday. We caught malformed outputs at the boundary, logged them, retried or fell back gracefully. Schema validation is tedious work. It was also the single biggest contributor to production stability.
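The boundary-validation pattern looks roughly like this. The schema is a made-up example (the real ones were per-application), and returning `None` stands in for whatever retry-or-fallback signal your service uses.

```python
import json
from typing import Optional

from pydantic import BaseModel, ValidationError


class ProductDescription(BaseModel):
    """Illustrative schema; real schemas were defined per application."""
    sku: str
    title: str
    body: str


def parse_llm_output(raw: str) -> Optional[ProductDescription]:
    """Validate an LLM response at the boundary.

    Returns None to tell the caller to retry or fall back, instead of
    letting a malformed payload reach downstream services.
    """
    try:
        return ProductDescription(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError, TypeError):
        # In production this branch also logged the raw payload for triage.
        return None
```

The point is that the failure happens here, loudly and loggably, rather than three services downstream at 2am.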

The cost problem nobody talks about

Our first month in production, one application cost us four times the budget. The culprit was context windows. We were stuffing entire documents into prompts because it was easy and the retrieval pipeline wasn't ready yet. Easy and expensive turned out to be the same thing.

We built a cost allocation model that tracked spend per application, per team, per model. Teams got dashboards. They could see what their prompts cost. Behaviour changed fast once prompt length showed up as a line item their manager could read.
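A minimal version of that per-application rollup is just token counts times a price table. The model names and prices below are placeholders, and real pricing usually differs for prompt versus completion tokens; it's collapsed here for brevity.

```python
# Illustrative per-1K-token prices, not real rates.
PRICE_PER_1K = {"small-model": 0.0005, "large-model": 0.03}


def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of a single LLM call under the flat price table above."""
    return (prompt_tokens + completion_tokens) / 1000 * PRICE_PER_1K[model]


def rollup(calls: list) -> dict:
    """Aggregate spend per application.

    `calls` is a list of (app, model, prompt_tokens, completion_tokens)
    rows, the kind of thing you'd pull from request logs.
    """
    totals: dict = {}
    for app, model, p, c in calls:
        totals[app] = totals.get(app, 0.0) + call_cost(model, p, c)
    return totals
```

Feed a dashboard from something like this and prompt length stops being an abstraction; it becomes a line item.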

LoRA fine-tuning helped on tasks where we had enough domain data. Instead of sending massive prompts to a general-purpose model, we fine-tuned smaller models for specific jobs: product categorisation, compliance checking, metadata extraction. Cheaper to run and often more accurate on narrow tasks. Not always, though. We had one fine-tuning run that produced a model so confidently wrong about regulatory classifications we had to kill it entirely. The training data was the problem. It usually is.

RAG: harder than the blog posts suggest

We used pgvector for our vector store, mainly because we were already running PostgreSQL and didn't want another database to babysit. Not the fastest option, but operationally simple. Our SRE team could keep doing what they already knew how to do.

The hard part of RAG was never the vector search. It was chunking. How do you split a 200-page technical manual so the retrieved passages actually contain the answer? We went through three strategies. Fixed-size splits were terrible. Semantic splitting by section headers was better but missed answers that spanned sections. Overlapping chunks with document structure metadata got us to good enough. Three attempts in four months. That was just the chunking.
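The third strategy, overlapping windows carrying document-structure metadata, can be sketched in a few lines. The chunk size, overlap, and field names are illustrative; the real values came out of those four months of experiments.

```python
def chunk_with_overlap(text: str, section: str,
                       size: int = 800, overlap: int = 200) -> list:
    """Split text into fixed windows that overlap, so answers spanning a
    boundary still land whole in at least one chunk. Each chunk keeps
    structure metadata (section title, character offset) for retrieval
    and for reassembling context later.
    """
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append({
            "text": text[start:start + size],
            "section": section,   # document-structure metadata
            "offset": start,      # where this chunk sits in the source
        })
    return chunks
```

Fixed-size splitting is one parameter away from this, which is exactly why it's tempting and exactly why it misses boundary-spanning answers: without the overlap and the metadata, retrieval returns fragments with no way to recover their surroundings.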

LangChain handled orchestration and I have mixed feelings about it. It sped up prototyping enormously but introduced abstractions that made production debugging painful. We ended up writing custom chains for critical paths and kept LangChain for the lower-stakes pipelines. Honest assessment: I'm still not sure the framework earned its complexity on our project. Maybe on a smaller one.

The politics of governance

Building the responsible AI framework was the hardest part of the whole project. Almost none of that difficulty was technical.

We needed sign-off from legal, compliance, information security, data privacy, and the business units. Each group had different anxieties. Legal worried about IP in training data. Compliance worried about the EU AI Act. Infosec worried about prompt injection attacks. The business units worried we'd slow them down with process. All of them were right to worry.

We built governance into the platform itself. Every application went through risk assessment before deployment. Every model had a data lineage record. Every prompt template was version-controlled and auditable. This slowed us down in Q1. By Q2 it was the reason we shipped faster than anyone else internally, because we'd already answered the questions that were blocking other teams for weeks.
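One cheap way to make prompt templates auditable, sketched below under the assumption that you want tamper-evidence rather than a full registry: derive the version identifier from the template content itself, so any change to a deployed prompt is detectable from the audit log. The field names are hypothetical.

```python
import hashlib


def template_version(template: str) -> str:
    """Content-derived version id: any edit to the template changes it."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def audit_record(app: str, template: str) -> dict:
    """What a deployment-time audit entry might capture (illustrative)."""
    return {
        "app": app,
        "template_version": template_version(template),
        "template": template,
    }
```

This doesn't replace version control; it complements it, because the hash travels with every request log and lets you answer "exactly which prompt produced this output?" months later.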

The worst moment was finding out that a business unit had tried to deploy a customer-facing chatbot without going through the review process. We found out because someone in legal forwarded us a screenshot. That conversation was not pleasant. But it was the conversation that got governance taken seriously at exec level, so I suppose I should be grateful for it.

What the numbers actually mean

Fifteen production applications sounds tidy. We killed about the same number. Some failed in evaluation because accuracy was too low. Some failed on cost because the economics didn't hold at scale. Two died because the business problem they were solving turned out to be fictional: people thought they needed something, built a case for it, and then nobody used the output.

The 70% reduction in content creation time was real, measured over six months against a control group. But it took eight months of iteration to reach that figure. The first version saved maybe 20%, and people were underwhelmed. Patience is a surprisingly large part of shipping AI products.

The 90% accuracy number is an average across domain-specific tasks after fine-tuning. Some tasks hit 96%. Others sat stubbornly at 84% and we had to make a judgment call about whether that was acceptable for the use case. Sometimes it was. For the compliance classification task, it wasn't, and we spent another two months on training data before it got where it needed to be.

If I were starting over

Cost model on day one. Not week six. That delay cost us credibility with finance that took months to rebuild.

I'd also spend far more time on RAG chunking strategy before writing any application code. We treated it as a configuration problem when it was really a data modelling problem, and paid for that misunderstanding in rework.

And I'd hire a technical writer to document the governance process. Engineers writing policy documents produce policy documents that nobody reads. I know because I wrote several of them.

Enterprise LLM platforms don't ship in a quarter. Ours took about a year to reach the point where we trusted it, the business trusted it, and compliance trusted it. The gap between a compelling demo and a production system that people actually rely on is enormous. Most of the work lives in that gap, and most of it isn't interesting. Eval harnesses, cost dashboards, governance paperwork, chunking experiments, schema validation. The tedious parts are where the platform gets built. Everything else is just a demo.