Building LLM Features That Don't Rot After Launch

Your chatbot works beautifully in the demo. Six months after launch it's hallucinating half the time, costs three times what you projected, and nobody on the team can tell you why. This is the rot problem, and it's almost always caused by engineering decisions that were never made.

The rot nobody talks about

LLM features rot silently. A traditional software bug crashes, throws an error, shows up in a stack trace. An LLM bug produces fluent nonsense. The system is happy. The user is confused. By the time a pattern is noticeable in support tickets, the model has been drifting for weeks.

This happens for four reasons that stack on top of each other. Foundation models change under you. Your data distribution shifts. Prompts that were tuned for edge cases six months ago no longer match the cases that are actually arriving. And cost per request creeps up as users push the system harder than it was scoped for.

None of that is an LLM problem. It's an engineering problem. The teams that build LLM features that age well treat model behavior as a software surface that needs the same discipline as a database or a payment gateway.

Evaluation is not optional

The single biggest predictor of whether an LLM feature will still work in a year is whether the team set up an evaluation harness before launch. Not after the first regression. Before.

What an eval harness actually includes

A frozen evaluation set. 100 to 300 real examples with expected answers or acceptance criteria. You write it once, you don't touch it, and you run it against every prompt change.
Automated scoring. A mix of exact-match, semantic similarity, and LLM-as-judge where appropriate. The goal is not perfect grading; it's catching regressions early.
A dashboard that compares runs. New prompt vs old prompt, new model vs old model. You need to see deltas at a glance, not dig through logs.
Production sampling. A percentage of real traffic gets logged and spot-checked every week. That's how you catch the drift your eval set won't show.

Without this, every prompt change is a guess. With it, you know within an hour whether a change helped or hurt. It's the difference between engineering and vibes.
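
A minimal sketch of the harness described above, assuming exact-match scoring only and two stand-in functions in place of real model calls. Every name here (EvalCase, run_eval, compare_runs) is illustrative rather than taken from a specific tool; a real setup would likely add semantic or judge-based scoring and persist each run so a dashboard can read it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One frozen example: an input and the answer we expect."""
    case_id: str
    prompt_input: str
    expected: str


def exact_match(expected: str, actual: str) -> float:
    """Score 1.0 when answers match after trimming and lowercasing, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_eval(cases: list[EvalCase], model: Callable[[str], str]) -> dict[str, float]:
    """Run every case through the model and return a score per case id."""
    return {c.case_id: exact_match(c.expected, model(c.prompt_input)) for c in cases}


def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict:
    """Summarize the delta between two runs over the same frozen set."""
    n = max(len(baseline), 1)  # avoid dividing by zero on an empty set
    return {
        "baseline_mean": sum(baseline.values()) / n,
        "candidate_mean": sum(candidate.get(k, 0.0) for k in baseline) / n,
        "regressions": sorted(k for k in baseline if candidate.get(k, 0.0) < baseline[k]),
        "improvements": sorted(k for k in baseline if candidate.get(k, 0.0) > baseline[k]),
    }


# Hypothetical frozen set; a real one would hold 100 to 300 cases.
CASES = [
    EvalCase("refund", "Can I get a refund after 30 days?", "no"),
    EvalCase("hours", "Are you open on Sunday?", "yes"),
]


def old_prompt(text: str) -> str:
    """Stand-in for the current prompt plus model."""
    return "no" if "refund" in text else "yes"


def new_prompt(text: str) -> str:
    """Stand-in for a changed prompt that regresses on refunds."""
    return "yes"


report = compare_runs(run_eval(CASES, old_prompt), run_eval(CASES, new_prompt))
print(report)
```

The point of the report is the list of regressed case ids, not only the mean: an average can hold steady while an important case quietly breaks.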

Every LLM feature has a shelf life. The teams that treat that as a given ship systems that age well. The ones that don't ship demos that age badly.

Cost monitoring from day one

The second silent killer is cost. Not because the unit economics are bad, but because nobody is watching them. A product ships with a well-scoped prompt, users start using it in ways nobody anticipated, context windows expand, and suddenly the cost per request has doubled. You find out on the invoice.

The fix is cheap if you build it in from day one. Log token counts and dollar cost per request. Tag them by user, feature, and tenant. Alert on anomalies. The cost layer should look a lot like your error layer: instrumented, visible, and wired into on-call.
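
As a rough sketch of that cost layer, the snippet below logs a tagged cost per request and flags a request that costs more than a multiple of the feature's running average. The prices, the spike factor, and the names (RequestCost, CostLog) are all assumptions made for illustration; real rates come from your provider's price list, and a production version would write to a metrics store rather than memory.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# Assumed prices in dollars per 1,000 tokens; placeholders, not real rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


@dataclass(frozen=True)
class RequestCost:
    """Token counts for one request, tagged by user, feature, and tenant."""
    user: str
    feature: str
    tenant: str
    input_tokens: int
    output_tokens: int

    @property
    def dollars(self) -> float:
        return (self.input_tokens * PRICE_PER_1K["input"]
                + self.output_tokens * PRICE_PER_1K["output"]) / 1000


class CostLog:
    """In-memory stand-in for a metrics store, grouped by feature."""

    def __init__(self, spike_factor: float = 2.0, min_history: int = 5):
        self.spike_factor = spike_factor
        self.min_history = min_history
        self.by_feature: dict[str, list[float]] = defaultdict(list)

    def record(self, cost: RequestCost) -> bool:
        """Store the cost; return True if this request should raise an alert."""
        history = self.by_feature[cost.feature]
        is_anomaly = (
            len(history) >= self.min_history
            and cost.dollars > self.spike_factor * mean(history)
        )
        history.append(cost.dollars)
        return is_anomaly


log = CostLog()
for _ in range(5):
    log.record(RequestCost("u1", "chat", "acme", input_tokens=1000, output_tokens=200))

# A request with a much larger context than the feature was scoped for.
alert = log.record(RequestCost("u2", "chat", "acme", input_tokens=9000, output_tokens=800))
print(alert)
```

Grouping by feature is only one choice; the same tags make it possible to aggregate by user or tenant when answering the per-user, per-month question.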

This gets more important, not less, as you scale. We've seen teams whose AI feature shipped at a 15% gross margin and quietly dropped to a negative margin within a quarter because nobody instrumented it. That's an engineering failure dressed up as an infrastructure surprise.

Minimum viable observability: log every request, score every response, alert on drift and cost anomalies. If you can't answer "what is this feature costing us per user per month" without opening a bill, you don't have observability yet.
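
One way to approach the sampling and drift pieces, sketched under assumptions: requests are chosen deterministically by hashing a request id, and an alert fires when the mean score over a rolling window falls below a baseline by more than a tolerance. The names (in_sample, DriftMonitor), the window size, and the thresholds are hypothetical, and the scores would come from whatever scoring you already run.

```python
import hashlib
from collections import deque


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically pick roughly `rate` of requests by hashing the id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


class DriftMonitor:
    """Alert when the rolling mean score drops too far below a baseline."""

    def __init__(self, baseline: float, tolerance: float, window: int = 50):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores: deque[float] = deque(maxlen=window)

    def add(self, score: float) -> bool:
        """Record a score; return True once a full window has drifted."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data to judge yet
        rolling_mean = sum(self.scores) / len(self.scores)
        return rolling_mean < self.baseline - self.tolerance


monitor = DriftMonitor(baseline=0.9, tolerance=0.05, window=4)
healthy = [monitor.add(s) for s in (0.90, 0.92, 0.88, 0.90)]
drifting = [monitor.add(s) for s in (0.60, 0.60, 0.60, 0.60)]
print(healthy, drifting)
```

Hashing the id, rather than calling a random number generator, means the same request is always in or out of the sample, which makes spot checks reproducible.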

Version pinning, and why you'll regret not doing it

Foundation model providers update their models. Sometimes they tell you, sometimes they don't. The same prompt that worked yesterday can produce materially different output tomorrow. If you want predictability, you need to pin versions.

This is a trade-off. Pinned models age, and their newer versions are usually better and cheaper. But uncontrolled upgrades introduce variance that your eval harness has to catch. The right pattern is explicit: pin versions in config, upgrade on your schedule, run the eval harness against the new version before promoting it, and always keep a fallback to the prior version available.
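
The promotion pattern can be sketched roughly as follows. The model identifiers are invented placeholders, and eval_score stands in for a run of your evaluation harness; the only idea being illustrated is that a candidate version replaces the pin when it scores at least as well, and the previous pin is kept as the fallback.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelConfig:
    """Pinned versions as they might appear in source-controlled config."""
    pinned: str
    fallback: str


def promote(config: ModelConfig, candidate: str,
            eval_score: Callable[[str], float], min_gain: float = 0.0) -> ModelConfig:
    """Pin `candidate` only if it scores at least `min_gain` above the current pin.

    On promotion the old pin becomes the fallback; otherwise the config
    is returned unchanged.
    """
    if eval_score(candidate) >= eval_score(config.pinned) + min_gain:
        return ModelConfig(pinned=candidate, fallback=config.pinned)
    return config


# Hypothetical eval results keyed by made-up version identifiers.
SCORES = {
    "example-model-2024-01": 0.82,
    "example-model-2024-06": 0.88,
    "example-model-2024-09": 0.79,
}

config = ModelConfig(pinned="example-model-2024-01", fallback="example-model-2023-10")
config = promote(config, "example-model-2024-06", SCORES.__getitem__)    # promoted
rejected = promote(config, "example-model-2024-09", SCORES.__getitem__)  # unchanged
print(config, rejected == config)
```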

This is also why engineering belongs in the discovery room. The team that scopes "we'll use the best model" without considering version lifecycle is scoping future pain without knowing it.

The stack that keeps LLMs honest

There's no magic framework that solves this. The teams that do it well assemble a small set of boring, well-understood tools:

Prompt management: versioned, source-controlled prompts. Not scattered strings in a handful of files.
Evaluation tooling: something like a spreadsheet plus a script, or a dedicated platform like Promptfoo, LangSmith, or Braintrust. The platform matters less than the discipline.
Observability: request logging, response scoring, cost per request, latency distributions. Real dashboards, not an exported CSV.
Feature flags: gate LLM features behind flags so you can roll back without a deploy.
Fallbacks: cheaper or smaller model as a circuit breaker, and graceful degradation when the primary model is unavailable.
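
The last two items can be combined in a small wrapper, sketched here under assumptions: a boolean stands in for a feature-flag lookup, and primary and fallback are hypothetical callables wrapping the larger and smaller model. The breaker is deliberately simplified; it opens after consecutive failures and has no timed reset, which a production version would likely want.

```python
from typing import Callable


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures of the primary model."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.max_failures

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1


def answer(prompt: str, *, flag_enabled: bool,
           primary: Callable[[str], str], fallback: Callable[[str], str],
           breaker: CircuitBreaker) -> str:
    """Route a request through the flag, the primary model, then the fallback."""
    if not flag_enabled:
        return "This feature is temporarily unavailable."
    if not breaker.open:
        try:
            result = primary(prompt)
            breaker.record(True)
            return result
        except Exception:
            breaker.record(False)
    try:
        return fallback(prompt)
    except Exception:
        # Graceful degradation when both models are unavailable.
        return "Sorry, we can't answer that right now."


calls = {"primary": 0}


def flaky_primary(prompt: str) -> str:
    """Stand-in for a primary model that is currently down."""
    calls["primary"] += 1
    raise TimeoutError("primary model unavailable")


def small_model(prompt: str) -> str:
    """Stand-in for a cheaper or smaller fallback model."""
    return f"[small] {prompt}"


breaker = CircuitBreaker(max_failures=3)
replies = [
    answer("hi", flag_enabled=True, primary=flaky_primary,
           fallback=small_model, breaker=breaker)
    for _ in range(5)
]
print(replies, calls["primary"], breaker.open)
```

After the third failure the breaker stops sending traffic to the primary model, so the remaining requests go straight to the fallback without paying for another timeout.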

None of this is novel. It's the same discipline you'd apply to a payment pipeline or a search system. The only reason it feels new is that LLMs arrived so fast that a lot of teams skipped the engineering basics.

It's a discipline, not a vibe check

The LLM features that still work eighteen months from now will be the ones whose teams set up the plumbing before they launched. Not because they were more cautious, but because they were honest about what was going to happen: the model changes, the users change, and the cost creeps.

The teams that ship demos and skip the plumbing are going to spend the next year on rework. The teams that shipped slightly later with evals, monitoring, and version pinning will be shipping the second and third features while the first team is still trying to figure out why the chatbot broke. That gap widens every quarter.

This is also why team shape matters. Embedded senior engineers who can hold both the product context and the infrastructure context are the ones who build features that age well. When you're thinking about how to scale this kind of work, it's worth looking at how nearshore engineering teams compare to offshore for work that demands this level of ownership.

Metova Editorial

Metova Editorial is the shared byline for pieces written and reviewed by our engineering, design, and product leadership. Twenty years of building software that ships, distilled into the notes we wish someone had written for us.