Why I Built the Same Pipeline Twice
The cost of dependence isn't on the invoice.
I had a working pipeline.
Tested against real production data. Accuracy validated. Costs understood. Deployment proven. A pipeline that did exactly what it was designed to do, reliably, on a commercial LLM I trusted.
Then I spent several weeks rebuilding it on a completely different model running on my own hardware.
Not because the original didn’t work. Not because the cost was unsustainable. Not because anyone asked me to.
Because depending on a single provider for a core capability is a decision that compounds in ways you don’t feel until it’s too late.
The Case That Didn't Need Making
Let me be direct about the original pipeline. It worked well. Classification accuracy was 75.3%. The cost per hundred records was under a dollar. The integration was clean, the output was structured, the failure modes were understood.
By every reasonable measure, the rational decision was to leave it alone.
This is exactly the kind of situation where experienced engineers make their worst decisions — not by choosing badly, but by not choosing at all. The system works, so you stop questioning the architecture. The provider is reliable, so you stop thinking about alternatives. The cost is acceptable, so you stop asking what “acceptable” means when it compounds over years.
I’ve spent thirty years building software. The most expensive technical decisions I’ve witnessed were never the ones that failed dramatically. They were the ones that worked well enough to never get reconsidered.
What Dependence Actually Looks Like
When you build on a single provider’s API, you accept a set of constraints that are easy to ignore because they’re not on the invoice.
You accept their pricing trajectory. Today’s cost is known. Next year’s cost is their decision, not yours. The pricing power asymmetry grows with your integration depth — the more you build around their API, the more leverage they have on price.
You accept their availability as your availability. If their service degrades, your pipeline degrades. If their model version changes, your output changes. You didn’t make either decision, but you absorb both consequences.
You accept their roadmap as your roadmap. If they deprecate the model version you’ve tuned your prompts against, you re-engineer. If they introduce rate limits that conflict with your batch processing pattern, you adapt. Your engineering calendar partially belongs to someone else.
None of these are hypothetical risks. Every team that has built production systems on cloud APIs has experienced at least one of them. Most have experienced all three.
The question isn’t whether the provider is good. The question is whether your architecture should have a single point of failure that you don’t control.
The Rebuild
I chose an open-source model — a 32-billion parameter model running locally on consumer hardware. Not a toy. A real model, with real capabilities, that I could run without an internet connection, without an API key, and without anyone else’s permission.
The translation wasn’t trivial.
The context window dropped from generous to roughly 8,000 usable tokens. My prompt engineering, which had grown comfortably verbose over weeks of iteration, had to be rebuilt from first principles. The retrieval layer — which barely existed in the original because the context window was large enough to brute-force the taxonomy — had to be designed, built, and tuned from scratch.
Every one of those forced changes made the pipeline better.
The retrieval layer, born from necessity, reduced wasted context by 94%. The tighter prompts produced more consistent output. The smaller batch sizes, dictated by the context constraint, actually improved per-record accuracy because the model had less noise to navigate.
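To make the retrieval idea concrete, here is a minimal sketch of the pattern: instead of pasting the entire taxonomy into the prompt, score each category against the record and shortlist only the most relevant few. The scoring function, the taxonomy shape, and all names here are illustrative assumptions, not the actual implementation; a real version would likely use embeddings rather than lexical overlap.

```python
def score(record: str, category_desc: str) -> float:
    """Crude lexical-overlap relevance between a record and a category description."""
    r = set(record.lower().split())
    c = set(category_desc.lower().split())
    return len(r & c) / (len(c) or 1)

def retrieve(record: str, taxonomy: dict[str, str], k: int = 5) -> list[str]:
    """Return the k category names most relevant to this record."""
    ranked = sorted(taxonomy, key=lambda name: score(record, taxonomy[name]), reverse=True)
    return ranked[:k]

def build_prompt(record: str, taxonomy: dict[str, str], k: int = 5) -> str:
    """Lean classification prompt: only the shortlisted categories enter the context."""
    lines = [f"- {name}: {taxonomy[name]}" for name in retrieve(record, taxonomy, k)]
    return (
        "Classify the record into exactly one category.\n"
        "Categories:\n" + "\n".join(lines) +
        f"\n\nRecord: {record}\nAnswer with the category name only."
    )
```

The point of the pattern, whatever the scoring mechanism: the prompt grows with k, not with the size of the taxonomy, which is what makes an 8,000-token context workable.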
Final accuracy: 66% combined, against the original’s 75.3%.
The Nine-Point Gap
Let’s be honest about that gap. Nine percentage points is real. In a production context, it means roughly nine more records out of every hundred need human review.
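Operationally, that review burden is usually handled with a confidence threshold: predictions the model is unsure about get queued for a human. A hypothetical sketch, assuming the pipeline produces some calibrated confidence score per record (field names and the 0.8 threshold are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    record_id: str
    label: str
    confidence: float  # assumed calibrated score in [0, 1]

def route(preds: list[Prediction], threshold: float = 0.8):
    """Split predictions into an auto-accept queue and a human-review queue."""
    auto = [p for p in preds if p.confidence >= threshold]
    review = [p for p in preds if p.confidence < threshold]
    return auto, review
```

Lowering accuracy shifts records from the first queue to the second; the nine-point gap is the size of that shift, roughly nine records per hundred.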
But look at what sits on the other side of that gap:
Zero marginal cost per inference call. I could run a thousand experiments without watching a billing dashboard. I could throw wild retrieval strategies at the problem, test edge cases exhaustively, iterate on prompts without calculating whether the learning was worth the spend. When the cost of experimentation is zero, you experiment differently — more freely, more creatively, more aggressively.
Complete architectural independence. No API changes, no deprecation notices, no rate limits, no terms-of-service changes. The model runs on my hardware, on my schedule, with my configuration. Every decision about how the pipeline operates is mine to make.
Verifiable sovereignty. The data never leaves my machine. For any use case involving sensitive records — and most enterprise data qualifies — this isn’t a feature. It’s a requirement that most teams satisfy with legal agreements rather than architecture. Architecture is more honest.
And — the part that mattered most for the engineering — the constraint of the local environment produced a better-designed system. The retrieval layer, the prompt efficiency, the context management — all of it was forced by the rebuild, and all of it improved the pipeline regardless of which model sat at the center.
Why the Numbers Don't Tell the Full Story
The 75% versus 66% comparison is accurate and misleading at the same time.
It’s accurate because the commercial model did produce higher classification accuracy on the same test set. Raw capability matters, and a larger model with a larger context window doing less-constrained inference will, on average, outperform a smaller model working within tighter boundaries.
It’s misleading because it implies you have to choose one.
The architecture I ended up with — a well-engineered retrieval layer feeding a lean classification prompt — is model-agnostic. The same pipeline design works with the local model, with the commercial API, or with whatever model emerges next year that none of us have heard of yet. Swap the model, keep the architecture, retune the prompts. The engineering investment is preserved regardless of where the inference happens.
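The "swap the model, keep the architecture" claim can be sketched as a narrow interface that the pipeline depends on, with each backend behind it. This is a structural illustration under assumed names, not the actual codebase; the placeholder classes stand in for a local inference call and a hosted API client.

```python
from typing import Protocol

class Completer(Protocol):
    """The only surface the pipeline is allowed to know about."""
    def complete(self, prompt: str) -> str: ...

class LocalModel:
    """Placeholder for a locally hosted model (e.g. served by llama.cpp or Ollama)."""
    def complete(self, prompt: str) -> str:
        return "local-answer"  # stand-in for a real inference call

class CommercialAPI:
    """Placeholder for a hosted API client."""
    def complete(self, prompt: str) -> str:
        return "api-answer"  # stand-in for a real HTTP call

def classify(record: str, backend: Completer) -> str:
    """Retrieval, prompt construction, and parsing live here; inference is pluggable."""
    prompt = f"Classify: {record}"
    return backend.complete(prompt)
```

Swapping providers then means constructing a different backend object and retuning prompts, not re-engineering the pipeline.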
That portability is the point.
Building the pipeline twice wasn’t about finding the cheaper option. It was about building an architecture that doesn’t care which option you choose — because the option you choose today won’t be the option you want in two years, and the cost of re-engineering from scratch is always higher than the cost of building portably in the first place.
The Pattern Behind the Decision
I keep coming back to a principle that has guided most of the important architectural decisions in my career:
Build for the constraints you’ll face, not the resources you have.
When resources are abundant — generous context windows, cheap API calls, reliable providers — it’s easy to build architectures that assume the abundance is permanent. It never is. Pricing changes. Providers pivot. Models get deprecated. The constraints always arrive eventually; the only question is whether your architecture was designed to absorb them or collapse under them.
Building the pipeline twice was the specific expression of a general principle: if a core capability depends on a single external decision you don’t control, the architecture has a flaw that doesn’t show up in today’s performance metrics but will show up in next year’s engineering calendar.
The first pipeline was built for today’s resources. The second was built for tomorrow’s constraints.
I’ll use both. But I’ll sleep better knowing that the one I control exists.

