Build vs. buy: The real token economics of agent observability
I talk to engineering teams from different organizations every day, and many of them are running the same mental model when it comes to large technology projects: build first, then buy when it breaks.
For most infrastructure decisions, this is fine; in fact, it is often the most rational approach. You understand your requirements better after building something yourself, and early-stage problems often don’t justify the overhead of a vendor relationship.
The democratization of AI has further complicated this question. Tools like Claude Code and Cursor have completely changed the state of engineering, making it so that anyone can essentially build any type of SaaS alternative quickly and easily… right?
Well, as many teams are finding out, building, deploying, and maintaining specialized software with AI assistance is not always as easy as it looks. What gets talked about far less is that doing so can also be incredibly inefficient from a cost perspective.
As we hear more and more conversations around token economics in the industry – particularly as enterprises massively scale the number of agents they’re using – the cost question cannot be ignored.
I want to take a look at this issue through the lens of agent observability and walk through why the economics break down in a non-obvious and systematically underestimated way.
The short version: building observability on top of general-purpose LLMs without purpose-built tooling is a token spend problem masquerading as an engineering problem. And it compounds.
The hidden variable in every build decision
When teams decide to build their own observability layer — using Claude, GPT-4, or another frontier model as the evaluation and monitoring brain — they’re making an implicit assumption: that the token cost of running that system is manageable and predictable.
The truth is that it’s neither.
Here’s an example of what a typical scenario looks like:
A team spins up a DIY observability stack using an LLM to evaluate outputs, check for drift, and run evals on agent behavior. Early on it looks cheap, with very manageable token volumes as the system is running on a subset of traffic. The team is optimistic.
The agent fleet inevitably scales, sometimes exponentially. Many organizations now have agents building child agents for sub-workflows, many times over. Token costs scale right alongside the rapid growth of the overall agentic system.
A multi-turn conversation that starts at 5K tokens might hit 40K+ by turn 10. That is a fundamental property of how context windows accumulate. If your observability system is also token-hungry, you’re burning on both sides of the equation simultaneously.
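To make the accumulation concrete, here is a minimal sketch of the 5K-to-40K example above. It assumes the context grows by a constant amount each turn and that an LLM evaluator re-reads the full context once per turn; both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
def context_per_turn(start_tokens: int, end_tokens: int, turns: int) -> list[int]:
    """Context size at each turn, assuming constant growth per turn (an assumption)."""
    step = (end_tokens - start_tokens) / (turns - 1)
    return [round(start_tokens + step * i) for i in range(turns)]


sizes = context_per_turn(5_000, 40_000, 10)

# The agent re-processes the whole accumulated context on every turn.
agent_tokens = sum(sizes)
# Assumed: a per-turn LLM evaluator reads that same context one more time.
eval_tokens = sum(sizes)

print(sizes[0], sizes[-1])         # 5000 40000
print(agent_tokens)                # 225000
print(agent_tokens + eval_tokens)  # 450000
```

Under these assumptions, a conversation that tops out at 40K tokens has processed 225K tokens on the agent side by turn 10, and a token-hungry evaluator doubles that.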
The data we’re seeing from teams who’ve built DIY observability stacks tells a consistent story: token costs on the observability layer alone run 4 to 6 times higher than what they’d pay for purpose-built tooling that’s been optimized for this specific use case. That is the situation today, but I can foresee these multiples getting higher tomorrow.
This gap exists because purpose-built observability systems are designed around the actual telemetry structure of agent workflows — traces, spans, tool call chains — rather than passing full context windows to a general-purpose LLM and asking it to evaluate what happened. The information density per token is fundamentally different.
Three real numbers
Let me ground this in concrete examples that might reflect the numbers a realistic company could be dealing with.
The scale problem: 4-6x token cost differential
A mid-sized data organization — multiple agent workflows running in production, modest but growing scale — decides to build their observability layer in-house. Based on my calculations and some standard assumptions, their annual token bill on the observability stack alone would run to roughly $120K-$180K once they accounted for the full eval workload.
The assumptions:
- 5 agent workflows running in production, each handling roughly 200 interactions per day
- Average context window per interaction: 8K tokens (input + output combined), growing as multi-turn conversations accumulate
- Eval coverage: 100% — every interaction is passed to the LLM evaluator to check output quality and behavioral compliance
- DIY multiplier: 4x — the evaluator re-processes the full interaction context, adding roughly the same token volume again per eval pass; in practice teams run multiple eval dimensions (quality, safety, drift), pushing this higher
- Model pricing: ~$3/million tokens (mid-tier frontier model, blended input/output rate)
By contrast, the Monte Carlo equivalent for a company at this exact same scale runs approximately $30K annually, all-in including our own pass-through cost.
The difference is significant: the purpose-built option comes to one-fourth to one-sixth of the cost for (arguably better) observability.
Spending that much more on observability alone can easily eat the budget for building more agents, or get the observability budget itself cut, at which point you have agents in production with no visibility into whether they’re working correctly.
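For readers who want to plug in their own numbers, the DIY side of this comparison can be sketched as a single function over the assumptions listed above. The function and parameter names are illustrative. Note that with a flat 8K-token context and a 4x multiplier, the formula gives about $35K a year; it moves into the $120K-$180K range only as average context and eval passes grow, which the assumptions say happens in practice. The second call uses hypothetical values (24K average context, 6x multiplier) purely to show that sensitivity.

```python
def annual_eval_cost(
    workflows: int,
    interactions_per_day: int,
    avg_context_tokens: int,
    diy_multiplier: float,
    usd_per_million_tokens: float,
    days: int = 365,
) -> float:
    """Annual token spend in USD for an LLM-based eval layer (simple linear model)."""
    tokens = (
        workflows * interactions_per_day * days
        * avg_context_tokens * diy_multiplier
    )
    return tokens / 1_000_000 * usd_per_million_tokens


# Listed assumptions taken at face value: flat 8K context, 4x multiplier.
flat = annual_eval_cost(5, 200, 8_000, 4, 3.0)

# Hypothetical: contexts average 24K once multi-turn growth is included,
# and extra eval dimensions push the multiplier to 6x.
grown = annual_eval_cost(5, 200, 24_000, 6, 3.0)

print(f"${flat:,.0f}")   # $35,040
print(f"${grown:,.0f}")  # $157,680
```

The model is deliberately linear, so doubling either the average context or the number of eval passes doubles the bill.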
The build cost trap: $30K/month and climbing
Let’s talk about another cost scenario that unexpectedly emerges as orgs agentify quickly: the build cost trap.
Say an engineering team spends six months building three agents within an AI-assisted development environment. By the time the agents are in production, their observability token spend has hit $30K per month.
They had plans to ship 15 agents. If token spend scales linearly with agent count, the simple math puts them at $150K per month in observability infrastructure costs alone, before they’ve written a line of application code for agents 4 through 15.
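The linear projection above is simple enough to write down directly. This sketch assumes, as the scenario does, that spend per agent stays constant as the fleet grows; the function name is illustrative.

```python
def projected_monthly_spend(
    current_spend_usd: float, current_agents: int, target_agents: int
) -> float:
    """Scale today's observability spend linearly to a target agent count."""
    per_agent = current_spend_usd / current_agents
    return per_agent * target_agents


for agents in (3, 5, 10, 15):
    spend = projected_monthly_spend(30_000, 3, agents)
    print(f"{agents:>2} agents: ${spend:,.0f}/month")
```

At 15 agents this gives $150,000 per month. If context windows also grow as agents get more complex, linear scaling is the optimistic case.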
A team like this might come to Monte Carlo not because their DIY approach does not work technically, but because the cost trajectory makes it economically impossible to scale their product ambitions.
The build decision looks cheap at agent #1, but it becomes a cost ceiling pretty quickly.
The speed problem: 1.5 minutes vs. “where did this loop start?”
Lastly, let’s look at a bottleneck that isn’t just about raw token cost, but the actual speed of troubleshooting production failures.
A capable engineering team building DIY observability isn’t just pasting text into a raw LLM prompt; they’ll usually stand up a basic dashboard using OpenTelemetry or an ELK stack to track logs. But agents don’t behave like linear, deterministic software. They run asynchronous loops, spawn sub-agents, call external tools, and self-correct.
When an agent returns a broken or harmful output to a customer, a DIY team has to manually reconstruct the crime scene. An engineer has to open up multiple logs, stitch together the spans across disjointed execution steps, and try to guess where the reasoning went off the rails. Based on our observations, this manual trace reconstruction takes an average of 10 to 15 minutes per incident—assuming the on-call engineer intimately understands the agent’s architecture.
Compare this to a purpose-built Troubleshooting Agent, like Monte Carlo’s, which can automatically ingest the telemetry structure, map the asynchronous dependencies, and isolate the exact root cause—down to the specific tool call or corrupted data vector—in 1.5 minutes.
At a single-incident level, this looks like a minor efficiency gain. But when you scale to a fleet of 50 agents running thousands of parallel turns, those 15-minute manual investigations compound into massive production exposure, hours of wasted engineering time, and a soaring token bill from agents continuing to misbehave while you hunt for the bug.
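To show what "stitching together the spans" involves, here is a minimal sketch of one heuristic: given a flat list of spans with parent links, find the earliest failed span that has no failed children and walk back up to the root. The `Span` fields and the heuristic itself are assumptions for illustration, not a description of how any particular product works.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    ok: bool


def first_failure_path(spans: list[Span]) -> list[str]:
    """Names from the root down to the earliest failed span with no failed children."""
    by_id = {s.span_id: s for s in spans}
    failed = [s for s in spans if not s.ok]
    # A failed span whose child also failed is probably just propagating the error.
    parents_of_failed = {s.parent_id for s in failed if s.parent_id}
    origins = [s for s in failed if s.span_id not in parents_of_failed]
    if not origins:
        return []
    current: Optional[Span] = min(origins, key=lambda s: s.start_ms)
    path = []
    while current is not None:
        path.append(current.name)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(path))


spans = [
    Span("1", None, "agent_run", 0, False),
    Span("2", "1", "plan", 5, True),
    Span("3", "1", "sub_agent", 40, False),
    Span("4", "3", "tool:search_orders", 55, False),
    Span("5", "1", "respond", 90, True),
]
print(first_failure_path(spans))
# ['agent_run', 'sub_agent', 'tool:search_orders']
```

Doing this by hand across several log sources is where the 10 to 15 minutes goes; with parent links captured at instrumentation time it becomes a lookup.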
Why DIY observability is structurally expensive
Understanding what causes the build vs. buy cost differential matters for how you think about which choice is right for your team.
Purpose-built observability for AI agents is designed around the telemetry that agents actually produce: traces, spans, tool call chains, latency signals, token counts at the individual step level. When you instrument an agent properly, you get structured telemetry that tells you exactly where the system spent its tokens, which steps failed, and why — without having to re-run the inference.
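As a small illustration of what step-level token counts make possible, the sketch below aggregates tokens by step type from span records. The record shape (`session`, `step`, `tokens`) is a made-up example, not a standard schema.

```python
from collections import defaultdict

# Hypothetical span records captured at instrumentation time.
span_records = [
    {"session": "s1", "step": "planning", "tokens": 8_000},
    {"session": "s1", "step": "tool_call", "tokens": 1_200},
    {"session": "s1", "step": "response", "tokens": 800},
]


def token_share_by_step(records: list[dict]) -> dict[str, float]:
    """Fraction of total tokens spent in each step type."""
    totals: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record["step"]] += record["tokens"]
    grand_total = sum(totals.values())
    return {step: count / grand_total for step, count in totals.items()}


shares = token_share_by_step(span_records)
for step, share in shares.items():
    print(f"{step}: {share:.0%}")
```

No model call is needed to answer "where did the tokens go?"; it is a plain aggregation over data that already exists.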
DIY observability almost always ends up in one of two failure modes:
The brute-force LLM evaluator. You pass the full conversation context, sometimes multiple turns deep, to a frontier model and ask it to evaluate whether the agent behaved correctly. This works, but it’s also extraordinarily token-intensive, because you’re paying to process a large context window every time you want to check something. And, as context windows grow across turns, costs climb faster than the turn count, because every check re-processes the accumulated history.
The underinstrumented fallback. Token costs get too high, so you are forced to reduce evaluation coverage. You sample 10% of interactions and only evaluate the expensive steps. You do a human review before release instead of automated monitoring in production. This is the path most DIY builds eventually take — and it’s the path that leaves you flying blind. Per our own research, 63% of teams currently rely primarily on human review to determine whether AI outputs are working correctly. That is not a scalable answer at fleet scale.
The irony of DIY observability: you build it to get visibility. Then the cost of running it forces you to reduce coverage. You end up with less visibility than you started with.
Purpose-built systems sidestep this because the evaluation logic is pre-optimized for the telemetry structure. You’re not asking a general-purpose LLM to understand what happened, you’re querying structured data that was captured at instrumentation time, with purpose-built evaluation models that are tuned for this specific domain.
The build vs. buy calculus
I’m not going to advise you that the answer is always buy. For small teams with one or two agents in production, building something lightweight is reasonable. You’ll learn what you actually need, and the token volumes will be manageable.
But there’s a threshold, and it comes earlier than most teams expect.

The inflection point is usually around the 5-agent mark, and it hits for three compounding reasons: token costs start showing up in budget reviews; the engineering team’s maintenance burden on the observability stack starts competing with shipping new agents; and coverage gaps start producing real incidents that take too long to investigate.
By the time a team recognizes all three problems simultaneously, they’ve usually spent several months and meaningful engineering resources on a system they’re about to replace.
What you should actually be monitoring
One more reason DIY observability gets expensive: teams often don’t know what to monitor until they’ve seen things go wrong. So they monitor everything, which is fine as a learning exercise, but devastating for token budgets when implemented at scale.
Based on what we’ve seen in production across hundreds of agent deployments, the signals that actually matter are:
Token count per span, not per session. Session-level token counts hide where the problem is. A planning step that’s consuming 80% of your tokens looks fine at the session level. Span-level visibility is how you tell the difference between a planning failure, a tool call failure, and a response generation failure. These are three different root causes with three different fixes.
Cost trend over time, not just point-in-time. A single expensive session is noise. A cost trend that’s climbing 15% week-over-week is a signal. Automated cost monitors that flag anomalies before they show up in your invoice are the difference between catching drift early and explaining a surprise budget overrun.
Behavioral drift, not just output quality. An agent can produce correct outputs while drifting from its intended workflow — taking extra steps, calling tools unnecessarily, and accumulating context in ways that inflate future costs. Output quality evals alone miss this, but trace-level behavioral monitoring catches it.
Rollback readiness. Nearly a third of organizations in our research say they could not disable or roll back a harmful AI agent within minutes. The ability to intervene quickly is itself a cost control mechanism; every minute a misbehaving agent runs in production is a minute of wasted tokens and potential downstream data damage.
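Of these signals, the cost trend is the easiest to sketch. The example below flags weeks where cost has grown by at least 15% week-over-week for two weeks in a row, so a single expensive week is treated as noise. The two-week rule, the function name, and the sample figures are assumptions for illustration.

```python
def weekly_growth_alerts(
    weekly_cost_usd: list[float],
    threshold: float = 0.15,
    min_consecutive: int = 2,
) -> list[int]:
    """Indices of weeks that extend a run of >= threshold week-over-week growth."""
    alerts = []
    streak = 0
    for i in range(1, len(weekly_cost_usd)):
        prev, cur = weekly_cost_usd[i - 1], weekly_cost_usd[i]
        growth = (cur - prev) / prev if prev > 0 else 0.0
        streak = streak + 1 if growth >= threshold else 0
        if streak >= min_consecutive:
            alerts.append(i)
    return alerts


# Week 2 is a one-off spike; weeks 4 through 6 are a sustained climb.
costs = [1000, 1020, 1400, 1010, 1200, 1400, 1650]
print(weekly_growth_alerts(costs))  # [5, 6]
```

The spike in week 2 never alerts, while the sustained climb alerts from its second week onward, well before it would show up on an invoice.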
The real cost of getting it wrong
Token economics are the most visible symptom of the DIY observability problem. But they’re not always the most serious one.
The more devastating problem is what happens when you have agents in production with inadequate visibility. Our research shows 61% of data leaders report that their established monitoring appeared normal while a critical data issue was actually occurring. In a pipeline world, that might be a bad data incident, but in an agentic world — where agents are making decisions, generating content, and taking actions — that’s a trust problem.
When agents are not monitored, they can quietly go wrong for an extended period of time. The problem is that they don’t throw errors like the deterministic software we are all used to. Instead, they return confident answers based on wrong or outdated data. They accumulate context drift over thousands of interactions until their behavior has shifted meaningfully from what was intended. And because no one is watching at the trace level, no one catches it until a customer or stakeholder notices.
Build vs. buy is ultimately a question about what you’re optimizing for. If you’re optimizing to learn quickly with one or two agents, it’s fine to build. If you’re optimizing to scale a fleet of agents reliably with the cost structure, coverage, and response time to match, the math simply doesn’t favor DIY.