AI Observability Updated Jan 26 2026

How To Fix Slow, Costly AI Agents

AUTHORS | Michael Segner | Elor Arieli

Building an AI agent is hard. Deploying an agent with latency acceptable to your users and a token count acceptable to your CFO is even harder.

Most agent monitoring strategies start with operational performance for good reason. Nothing will crater adoption like the pinwheel of a slow loading system.

Wait, where did all my users go?

These AI systems are emerging technologies with nascent stacks, but they come with enterprise-grade expectations. Monte Carlo has deployed three agents into production with more than 30,000 users. Believe us, we will hear about it if they are slow.

Here is how we are monitoring the operational performance of our agents and the steps we take to debug them when token count and latency become issues.

Step 1: Instrument agent telemetry

Imagine you were trying to figure out how to make your car trips as efficient as possible. You have information that on a certain day you traveled for an hour at 30 miles per gallon. 

Could you have made the trip more quickly and more efficiently? Who knows! You have no context for where you were going, why, and what streets you took to get there.

It’s the same with agents. If you don’t collect and organize the telemetry in a meaningful way, you are flying blind. You can see the LLM calls that are being made, but you have no idea why they were made or which task they belong to.


Instrumenting agent telemetry can be quick and painless. You can even get started with our open source SDK, which leverages the OpenTelemetry framework to bring this telemetry to your environment.

Not only will you see which calls are associated with which workflows, but you can capture other important context such as your inputs, outputs, model version, and much more. 

This is great for visibility and can also be used for deeper monitoring and evaluations (which we will discuss in a future post).
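
As a rough illustration of what "organized" telemetry means, the sketch below uses only the Python standard library to record spans that share a trace ID and carry workflow and task context. It is not the Monte Carlo SDK or the OpenTelemetry API; `record_span`, `SPANS`, the model name, and the token values are hypothetical stand-ins.

```python
import time
import uuid
from contextlib import contextmanager

# In-memory sink for finished spans; a real setup would export these instead.
SPANS = []

@contextmanager
def record_span(trace_id, workflow, task, **attributes):
    """Time a unit of agent work and store it together with its context."""
    span = {"trace_id": trace_id, "workflow": workflow, "task": task, **attributes}
    start = time.perf_counter()
    try:
        yield span  # the caller can attach outputs or token counts here
    finally:
        span["duration_s"] = time.perf_counter() - start
        SPANS.append(span)

# One trace ID ties every span of a single agent run together.
trace_id = uuid.uuid4().hex

with record_span(trace_id, "troubleshooting", "get_data_lineage"):
    pass  # a tool call would run here

with record_span(trace_id, "troubleshooting", "pre_summary_reasoning",
                 model="example-model-v1") as span:
    # An LLM call would run here; the token counts below are made up.
    span["input_tokens"] = 1200
    span["output_tokens"] = 150
```

With the trace ID, workflow, and task on every span, a slow LLM call can be tied back to the run and the step that issued it.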

Step 2: Set KPIs and monitors based on the use case

The next step is to determine what level of performance is acceptable for your agent. It will vary by use case. 

Generally, the more valuable the output or the task being accelerated, the longer users will wait and the more budget will be available. You should also determine whether your agent has relatively stable behavior or deals with highly variable work.

Here are some examples from real teams:

  • Dropbox measures the overall latency of their agent, specifically whether 95% of their responses have a latency of less than 5 seconds.
  • The LinkedIn team monitors the 90th percentile latency of their Hiring Assistant’s runtimes to ensure it doesn’t cross a specific threshold.
  • A revenue operations platform ensures 90% of their responses don’t exceed 5 seconds for time to first token. Time to first token is a common metric because users will often tolerate longer completion times if they can start seeing and digesting the start of a response (ChatGPT isn’t typing, it’s buying time).
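
A percentile target like the ones above can be checked with a few lines of code. This is a minimal sketch using the nearest-rank method; the function names and the latency samples are hypothetical, and a monitoring tool may compute percentiles differently.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def meets_slo(latencies_s, pct, threshold_s):
    """True if the pct-th percentile latency is within the threshold."""
    return percentile(latencies_s, pct) <= threshold_s

# Made-up response latencies in seconds.
latencies = [1.2, 0.8, 2.5, 3.1, 0.9, 4.2, 1.7, 2.2, 6.8, 1.1]

p90_ok = meets_slo(latencies, 90, 5.0)  # p90 is 4.2 s, so the target is met
p95_ok = meets_slo(latencies, 95, 5.0)  # p95 is 6.8 s, so the target is missed
```

The same samples pass a 90th percentile target and fail a 95th percentile one, which is why the percentile you pick matters as much as the threshold.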

Some teams will also define time budgets: hard timeouts and retry limits for both LLM calls and tools to ensure latency stays bounded even when the agent encounters unexpected conditions.
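
One way a time budget might look in code is sketched below with `asyncio`. The helper `call_with_budget` and the two stand-in tools are hypothetical; the timeout and attempt values are placeholders, not recommendations.

```python
import asyncio

async def call_with_budget(make_call, timeout_s, max_attempts):
    """Run an async call with a hard per-attempt timeout and a capped number of attempts."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc  # retry only on timeouts and connection errors
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error

async def fast_tool():
    await asyncio.sleep(0.01)  # stands in for a healthy tool call
    return "done"

async def slow_tool():
    await asyncio.sleep(1.0)  # stands in for a hung tool call
    return "done"

async def main():
    ok = await call_with_budget(fast_tool, timeout_s=0.5, max_attempts=2)
    try:
        await call_with_budget(slow_tool, timeout_s=0.05, max_attempts=2)
        gave_up = False
    except RuntimeError:
        gave_up = True
    return ok, gave_up

result = asyncio.run(main())
```

The worst-case wait for a call is roughly `timeout_s * max_attempts`, so latency stays bounded even when a tool never responds.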

Setting a latency monitor for our agent. Source: Monte Carlo Agent Observability.

If you want to dive deeper into how to measure your agent’s reliability, check out our recent post, “Why you can’t answer ‘how reliable is this agent?’.”

Step 3: Identify and begin exploring outliers

This week the average latency for our Monitoring Agent was about 18 seconds and for our Troubleshooting Agent it was about 36 seconds. 

Not bad, considering manually defining monitors can take up to 5 minutes and troubleshooting an incident takes up to 19 hours on average!

But averages hide the tail, so it is helpful to look at the worst-performing 5 to 10% of runs to identify potential edge cases or initial signs of degradation. For the Troubleshooting Agent, we quickly isolated a few requests that took over 4 minutes to complete.
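
If your traces are available as records, isolating the slowest runs is a simple sort and slice. The sketch below assumes each trace is a dict with hypothetical `trace_id` and `latency_s` fields, and all of the values are made up.

```python
def slowest_fraction(traces, fraction=0.10, key="latency_s"):
    """Return the slowest `fraction` of traces (rounded down, at least one), slowest first."""
    if not traces:
        return []
    count = max(1, int(len(traces) * fraction))
    return sorted(traces, key=lambda t: t[key], reverse=True)[:count]

# Made-up latencies in seconds for ten runs of an agent.
latencies = [36, 18, 250, 22, 31, 41, 12, 27, 245, 33]
traces = [{"trace_id": f"t{i}", "latency_s": s} for i, s in enumerate(latencies, start=1)]

outliers = slowest_fraction(traces, fraction=0.2)
outlier_ids = [t["trace_id"] for t in outliers]  # the two slowest runs: t3 and t9
```

Swapping `key` for a token count field gives the most expensive traces instead of the slowest ones.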

Being able to filter and isolate your longest running or most token consuming traces is very helpful for debugging your agent. Source: Monte Carlo Agent Observability.

Let’s take a closer look at one of these traces!

Step 4: Find the “bad” spans in the trace

Diving deeper into a specific trace. Source: Monte Carlo Agent Observability.

It’s rare that you will find a latency or token issue caused by a series of spans that have taken a bit longer than usual. It’s almost always one or two bad apples. 

In this trace, we can see that most of the agent’s tasks before its first LLM call completed in less than a second. However, the “grab_upstream_field_anomaly.task” span took 77% of the total runtime. All of that time was spent before the first LLM call, which makes this span a likely bottleneck.

Compare this to the “pre_summary_reasoning” task, which took more than 30 seconds. That task is an LLM call processing a considerable amount of information, so some latency is expected.
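
Finding the "bad apple" can be reduced to computing each span's share of the runtime. The sketch below assumes sibling spans that run one after another, so their durations sum to the trace duration. The span names `load_incident_context` and `write_summary` and all of the durations are invented for illustration; only the 77% share mirrors the trace above.

```python
def runtime_shares(spans):
    """Map each span name to its share of the summed span duration."""
    total = sum(s["duration_s"] for s in spans)
    if total == 0:
        return {}
    return {s["name"]: s["duration_s"] / total for s in spans}

def dominant_spans(spans, min_share=0.5):
    """Names of spans that account for at least `min_share` of the runtime."""
    return [name for name, share in runtime_shares(spans).items() if share >= min_share]

# Illustrative durations in seconds, not real measurements.
trace = [
    {"name": "load_incident_context", "duration_s": 0.4},
    {"name": "get_data_lineage", "duration_s": 0.8},
    {"name": "grab_upstream_field_anomaly.task", "duration_s": 185.0},
    {"name": "pre_summary_reasoning", "duration_s": 32.0},
    {"name": "write_summary", "duration_s": 22.0},
]

shares = runtime_shares(trace)
bottlenecks = dominant_spans(trace)
```

Here `bottlenecks` contains only `grab_upstream_field_anomaly.task`, at about 77% of the runtime. For nested or parallel spans, you would compare against the parent span's duration instead.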

Step 5: Look at similarities and differences in the telemetry

When there is a performance issue, it often boils down to a failure in one of these three areas:

  • Data (context engineering): You have provided too much or too little information to the agent. This can often be determined by looking at the inputs, outputs, and associated token count. If the token count is high while tokens per second (throughput) is average or high, this is a good indicator that you are overwhelming your agent with context. Consider strategies such as batching, sampling, or compaction (but be aware of the tradeoffs).
  • System (infrastructure): You have a bottleneck that needs to be optimized or a tool call that is failing. Long latency paired with low token counts suggests that your requests are being queued or throttled if you are using a cloud provider, or that your GPU clusters are not provisioned correctly if you are self-hosting.

Reviewing the agent graph can be helpful to determine if there are certain tasks that are holding up others that can be run in parallel.

Failure codes can indicate failed tool calls. If a certain model is always associated with outlier traces, you may need to pick a smaller, faster model.

  • Code (agent framework, prompts): These issues can be difficult to detect and diagnose. Sometimes you get lucky and spot a small change in the system prompt or agent code that corresponds with performance degradation. You should also be on the lookout for specific trajectories: which tool call comes after which span, and where did the error occur?
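
The signals described in this list can be turned into a first-pass triage rule. The sketch below is only a heuristic: the function name, the labels, and the 2x and 0.8 thresholds are assumptions chosen for illustration, and the baselines would come from your own typical spans.

```python
def triage_span(latency_s, total_tokens, baseline_latency_s, baseline_tokens, error_code=None):
    """Suggest which area (data, system, code) to inspect first for one span."""
    if error_code is not None:
        return "system: failed tool call"
    if latency_s <= 2 * baseline_latency_s:
        return "ok"
    tokens_per_s = total_tokens / latency_s
    baseline_tokens_per_s = baseline_tokens / baseline_latency_s
    # Many tokens at normal throughput: the model is simply processing a lot of context.
    if total_tokens > 2 * baseline_tokens and tokens_per_s >= 0.8 * baseline_tokens_per_s:
        return "data: likely too much context"
    # Slow despite few tokens: the time is probably spent waiting, not generating.
    if total_tokens <= baseline_tokens:
        return "system: possible queueing or throttling"
    return "code: inspect prompt and trajectory"

# Baseline for this hypothetical span: 20 s and 10,000 tokens.
overloaded = triage_span(60, 40_000, 20, 10_000)
throttled = triage_span(90, 8_000, 20, 10_000)
failed = triage_span(5, 100, 20, 10_000, error_code="403")
healthy = triage_span(18, 9_000, 20, 10_000)
```

A rule like this will not find the root cause, but it can help sort a batch of outlier spans before you read individual traces.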

Most of the time, you run into an edge case that you didn’t anticipate. With non-deterministic systems and users providing requests in open-ended chat boxes, edge cases happen frighteningly often. This is one of the reasons agent observability is so crucial.

In this specific case, when you dive into this week’s high latency traces for the Troubleshooting Agent, you notice a few things (some of which we unfortunately have to blur out):

  • All of the traces relate to the same customer;
  • The get data lineage task also takes longer than normal, indicating a complex environment;
  • Most of the abnormally long spans (but not all) are the same task; and
  • All of the outputs of the abnormally long spans are identical, even when the task is not: “No access to X.”

Essentially, the agent is struggling to complete the task because this particular customer has turned off a necessary permission, and the agent keeps attempting to continue its task in a futile loop.

Looks like we have a permissions issue creating a loop! Source: Monte Carlo Agent Observability.

So we have an edge case. This could be a case where we need to more closely define retry limits for our agent.
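
One possible guard against this kind of loop is to stop when consecutive steps keep returning the same output. The sketch below is not how our agents are implemented; `run_with_loop_guard` and the stand-in steps are hypothetical, and the "No access to X." string mirrors the output seen in the trace.

```python
def run_with_loop_guard(steps, max_repeats=2):
    """Run steps in order, stopping once the same output repeats `max_repeats` times in a row."""
    history = []
    repeats = 0
    for step in steps:
        output = step()
        if history and output == history[-1]:
            repeats += 1
        else:
            repeats = 0
        history.append(output)
        if repeats >= max_repeats:
            return history, "aborted: repeated output"
    return history, "completed"

# Stand-ins for agent steps; a missing permission makes each attempt fail the same way.
steps = [
    lambda: "lineage ok",
    lambda: "No access to X.",
    lambda: "No access to X.",
    lambda: "No access to X.",
    lambda: "never reached",
]

history, status = run_with_loop_guard(steps)
```

The run stops after the third identical "No access to X." output, so the final step never executes. Comparing outputs rather than task names matters here because the looping spans were not all the same task.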

Next: Debugging agents for poor outputs

Debugging agent performance is hard. Even with strong telemetry, understanding why an agent is slow or expensive requires digging through traces, spans, and edge cases. As we’ve shown, agent observability is the foundation that makes this kind of debugging possible at all.

But as challenging as it is, debugging agents when their output is a hallucination or not fit for use is even harder. 

That problem deserves its own deep dive. Be sure to follow our channels for the next post!

Learn more about Agent Observability at Monte Carlo here.


Frequently Asked Questions

How to improve AI agent performance?

Start by instrumenting agent telemetry to collect detailed data. Set clear KPIs and monitors based on your use case. Regularly look for outliers in performance data, identify bottlenecks in traces, and compare telemetry for patterns. Address issues in data, infrastructure, or code.

How to make AI agent faster?

Collect and organize telemetry to understand where delays occur. Set latency targets and monitor against them. Find and fix slow spans or bottleneck tasks in your agent’s workflow. Optimize data handling, infrastructure, and agent code as needed.

Why is my AI agent so slow?

Common reasons include bottlenecks in specific tasks, providing too much or too little context, infrastructure issues, or inefficient code. Sometimes one task or span takes up most of the runtime. Reviewing telemetry helps pinpoint the exact cause.