AI Observability Updated Jun 01 2026

Axios at Snowflake Summit: Building a Culture of AI Trust with Monte Carlo

AUTHOR | Virna Sekuj

Axios is one of the most AI-forward newsrooms in the country. But when Kellyanne Smith, Director of Engineering, and Shreye Saxena, Senior Data Scientist, took the stage at Snowflake Summit, they didn’t open with what was working. They opened with a challenge their teams faced and how they solved it.

The lesson wasn’t about building AI. It was about learning to trust it.

The problem that started everything

Content tagging sounds unglamorous. But for a media company like Axios, it’s load-bearing infrastructure.

Every article published needs to be categorized — mapped to a taxonomy that determines which section of the site it appears under and which ads run against it. Done manually, it’s what Shreye called “hunting and pecking.” It’s time-consuming, error-prone, and completely human-dependent.

LLMs were an obvious fit. So the team built an auto-tagging system and shipped it. And the question that came back from the newsroom, over and over, was: “How do we know the AI is doing what we want it to do? How do we know it’s doing what it’s supposed to do?”

That question would define everything that came next.

The first attempt at an answer

In mid-2024, Axios built what Shreye described as an early-stage automated evaluation framework. Every OpenAI request for auto-tagging ran through a second LLM call — an evaluator that scored the output on a scale of 1 to 5. Results surfaced in a dashboard the team checked roughly once a month.
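
To make the shape of that first framework concrete, here is a minimal sketch of the "one call to do the task, one call to grade it" pattern. It is not Axios's code: `evaluate_request`, `EvalRecord`, and the two stand-in callables are hypothetical names, and the real LLM calls are replaced with simple local functions so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalRecord:
    """One graded auto-tagging request (hypothetical record shape)."""
    article_id: str
    tags: List[str]
    score: int  # 1 (poor) to 5 (excellent)


def evaluate_request(
    article_id: str,
    text: str,
    tagger: Callable[[str], List[str]],
    judge: Callable[[str, List[str]], int],
) -> EvalRecord:
    """Run the task, then grade it with a second call."""
    tags = tagger(text)        # first call: produce the tags
    score = judge(text, tags)  # second call: score the tags from 1 to 5
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return EvalRecord(article_id, tags, score)


# Stand-ins for the two LLM calls (assumptions, for illustration only).
def fake_tagger(text: str) -> List[str]:
    return ["politics"] if "White House" in text else ["general"]


def fake_judge(text: str, tags: List[str]) -> int:
    return 5 if tags != ["general"] else 2


record = evaluate_request("a1", "News from the White House", fake_tagger, fake_judge)
```

Because every request passes through both callables, the grading step adds a second model call per article, which is where the extra token cost in this design comes from.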

It was a real step up from gut feel. But the problems were obvious in retrospect.

The team was reactive, not proactive — learning about issues after the fact, not when they happened. They’d effectively doubled their token costs by making every request twice: once to do the task, once to grade it. And the framework was optimizing for an internal policy system, not for the question the newsroom actually cared about: is the AI doing what we want?

What trust actually requires

In early 2025, Axios started a design partnership with Monte Carlo to advance their observability maturity. They’d been using Monte Carlo for traditional data monitoring — Airflow failures, data quality changes — and saw an opportunity to bring that same rigor to their AI systems.

Out of that work, the team codified what trust actually means for an AI system in production. Four questions:

  1. Is the agent retrieving the right context — and is that context correct?
  2. Is the agent performing efficiently, both in terms of request latency and token cost?
  3. Is the agent behaving as intended — consistently, from development through production?
  4. Are the outputs fit for purpose?

That last question is subtler than it sounds. Shreye put it plainly: “It’s one thing to say a story about the White House is tagged with ‘President of the United States.’ Is that useful for classifying a section? Is that useful for targeting?” Accuracy and usefulness aren’t the same thing. The evaluation framework has to reflect the actual goal.

Building the infrastructure to answer those questions

By mid-2025, Axios had rebuilt their observability stack. OpenTelemetry collectors captured spans. Monte Carlo served as the orchestration layer on top of their first-party data warehouse. Instead of a single 1-to-5 score, evaluations now tracked multiple dimensions — task completion, helpfulness, groundedness, accuracy — as time-series data with anomaly detection.
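
As a rough illustration of what "multiple dimensions as time-series data" can look like, the sketch below fans each evaluation out into one series per dimension. The four dimension names come from the talk; the `Evaluation` record, the `to_time_series` helper, and the sample values are assumptions for illustration, not the schema Axios or Monte Carlo uses.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Tuple

# Dimension names as listed in the talk.
DIMENSIONS = ("task_completion", "helpfulness", "groundedness", "accuracy")


@dataclass(frozen=True)
class Evaluation:
    """One evaluated agent output with a score per dimension (hypothetical)."""
    timestamp: datetime
    scores: Dict[str, float]


def to_time_series(
    evals: List[Evaluation],
) -> Dict[str, List[Tuple[datetime, float]]]:
    """Turn evaluations into one time-ordered series per dimension."""
    series: Dict[str, List[Tuple[datetime, float]]] = defaultdict(list)
    for ev in sorted(evals, key=lambda e: e.timestamp):
        for dim in DIMENSIONS:
            if dim in ev.scores:
                series[dim].append((ev.timestamp, ev.scores[dim]))
    return dict(series)


# Made-up sample data, deliberately out of order.
evals = [
    Evaluation(datetime(2025, 6, 2, tzinfo=timezone.utc),
               {"accuracy": 4.0, "groundedness": 4.5}),
    Evaluation(datetime(2025, 6, 1, tzinfo=timezone.utc),
               {"accuracy": 4.5, "groundedness": 4.0}),
]
series = to_time_series(evals)
```

Keeping each dimension as its own series means a drop in, say, groundedness can be examined on its own instead of being averaged away inside a single overall score.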

That shift from score to time series was significant. “When you’re using LLM-as-judge evaluations, it can sometimes be really difficult to know that a score is sufficiently wrong to take mitigation action,” Shreye explained. Monte Carlo’s anomaly detection made that call automatic — the same capability Axios already used for data quality checks, now applied to agent outputs.
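
The idea of letting anomaly detection decide when a score is "sufficiently wrong" can be sketched with a simple rolling z-score. This is a generic stand-in technique, not Monte Carlo's detection method, and the function name, window size, threshold, and sample scores are all assumptions.

```python
from statistics import mean, stdev
from typing import List


def find_anomalies(
    values: List[float], window: int = 5, threshold: float = 3.0
) -> List[int]:
    """Return indexes whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Flat baseline: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up daily average accuracy scores with one sharp drop.
daily_accuracy = [4.4, 4.5, 4.3, 4.4, 4.6, 4.5, 2.1, 4.4]
flagged = find_anomalies(daily_accuracy)
```

With this sample data only the sharp drop at index 6 is flagged. The point of the design is that the alerting rule is relative to recent behavior, so nobody has to pick an absolute cutoff for a judge score by hand.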

The mitigation loop changed too. The goal wasn’t just knowing something went wrong. It was reducing the time between something going wrong and someone fixing it — whether that meant updating a prompt, switching models, or revisiting the evaluation framework itself.

Trust isn’t a tool. It’s a mindset.

Axios now has agent observability running across the auto-tagging system, with more use cases being brought into production. But the framing Shreye left the room with wasn’t about the stack.

“By adopting agent observability philosophies within our workflows, we knew that it was not just a tool fit for our needs, but a series of processes, systems, and mindsets that let us build trust within our organization — that we knew AI was going to do what we wanted it to do, whether it was auto tagging as our first AI use case or the dozens of use cases that we’ve applied since then.”

Trust, it turns out, is infrastructure. And it has to be built deliberately.

