AI Observability Updated Jun 03 2026

Integrating trust into the Agentic Control Plane: Observations from Snowflake Summit

AUTHOR | Lior Gavish

At Snowflake Summit this week, Sridhar Ramaswamy laid out a four-component model for the agentic enterprise: data and content, AI models, applications, and an agentic control plane that ties them together.

It’s a useful frame. And I want to build on it — because I think the control plane concept is right, but most teams are going to implement half of it and wonder why their agents keep failing in production.

The three layers that came before

Before getting to the control plane, it’s worth pausing on the first three components, because each one has a maturity problem that doesn’t get resolved by adding orchestration on top.

Data. Sridhar called it the “most defensible moat” — your customers, your operations, your product data. He’s right. But enterprise data estates are rarely as unified as the keynote made them sound. Most organizations are still dealing with pipelines that silently break, tables that go stale without anyone noticing, and — more insidiously — multiple conflicting definitions of the same metric across different teams and systems.

That last problem is worth dwelling on, because agents make it dramatically worse than it was before.

A skilled analyst working across nine sources of truth builds intuition over months. They know which definition of “active customer” the finance team uses versus the one in the CRM. They ask a colleague. They check the data dictionary. They navigate ambiguity through tribal knowledge accumulated over time.

An agent has none of that. It gets whatever context is in scope on a given invocation. Ask it the same question ten times and it might hit a different source each time, producing a different answer each time. The model isn’t broken; the data underneath it is semantically inconsistent. I’ve had customers describe this exact failure: they built a data agent, asked it the same question ten times, and got ten different answers. Weeks of prompt engineering later, the root cause was still in the data layer: nine sources of truth, no canonical owner, no semantic layer.

This matters for how you sequence your investments. Prompt engineering on top of ambiguous data is polishing a cracked foundation. The teams with the most consistent production agent outputs are almost always the ones that resolved data semantics before they started building agents, not the other way around. A semantic layer that defines each metric once and exposes it through canonical views isn’t an AI project. It’s a data project. It just turns out to be a prerequisite for everything else.
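To make "define each metric once" concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not a description of any particular semantic-layer product: the `Metric` and `MetricRegistry` names, the `orders` table, and the SQL text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    owner: str  # the single team accountable for this definition
    sql: str    # hypothetical canonical query; table and columns are illustrative


class MetricRegistry:
    """Holds exactly one definition per metric name."""

    def __init__(self) -> None:
        self._metrics = {}

    def register(self, metric: Metric) -> None:
        # Refuse a second, competing definition instead of silently overwriting.
        if metric.name in self._metrics:
            raise ValueError(f"metric {metric.name!r} is already defined")
        self._metrics[metric.name] = metric

    def resolve(self, name: str) -> Metric:
        # Every caller, human or agent, gets the same definition for a name.
        return self._metrics[name]


registry = MetricRegistry()
registry.register(
    Metric(
        name="active_customer",
        owner="finance",
        sql="SELECT COUNT(DISTINCT customer_id) FROM orders WHERE is_active",
    )
)
```

The point of the sketch is the `ValueError`: a second definition of "active customer" becomes a visible conflict someone has to resolve, rather than a tenth source of truth an agent might land on.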

Models. The pace of release cycles right now is genuinely remarkable — new frontier models every few months, open-source models rapidly closing the gap on closed ones. But as Sridhar noted, the model is not where you build lasting advantage. It’s a commodity that improves over time. What matters architecturally is whether you’ve built your system so you can swap models when the cost-to-capability ratio shifts — which it will, repeatedly, on a timeline you can’t predict. Teams that hardcoded a specific model into their application layer are discovering that architectural debt now.
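One way to keep models swappable is to have application code depend on a small interface rather than on a specific provider client. The sketch below assumes a hypothetical `ChatModel` interface and uses a stand-in `EchoModel` instead of a real provider SDK; the model names are placeholders.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only surface the application layer is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in for a real provider client (hypothetical, for illustration)."""

    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, prompt: str) -> str:
        return f"[{self.label}] {prompt}"


# Configuration, not application code, decides which model is active.
MODELS = {
    "model_a": EchoModel("model_a"),
    "model_b": EchoModel("model_b"),
}


def answer(question: str, model_name: str) -> str:
    model: ChatModel = MODELS[model_name]
    return model.complete(question)
```

Swapping models then means changing the `model_name` passed in from configuration; nothing in `answer` or its callers needs to change.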

Applications. The MCP announcement is significant. An open standard for connecting AI systems to the applications your business runs on — Salesforce, Workday, ServiceNow, Slack — solves a real integration problem. But connectivity isn’t governance. And in production, MCP as currently implemented is a thin proxy: when a tool call is slow, the root cause is almost always the backend. MCP has no intelligence to retry, degrade gracefully, or surface what went wrong. The control plane narrative implies coordination, but connectivity without observability at the tool call level is just surface area for failure with no mechanism to detect it.
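Observability at the tool call level can start as a thin wrapper that records latency and outcome for every call. The sketch below is deliberately generic: it wraps any Python callable and does not use or imply any MCP SDK API. `ToolCallRecord` and `observed_call` are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class ToolCallRecord:
    tool: str
    ok: bool
    latency_s: float
    error: Optional[str]


def observed_call(
    tool: str,
    fn: Callable[..., Any],
    records: List[ToolCallRecord],
    *args: Any,
    **kwargs: Any,
) -> Any:
    """Run a tool call and record how long it took and whether it failed."""
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    except Exception as exc:
        # Record the failure, then re-raise so the caller still sees it.
        elapsed = time.perf_counter() - start
        records.append(ToolCallRecord(tool, False, elapsed, repr(exc)))
        raise
    elapsed = time.perf_counter() - start
    records.append(ToolCallRecord(tool, True, elapsed, None))
    return result
```

In a real system the records would go to a tracing backend rather than a list, but even this much makes a slow or failing backend show up as data instead of a vague complaint that the agent feels slow.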

What the control plane actually needs to do

The control plane framing — a coordination layer that sits across data, models, and applications — is the right mental model for where enterprise AI architecture is heading. Where I’d push on the implementation is here: orchestration and trust are two different things, and most teams are building one and calling it both.

Orchestration operates at the moment of execution. It routes requests, manages context, coordinates multi-agent workflows, and enforces action guardrails. All necessary. But orchestration tells agents what to do. It doesn’t continuously verify whether they’re doing it correctly.

That distinction matters more than it might seem, because of a problem I’d call unknown unknowns. Your evals are hypothesis-driven — you test for the failure modes you already know about. You write test cases based on what you think could go wrong. But the failures that actually hurt you in production are the ones you didn’t know to test for: the edge case user behavior nobody anticipated, the data distribution shift that changed agent outputs silently, the prompt update that fixed one thing and broke another in a path nobody exercised in staging.

Production observability catches what evals can’t, because it watches real behavior against a continuous baseline rather than a predefined test set. The question it answers is not “does this pass my tests?” — it’s “has something changed from how this was behaving before?” Those are different questions, and you need both.
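The "has something changed?" question can be sketched as a comparison of a new observation against a baseline of recent values. The example below uses a plain z-score as a stand-in; production anomaly detection is usually more involved (seasonality, trend, learned thresholds), and the function name, threshold, and sample values are assumptions for illustration.

```python
from statistics import mean, stdev
from typing import List


def is_anomalous(baseline: List[float], observed: float, threshold: float = 3.0) -> bool:
    """Flag an observation that sits far outside the baseline's usual spread."""
    if len(baseline) < 2:
        return False  # not enough history to say anything
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return observed != mu
    return abs(observed - mu) / sigma > threshold


# Hypothetical daily scores for one agent metric (for example, groundedness).
history = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.90]
```

With this history, `is_anomalous(history, 0.60)` is flagged while `is_anomalous(history, 0.905)` is not. No test case had to anticipate why the score dropped, only that it did.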

The trust layer answers four questions continuously, for every agent in production:

Is the agent retrieving the right context, and is that context correct? This is where most production failures originate: not in the model or the prompt, but in the data the agent is reasoning on. If a data pipeline breaks and nobody has connected the observability of that pipeline to the agents consuming it, the agent produces confident wrong answers indefinitely. I watched this happen on a customer call recently: an operations agent was asked about data health and responded that everything was fine. The underlying pipeline had been broken for six weeks. This was not a hallucination; the model did exactly what it should. The context was wrong, and the agent had no mechanism to detect it.

Is the agent performing efficiently? Token cost and latency are invisible without span-level instrumentation. An agent that’s economically viable at a hundred requests per day may be completely unviable at ten thousand — not because it breaks, but because the cost curve doesn’t hold. Most teams discover this after committing to the architecture.

Is the agent behaving as intended? Model providers silently update underlying models. Prompt changes intended to fix one behavior introduce regressions elsewhere. Data distributions shift and agent behavior shifts with them, even when the prompt and model are identical. A behavior you validated in staging is not guaranteed to be the behavior running in production six months later. Without a continuous behavioral baseline, you won’t know until a user tells you.

Are the outputs fit for purpose? This is subtler than correctness. A news organization’s auto-tagging agent correctly identified that an article mentioned a specific legislator — and tagged it with geopolitical conflict topics when it should have been tagged with domestic policy. Correct fact. Wrong classification. Generic LLM-as-judge evaluations score helpfulness and groundedness. They don’t know your taxonomy. Evaluations that are fit for purpose require operationalizing domain knowledge, not running a generic scorer.
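For the efficiency question, the arithmetic is worth running before committing to an architecture. The sketch below estimates daily spend from per-request token counts. The token counts and per-1,000-token prices are hypothetical placeholders, not any provider's actual rates.

```python
def daily_cost_usd(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Estimate daily model spend from average per-request token usage."""
    per_request = (
        (input_tokens / 1000) * usd_per_1k_input
        + (output_tokens / 1000) * usd_per_1k_output
    )
    return requests_per_day * per_request


# Hypothetical agent: 4,000 input and 500 output tokens per request,
# at assumed prices of $0.003 and $0.015 per 1,000 tokens.
pilot = daily_cost_usd(100, 4000, 500, 0.003, 0.015)
at_scale = daily_cost_usd(10_000, 4000, 500, 0.003, 0.015)
```

Under these assumed numbers the pilot costs about $1.95 per day and the scaled deployment about $195 per day. The per-request token counts are exactly what span-level instrumentation provides; without it, the inputs to this calculation are guesses.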

The architecture consequence

The control plane needs two sub-layers. An orchestration layer that coordinates action — routing, context injection, cross-agent state, guardrails. And a trust layer that continuously monitors the four questions above, with span-level telemetry tied back to the underlying data, time-series anomaly detection that surfaces drift before users notice it, and evaluation pipelines grounded in your actual business requirements.

One thing I’d add to Sridhar’s model: the data layer and the trust layer are not independent. The trust layer that watches your agents has to share a lineage model with the data layer feeding them. When a pipeline breaks, every agent downstream of that pipeline should surface a freshness warning before it responds — not as a prompt instruction, but as an observability check that fires before synthesis happens. Connecting these two layers is what closes the loop between data health and agent reliability. Without it, you’re monitoring two separate things that are actually the same failure.
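A freshness check that fires before synthesis might look like the sketch below. It assumes the lineage model can supply, for each agent, the upstream tables it depends on along with their last update time and an allowed staleness window. `TableStatus`, `stale_tables`, and `respond` are hypothetical names, and `synthesize` stands in for whatever produces the agent's answer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, List


@dataclass
class TableStatus:
    name: str
    last_updated: datetime
    max_staleness: timedelta


def stale_tables(upstream: List[TableStatus], now: datetime) -> List[str]:
    """Return the names of upstream tables older than their allowed staleness."""
    return [t.name for t in upstream if now - t.last_updated > t.max_staleness]


def respond(
    question: str,
    upstream: List[TableStatus],
    synthesize: Callable[[str], str],
    now: datetime,
) -> str:
    # The check runs in code, before the model is called, not as a prompt instruction.
    stale = stale_tables(upstream, now)
    if stale:
        return "Freshness warning: stale upstream data (" + ", ".join(stale) + ")."
    return synthesize(question)


# Example: a table last updated six weeks ago against a 24-hour freshness window.
now = datetime(2026, 6, 3, tzinfo=timezone.utc)
upstream = [TableStatus("pipeline_health", now - timedelta(weeks=6), timedelta(hours=24))]
reply = respond("Is the data healthy?", upstream, lambda q: "Everything is fine.", now)
```

Here `reply` is the freshness warning rather than "Everything is fine." Whether to withhold the answer or attach the warning to it is a design choice; this sketch withholds it for simplicity.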

Orchestration without a trust layer is a system that coordinates failures more efficiently. You route the wrong answer faster. You scale the misclassification. You propagate the stale data to more users more consistently.

Build both halves.
