AI Agent Observability in Production: Beyond Logs and Latency
Mubbashir Mustafa
8 min read
Your Datadog dashboard won't tell you when an AI agent starts making bad decisions. It will tell you the agent is running, responding within latency thresholds, and not throwing errors. All of that can be true while the agent is confidently recommending the wrong products, citing outdated policies, or gradually expanding its tool use beyond its authorized scope. Traditional application performance monitoring was built for deterministic systems. AI agents are probabilistic systems, and observability for probabilistic systems requires fundamentally different instrumentation.
CIO Magazine reported that "agentic AI systems don't fail suddenly; they drift over time." That framing captures the core challenge. A web server either responds correctly or it throws an error. An AI agent can respond fluently, confidently, and incorrectly, and nothing in a traditional monitoring stack will flag it. The 90% of organizations that remain in pilot mode for agentic AI, according to industry surveys, are stuck in part because they can't observe what their agents are doing well enough to trust them in production.
What Traditional APM Misses
Application performance monitoring tracks three things well: availability (is the service up?), latency (how fast does it respond?), and error rates (how often does it fail?). These metrics are necessary for agent observability but nowhere near sufficient.
An agent that's available, fast, and error-free can still be producing outputs that are factually wrong, contextually inappropriate, or operationally dangerous. A customer service agent that responds in 200 milliseconds with a confident, well-formatted, completely incorrect answer about your return policy is a worse outcome than a 500-error that routes the customer to a human. The error is visible and immediately addressable. The incorrect confident response propagates misinformation and erodes customer trust without anyone noticing until customer complaints accumulate.
Traditional APM also misses the multi-step nature of agent execution. A single agent interaction might involve: receiving a user query, retrieving context from a knowledge graph, constructing a prompt, calling a model, parsing the response, deciding to call a tool, executing the tool call, interpreting the result, generating a final response, and logging the interaction. Each step can succeed individually while the overall interaction fails. The model call returns a valid response, but the response is based on retrieved context that was stale. The tool call executes successfully, but the agent misinterpreted the result. APM sees green across the board. The user sees a wrong answer.
The Agent Observability Stack
Enterprise-grade agent observability requires four layers that don't exist in traditional monitoring: reasoning traces, behavioral baselines, cognitive drift detection, and cost attribution.
Reasoning traces capture not just what the agent did but why it did it. A reasoning trace records the chain from input to output: what context was retrieved, how the prompt was constructed, what the model returned, what tool calls were made, and how the agent arrived at its final response. This is the agent equivalent of distributed tracing in microservices. Without it, debugging an incorrect agent output requires guessing which step in the chain went wrong.
OpenTelemetry is emerging as the standard for agent telemetry, extending its existing trace and span model to cover LLM calls, retrieval operations, and tool invocations. The GenAI semantic conventions in OpenTelemetry define standardized attributes for model name, token counts, temperature settings, and completion content. Organizations that instrument their agents with OpenTelemetry today gain compatibility with the growing ecosystem of observability platforms (Langfuse, Arize, LangSmith, Datadog's LLM Monitoring) while maintaining the ability to query raw trace data for custom analysis.
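A reasoning trace instrumented this way is just a span carrying the GenAI semantic convention attributes around each model call. The sketch below is a stdlib-only stand-in that shows the shape of the data; a real deployment would use the OpenTelemetry SDK and export spans to a collector, and the helper names (`traced_llm_call`, `fake_model`) are illustrative, not part of any library.

```python
import time
import uuid

def traced_llm_call(model, prompt, call_fn, temperature=0.2):
    """Wrap a model call and record a span-like dict using OpenTelemetry
    GenAI semantic convention attribute names (gen_ai.*)."""
    span = {
        "trace_id": uuid.uuid4().hex,
        "name": "chat " + model,
        "attributes": {
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": model,
            "gen_ai.request.temperature": temperature,
        },
    }
    start = time.monotonic()
    completion, in_tokens, out_tokens = call_fn(prompt)
    span["duration_ms"] = (time.monotonic() - start) * 1000
    span["attributes"]["gen_ai.usage.input_tokens"] = in_tokens
    span["attributes"]["gen_ai.usage.output_tokens"] = out_tokens
    return completion, span

# Stand-in for a real model client, for illustration only.
def fake_model(prompt):
    return "42", len(prompt.split()), 1

answer, span = traced_llm_call("gpt-4o-mini", "what is 6 * 7", fake_model)
```

Because the attribute names follow the shared convention, the same span data is queryable whether it lands in Langfuse, Arize, LangSmith, or a raw trace store.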
Behavioral baselines establish what "normal" looks like for each agent. An agent's baseline includes its typical response patterns (length, tone, confidence level), its tool use patterns (which tools it calls, how frequently, in what order), its resource consumption (tokens per interaction, cost per task, latency distribution), and its output quality (accuracy rates, user satisfaction, escalation frequency). The baseline is not static. It should update as the agent's workload, data sources, and model versions change. But updates should be controlled: a sudden shift in tool use patterns should trigger an investigation, not an automatic baseline adjustment.
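The "controlled update" rule above can be made concrete with a minimal sketch: a baseline that learns tool-use patterns during a calibration window, then freezes, so that a novel tool triggers investigation instead of silently widening the baseline. Class and method names here are illustrative, not a real library API.

```python
from collections import Counter

class AgentBaseline:
    """Per-agent baseline of tool-use patterns with controlled updates."""

    def __init__(self):
        self.tool_counts = Counter()
        self.frozen = False  # once frozen, novel tools require review

    def learn(self, tool):
        # Calibration phase: record observed tool use freely.
        self.tool_counts[tool] += 1

    def freeze(self):
        self.frozen = True

    def observe(self, tool):
        """Return 'ok' for tools in the baseline, 'investigate' otherwise."""
        if self.frozen and tool not in self.tool_counts:
            return "investigate"  # do not auto-adjust the baseline
        self.tool_counts[tool] += 1
        return "ok"

b = AgentBaseline()
for t in ["search", "search", "crm_lookup"]:
    b.learn(t)
b.freeze()

ok = b.observe("search")        # known tool, within baseline
flag = b.observe("send_email")  # novel tool, flagged for review
```

The same pattern extends to response length, token consumption, and escalation frequency by replacing the counter with rolling distributions.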
Cognitive drift detection monitors whether an agent's behavior is changing in ways that weren't caused by intentional updates. Four types of drift matter in production. Goal drift occurs when the agent's outputs start optimizing for something other than its intended objective. A sales support agent might drift toward longer, more detailed responses that impress in demos but slow down actual sales conversations. Context drift occurs when the underlying data sources change in ways that shift the agent's behavior. A knowledge base update that introduces contradictory information can cause an agent to produce inconsistent outputs. Reasoning drift occurs when model updates or prompt changes subtly alter the agent's decision patterns. A model provider's silent update (a common occurrence) can change how the agent weighs competing information. Collaboration drift occurs when changes in connected systems or other agents alter the agent's operating environment. An API change in a downstream system that modifies response formats can cause an agent to misparse tool outputs.
Cost attribution tracks resource consumption per agent, per team, per use case, and per interaction. Without cost attribution, AI spend is a single line item that grows opaquely. With it, you can identify which agents are cost-efficient, which are consuming disproportionate resources, and where routing optimizations would yield the greatest savings. Cost attribution also enables chargeback models where business units pay for the agent resources they consume, creating natural cost discipline across the organization.
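A minimal cost-attribution ledger keyed by team and agent is enough to turn that opaque line item into a chargeback report. The prices and model names below are illustrative placeholders, not real provider rates.

```python
from collections import defaultdict

# Illustrative per-1K-token prices; substitute your providers' real rates.
PRICE_PER_1K = {"small-model": 0.0005, "large-model": 0.01}

ledger = defaultdict(float)  # (team, agent) -> accumulated cost in dollars

def record(team, agent, model, tokens):
    """Attribute one interaction's model cost to its team and agent."""
    cost = tokens / 1000 * PRICE_PER_1K[model]
    ledger[(team, agent)] += cost
    return cost

record("sales", "quote-bot", "large-model", 2000)
record("sales", "quote-bot", "small-model", 10000)
record("support", "faq-bot", "small-model", 4000)

# Biggest consumer, e.g. for chargeback or routing-optimization review.
top = max(ledger, key=ledger.get)
```

In practice the `(team, agent)` key would extend to use case and interaction ID, with the token counts read straight from the `gen_ai.usage.*` trace attributes rather than passed in by hand.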
Detecting Drift Before Users Do
The difference between reactive and proactive agent observability is drift detection. Reactive observability waits for user complaints, escalation spikes, or audit failures to indicate a problem. Proactive observability detects behavioral changes before they impact users.
Drift detection works by continuously comparing current agent behavior against the established baseline. Statistical approaches measure distribution shifts in output characteristics: response length, confidence scores, tool call frequency, and topic distribution. Semantic approaches use embedding-based comparison to detect when the meaning of agent outputs shifts even if the surface characteristics (length, format, tone) remain stable. An agent that starts recommending a different product category without any configuration change would be caught by semantic drift detection but might slip past statistical monitoring.
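The semantic approach can be sketched with a toy example: embed the baseline outputs and the recent outputs, then compare with cosine distance. The bag-of-words "embedding" below is deliberately crude so the example stays self-contained; a production system would use real sentence embeddings, but the comparison logic is the same. The threshold value is an illustrative assumption to be tuned per agent.

```python
import math
from collections import Counter

def embed(texts):
    """Toy embedding: bag-of-words counts. A real system would use a
    sentence-embedding model here."""
    return Counter(w for t in texts for w in t.lower().split())

def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1 - dot / (na * nb)

# Baseline outputs vs. a recent window where the agent, without any
# configuration change, starts recommending a different product tier.
baseline = embed(["recommend the pro plan", "the pro plan fits you"])
current = embed(["recommend the enterprise tier", "enterprise tier fits"])

drift = cosine_distance(baseline, current)
alert = drift > 0.5  # illustrative threshold, tuned per agent
```

Note that surface statistics (response length, format) barely change between the two windows, which is exactly the case the paragraph above describes: semantic comparison catches it, statistical monitoring alone might not.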
The practical implementation combines automated detection with human review. Automated systems flag potential drift events and classify them by severity. Low-severity drift (a 5% change in average response length) generates a log entry. Medium-severity drift (a new tool being called that wasn't in the agent's baseline) generates an alert to the agent's owner. High-severity drift (a sudden spike in PII appearing in agent outputs) triggers an automatic pause and incident response.
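Those three tiers reduce to a small classification-and-routing table. The event fields and action names in this sketch are illustrative placeholders for hooks into a real alerting and incident-management stack.

```python
# Severity tiers mirroring the paragraph above: log, alert, or pause.
SEVERITY_ACTIONS = {
    "low": "log",
    "medium": "alert_owner",
    "high": "pause_and_page",
}

def classify(event):
    """Map a drift event (illustrative fields) to a severity tier."""
    if event.get("pii_detected"):
        return "high"    # e.g. PII spike in outputs
    if event.get("novel_tool"):
        return "medium"  # e.g. a tool call outside the baseline
    return "low"         # e.g. small shift in response length

def handle(event):
    return SEVERITY_ACTIONS[classify(event)]

a1 = handle({"metric": "response_length", "delta_pct": 5})  # log entry
a2 = handle({"novel_tool": "send_email"})                   # owner alert
a3 = handle({"pii_detected": True})                         # auto-pause
```

Keeping the table declarative makes the next step, threshold tuning, a configuration change rather than a code change.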
The alert thresholds require tuning. Set them too tight, and the team drowns in false positives. Set them too loose, and real drift goes undetected. Most organizations start with loose thresholds and tighten them as they accumulate data about normal behavioral variation. The first month of monitoring generates the baseline. The second month calibrates the thresholds. By month three, the system should be producing actionable alerts with an acceptable false positive rate.
From Observation to Control
Observability without the ability to act on what you observe is monitoring theater. The observability stack should connect directly to the governance and deployment infrastructure so that detected issues trigger appropriate responses.
When drift detection identifies a degraded agent, the system should be able to reduce the agent's autonomy (routing high-risk actions through human review), switch the agent to a different model version (if the drift correlates with a model update), restrict the agent's tool access (if the drift involves unexpected tool calls), or pause the agent entirely and redirect traffic to a fallback. These responses should be configurable per agent and per drift type. A customer-facing agent might pause on any significant drift. An internal analysis agent might tolerate wider behavioral variation.
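The "configurable per agent and per drift type" requirement is naturally expressed as a policy table consulted by the control plane. Agent names, drift-type labels, and action strings below are illustrative assumptions; the actions would map to real hooks in the deployment infrastructure.

```python
# Per-agent response policy: which control action each drift type triggers.
POLICIES = {
    "customer-facing-agent": {
        "goal": "pause",                    # pause on any significant drift
        "context": "pause",
        "reasoning": "pin_model_version",   # drift correlates with model update
        "collaboration": "restrict_tools",  # unexpected tool behavior
    },
    "internal-analysis-agent": {
        "goal": "require_human_review",     # tolerate wider variation
        "context": "log_only",
        "reasoning": "log_only",
        "collaboration": "restrict_tools",
    },
}

def respond(agent, drift_type):
    """Look up the configured response; default to the safest action."""
    return POLICIES.get(agent, {}).get(drift_type, "require_human_review")
```

Defaulting unknown agents and drift types to human review keeps the fail-safe bias the paragraph argues for.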
Incident response for agents differs from traditional incident response in one critical way: the agent's "fix" might not be a code change. If an agent's behavior changed because its underlying data changed, the fix is in the data, not the agent. If behavior changed because a model provider updated their model, the fix is model pinning or routing to a different provider. If behavior changed because a connected system modified its API responses, the fix is in the integration layer. The observability system needs to capture enough context to distinguish between these root causes.
Cost control is the other dimension where observability drives action. When cost attribution reveals that a single agent is consuming 30% of the model API budget, the response might be routing its low-complexity requests to a cheaper model, reducing its context window size, implementing caching for repeated queries, or rearchitecting the agent to make fewer model calls per interaction. Observability provides the data. The platform provides the levers.
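Two of those levers, complexity-based routing and caching of repeated queries, fit in a few lines. The complexity heuristic and model names below are illustrative assumptions; a real router would score queries with something better than word count.

```python
from functools import lru_cache

def pick_model(query):
    """Crude complexity heuristic (illustrative): short queries go to the
    cheaper model, long ones to the larger model."""
    return "small-model" if len(query.split()) < 12 else "large-model"

@lru_cache(maxsize=1024)
def answer(query):
    # Stand-in for the actual model call; the cache means a repeated
    # query costs zero model spend.
    model = pick_model(query)
    return f"[{model}] response to: {query}"

answer("store hours?")           # routed to small-model, result cached
answer("store hours?")           # served from cache
hits = answer.cache_info().hits  # cache hits so far
```

The cost-attribution data tells you which agents are worth this treatment; the router and cache are the levers the platform pulls.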
The feedback loop between observability and improvement is what separates operational agent programs from perpetual pilots. Every drift event, every cost spike, every quality degradation is data that feeds back into the agent's design. Teams with strong observability iterate faster because they can see what's working and what isn't. Teams without it make changes blindly and hope for the best. Over six months, the observability-driven team ships ten agent improvements backed by data. The blind team ships three improvements and two regressions they didn't detect.
Building the Observability Layer
The practical question for most organizations is whether to build agent observability into their existing monitoring stack or deploy a purpose-built platform.
The existing-stack approach adds agent-specific instrumentation to Datadog, New Relic, or Grafana. The advantage is that teams already know these tools, and agent metrics appear alongside application metrics in familiar dashboards. The limitation is that these platforms lack native support for reasoning traces, drift detection, and semantic analysis. You end up building custom dashboards, custom alerting rules, and custom analysis pipelines on top of a platform designed for different problems.
The purpose-built approach uses platforms like Langfuse, Arize, or LangSmith that are designed for LLM and agent observability. These platforms provide native reasoning trace visualization, drift detection, prompt versioning, and output quality evaluation. The limitation is that they add another tool to the monitoring stack, and integration with your existing alerting and incident management workflows requires additional engineering.
Most enterprises at scale will need both: general-purpose APM for the infrastructure that agents run on (containers, APIs, databases) and purpose-built observability for the agent behavior layer (reasoning, drift, quality, cost). The key is ensuring that both layers can correlate events. When an agent's output quality degrades, you need to determine whether the cause is infrastructure (the database is slow), data (the knowledge base is stale), model (the provider updated the model), or agent logic (a recent prompt change had unintended effects). That correlation requires both layers sharing a common trace context.
The investment in agent observability pays for itself through three channels: faster incident resolution (finding the root cause of agent issues in minutes rather than days), proactive drift prevention (catching behavioral changes before users complain), and cost optimization (identifying and eliminating waste in model consumption). Organizations that treat observability as optional for agent deployments consistently find that the cost of not observing, in incident response time, user trust erosion, and compliance gaps, exceeds the cost of building the observability layer by an order of magnitude.
AI agents need observability built for probabilistic systems: reasoning traces, cognitive drift detection, and cost attribution. Rebase embeds agent observability into the infrastructure layer so you know what every agent is doing, why, and how much it costs. See it in action: rebase.run/demo.
Related reading:
Agentic AI Infrastructure: The Complete Stack
AI Agent Governance Framework
AI Agent Security Posture: From Risk to Control
Why Your AI Agents Hallucinate
Securing AI Agent Tool Use
Ready to see how Rebase works? Book a demo or explore the platform.



