Agent Observability Overview 4: From Data to Judgment
This is the final article in the Agent Observability series. The previous three articles covered “why it’s hard,” “what to observe,” and “how to collect.” This article addresses the ultimate question: once you have the data, then what?
From “Being Able to See” to “Being Able to Judge”
The purpose of observability has never been to “collect data,” but to make judgments:
- Is this Agent’s behavior normal?
- Was this cost worth it?
- What was the root cause of this task failure?
- Is this Agent trustworthy?
If data can’t be transformed into judgments, then collecting more only accumulates “data debt.”
This article discusses several key scenarios for going from data to judgment.
1. Cost Attribution: Where Did the Money Go?
LLMs are billed by token. When an Agent runs, token consumption can be staggering.
Real cases I’ve seen:
- A “smart customer service” Agent consumed thousands of dollars worth of tokens in a day
- A “code assistant” Agent, due to getting stuck in a loop, consumed a month’s budget in minutes
- In multi-Agent collaboration scenarios, when the bill arrives, no one knows which Agent is responsible
Why Is Cost Attribution Difficult?
On the surface, cost attribution seems simple: just tally each Agent’s token consumption, right?
In practice, it’s far from simple:
Problem 1: Uneven Token Consumption
The cost of one LLM call = Input Tokens × Input Unit Price + Output Tokens × Output Unit Price
But token consumption varies dramatically between calls:
- Simple Q&A: tens of tokens
- Conversation with context: thousands of tokens
- Long document analysis: tens of thousands of tokens
Just counting “number of calls” isn’t enough; you must be precise to the token level.
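For illustration, here is a minimal sketch of token-level cost accounting. The unit prices are placeholder assumptions, not any provider's actual pricing, and the token counts are made up:

```python
# Minimal sketch: cost of a single LLM call at token granularity.
# Prices are illustrative assumptions (USD per 1K tokens), not real pricing.
INPUT_PRICE_PER_1K = 0.0005
OUTPUT_PRICE_PER_1K = 0.0015

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost = input tokens x input unit price + output tokens x output unit price."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + \
           (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# A long-document analysis call dwarfs a simple Q&A call, even at the same call count.
print(call_cost(input_tokens=80, output_tokens=40))         # simple Q&A
print(call_cost(input_tokens=45_000, output_tokens=1_200))  # long document analysis
```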
Problem 2: Cost Spans Multiple Stages
The cost of one user task might be distributed across:
- Main Agent reasoning
- RAG retrieval calls (Embeddings cost money too)
- Intermediate reasoning in multi-Agent collaboration
- Failed retries
The user only sees “one task,” but behind it might be a dozen LLM calls.
Problem 3: Wasted Costs Are Hard to Identify
The most painful is wasted cost: the Agent does a lot of work but ultimately produces no useful result.
For example:
- Agent got stuck in a tool call loop, repeatedly trying the same failed operation
- Agent’s reasoning “went off track,” spending lots of tokens discussing irrelevant topics
- Agent requested too much context but only used a small portion
Cost Attribution in Practice
My approach in practice is stage-based metering:
Token consumption for one task =
Input (user input)
+ Context (RAG retrieval / conversation history)
+ Reasoning (reasoning / CoT)
+ Tool_Output (tool returns)
+ Response (final response)
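A minimal sketch of how such per-stage metering could be recorded (the stage names mirror the breakdown above; the data structure itself is my own assumption):

```python
from dataclasses import dataclass, field

# Sketch: per-task token metering broken down by stage.
# Stage names follow the breakdown above; everything else is illustrative.
STAGES = ("input", "context", "reasoning", "tool_output", "response")

@dataclass
class TaskTokenMeter:
    task_id: str
    by_stage: dict = field(default_factory=lambda: {s: 0 for s in STAGES})

    def add(self, stage: str, tokens: int) -> None:
        self.by_stage[stage] += tokens

    def total(self) -> int:
        return sum(self.by_stage.values())

meter = TaskTokenMeter(task_id="task-001")
meter.add("input", 120)
meter.add("context", 3_800)   # RAG retrieval + conversation history
meter.add("reasoning", 2_400)
meter.add("tool_output", 900)
meter.add("response", 350)
print(meter.by_stage, meter.total())
```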
Metering each stage separately has benefits:
- Accountability: If Context stage tokens explode, it indicates RAG retrieval strategy issues
- Optimization: If Reasoning stage tokens are too high, it indicates Prompt design needs improvement
- Alerting: Set thresholds for each stage; alert if exceeded
A useful metric is Token Efficiency Ratio:
Token Efficiency Ratio = Effective Output Tokens / Total Consumed Tokens
If this ratio is very low, the Agent is “thinking” a lot but “producing” little; it may just be spinning its wheels.
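A small sketch of the ratio, treating final-response tokens as the “effective output.” What counts as effective is a judgment call, and the numbers and threshold here are assumptions:

```python
# Sketch: Token Efficiency Ratio from a per-stage breakdown (numbers are illustrative).
stage_tokens = {"input": 120, "context": 3_800, "reasoning": 2_400,
                "tool_output": 900, "response": 350}

# Treat final-response tokens as "effective output"; adjust to your own definition.
efficiency = stage_tokens["response"] / sum(stage_tokens.values())
if efficiency < 0.05:  # threshold is an assumption; tune it per Agent
    print(f"Low token efficiency ({efficiency:.1%}): the Agent may be spinning its wheels")
```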
2. Behavioral Baselines and Anomaly Detection
In security scenarios, another core requirement is anomaly detection: has the Agent’s behavior deviated from normal patterns?
Why Doesn’t Traditional Anomaly Detection Work Well for Agents?
Traditional service anomaly detection is usually based on statistical metrics:
- P99 latency suddenly increased → anomaly
- Error rate exceeded threshold → anomaly
- QPS suddenly dropped → anomaly
But Agent “anomalies” often don’t show up in these metrics:
- An Agent might have normal latency, no errors, but made wrong decisions
- An Agent might execute successfully, but accessed data it shouldn’t have
- An Agent might call “legitimate” tools, but the call sequence reveals malicious intent
Agent anomalies are semantic-level anomalies, not performance-level anomalies.
Establishing Behavioral Baselines
My practice is to establish multi-dimensional behavioral baselines for each Agent:
Dimension 1: Semantic Baseline
Convert the Agent’s historical Prompts and Responses into vectors with an Embedding model, then compute the “normal range” of the vector distribution.
- Normal state: the topics the Agent discusses cluster in a certain region of vector space
- Anomaly signal: new interaction vectors suddenly deviate from this region
For example, a “code assistant” Agent normally discusses technical topics. If one day it starts frequently talking about “company finances” or “personnel information,” the vector space will show significant “semantic drift.”
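A minimal sketch of this kind of drift check, assuming embedding vectors for historical interactions are already available (the embedding step is out of scope; the 3-sigma radius and the random stand-in vectors are illustrative assumptions):

```python
import numpy as np

# Sketch: semantic drift detection against a baseline of historical embeddings.
# `history` would come from embedding the Agent's past Prompts/Responses.
def build_baseline(history: np.ndarray):
    centroid = history.mean(axis=0)
    # Typical distance of historical points from the centroid defines the "normal" region.
    dists = np.linalg.norm(history - centroid, axis=1)
    return centroid, dists.mean() + 3 * dists.std()  # 3-sigma radius (assumption)

def semantic_drift_score(vec: np.ndarray, centroid: np.ndarray, radius: float) -> float:
    # Scores > 1.0 mean the new interaction falls outside the normal region.
    return np.linalg.norm(vec - centroid) / radius

rng = np.random.default_rng(0)
history = rng.normal(0.0, 0.1, size=(500, 768))     # stand-in for real embeddings
centroid, radius = build_baseline(history)
print(semantic_drift_score(rng.normal(0.0, 0.1, 768), centroid, radius))  # within baseline
print(semantic_drift_score(rng.normal(1.0, 0.1, 768), centroid, radius))  # semantic drift
```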
Dimension 2: Tool Call Baseline
Track the Agent’s tool call patterns:
- Commonly used tool set
- Frequency distribution of tool calls
- Sequence patterns of tool calls (modeled with Markov chains)
Anomaly signals include:
- Calling a tool never used before
- Abnormal call frequency (suddenly high-frequency calls to a certain tool)
- Abnormal call sequences (rare tool combinations appear)
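Here is a sketch of a tool-call baseline built as a first-order Markov chain over historical sequences; the tool names and the smoothing floor are assumptions:

```python
import math
from collections import defaultdict

# Sketch: first-order Markov baseline over historical tool-call sequences.
def build_transition_probs(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def sequence_anomaly(seq, probs, floor=1e-3):
    # Average negative log-likelihood per transition: rare combinations score higher.
    nll = sum(-math.log(probs.get(a, {}).get(b, floor)) for a, b in zip(seq, seq[1:]))
    return nll / max(len(seq) - 1, 1)

history = [["search", "read_file", "respond"],
           ["search", "read_file", "edit_file", "respond"]]
probs = build_transition_probs(history)
print(sequence_anomaly(["search", "read_file", "respond"], probs))      # familiar pattern, low score
print(sequence_anomaly(["read_file", "send_email", "respond"], probs))  # rare combination, high score
```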
Dimension 3: Resource Consumption Baseline
Track the Agent’s resource consumption patterns:
- Token consumption rate
- API call frequency
- Task completion time
Anomaly signals include:
- Token consumption suddenly spikes (possibly stuck in a loop)
- Task duration abnormally extended (possibly under attack or encountering issues)
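A sketch of one way to watch the resource dimension, using a rolling z-score over per-task token consumption; the window size and threshold are assumptions to tune:

```python
from collections import deque
import statistics

# Sketch: flag a task whose token consumption deviates sharply from the recent baseline.
class ResourceBaseline:
    def __init__(self, window: int = 200, z_threshold: float = 4.0):
        self.history = deque(maxlen=window)
        self.z_threshold = z_threshold

    def check(self, tokens: int) -> bool:
        anomalous = False
        if len(self.history) >= 30:  # require enough samples before judging
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history) or 1.0
            anomalous = (tokens - mean) / stdev > self.z_threshold
        self.history.append(tokens)
        return anomalous

baseline = ResourceBaseline()
for t in [1200, 1500, 900, 1300] * 10:   # normal traffic
    baseline.check(t)
print(baseline.check(45_000))            # sudden spike, e.g. an Agent stuck in a loop
```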
Comprehensive Risk Scoring
Multiple dimensions of anomaly signals need to be synthesized. I use a weighted scoring model:
Risk Score = α × Semantic Drift Score
+ β × Tool Call Anomaly Score
+ γ × Resource Consumption Anomaly Score
+ δ × High-Risk Operation Weight
Here, the High-Risk Operation Weight is a set of hardcoded rules for specific dangerous behaviors. For example:
- Executing `rm -rf` → directly raises the risk score
- Accessing credential files → directly raises the risk score
- Sending large amounts of data externally → directly raises the risk score
Based on risk score, set tiered responses:
- Low risk (0-30): Log, incorporate into long-term profiling
- Medium risk (30-70): Trigger manual review
- High risk (70-100): Automatically isolate Agent, trigger alert
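Putting the dimensions together, here is a sketch of the weighted scoring and tiered response. The weights, the 0-1 normalization of each dimension, and the thresholds are all assumptions to be tuned per deployment:

```python
# Sketch: combine per-dimension anomaly scores (assumed normalized to 0-1) into a 0-100 risk score.
WEIGHTS = {"semantic": 30, "tool": 30, "resource": 20, "high_risk_ops": 20}  # assumed weights

def risk_score(semantic: float, tool: float, resource: float, high_risk_ops: float) -> float:
    return (WEIGHTS["semantic"] * semantic
            + WEIGHTS["tool"] * tool
            + WEIGHTS["resource"] * resource
            + WEIGHTS["high_risk_ops"] * high_risk_ops)

def respond(score: float) -> str:
    if score < 30:
        return "log"            # low risk: record, feed long-term profiling
    if score < 70:
        return "manual_review"  # medium risk: trigger human review
    return "isolate_and_alert"  # high risk: isolate the Agent, raise an alert

score = risk_score(semantic=0.2, tool=0.9, resource=0.3, high_risk_ops=1.0)
print(score, respond(score))   # 59.0 manual_review
```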
3. From Observability to Operability
The ultimate goal of observability isn’t “seeing system state clearly,” but making the system operable.
What does “operable” mean? My understanding is: being able to make decisions based on data and form closed loops.
Loop 1: Debugging Loop
When an Agent has issues, quickly locate the root cause:
Anomaly alert → View execution trace → Locate specific step → Analyze reasoning process → Discover Prompt issue → Modify Prompt → Verify fix
This loop requires:
- Complete execution trace tracking
- Queryable Prompt and Response
- Ability to “replay” historical tasks
Loop 2: Cost Loop
When costs exceed expectations, quickly find the cause and optimize:
Cost alert → Break down to Agent/task/stage → Discover RAG retrieval returning too much → Optimize retrieval strategy → Costs decrease
This loop requires:
- Fine-grained cost attribution
- Correlation between cost and business metrics (cost per unit of output)
- Before/after optimization comparison analysis
Loop 3: Security Loop
When anomalies are detected, quickly respond and harden:
Anomaly detection → Assess risk level → Automatic/manual handling → Post-incident analysis → Update baseline/policy → Prevent similar issues
This loop requires:
- Real-time anomaly detection capability
- Automated handling measures (isolation, blocking)
- Dynamic baseline update mechanism
The Key to Operability: Data → Insight → Action
Connecting these three loops, we see a common pattern:
Data collection → Data processing → Insight generation → Decision support → Action execution → Feedback learning
The value of an observability system is reflected in every link of this chain:
- Data collection: Content from previous articles
- Data processing: Correlation, aggregation, noise reduction
- Insight generation: Baseline comparison, anomaly detection, attribution analysis
- Decision support: Risk assessment, optimization recommendations
- Action execution: Alerting, blocking, approval
- Feedback learning: Baseline updates, policy optimization
Only by completing this loop can observability truly become “operable.”
Series Summary
With this, the Agent Observability series comes to a close. Let’s review the core points from the four articles:
Article 1: Why Agents Make “Visibility” Harder
- Agent execution paths are probabilistic and unpredictable
- Traditional monitoring can only answer “what was done,” not “why”
- Audit, runtime, and network operate independently, lacking correlation
Article 2: How I Understand Agent Key Objects
- Four core objects: Agent, Model Source, Tool, Agent-to-Agent Link
- Two observation dimensions: Asset topology (static) + Execution trace (dynamic)
- Three link types: A2L, A2T, A2A
Article 3: Practical Choices for Collection and Reconstruction
- Three collection paradigms: SDK instrumentation, network proxy, runtime observation
- Misalignment between audit and runtime needs correlation mechanisms to bridge
- Reconstructability > Full collection
Article 4: From Data to Judgment
- Cost attribution requires stage-based metering and efficiency analysis
- Anomaly detection requires multi-dimensional behavioral baselines
- The goal of observability is “operability,” requiring closed loops
Observability in the Agent era is indeed more complex than the traditional microservices era. But the good news is many fundamental principles are transferable—they just need adaptation for Agent characteristics.
I hope this series provides some inspiration for those working on Agent observability. If you have different thoughts or practical experiences, I’d welcome hearing them.