Part 3: Production

Chapter 16: Observability and Monitoring

Once your agent is in production, you need visibility into what it's doing.

What to Monitor

Conversations: What are users asking? Where do conversations succeed or fail?
Agent behaviour: Which tools are being used? What reasoning paths is the agent taking?
Performance: Latency, token consumption, error rates
Quality signals: User feedback, escalation rates, conversation abandonment
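
As a rough illustration of how these four categories might be captured, the sketch below logs one record per conversation and rolls the records up into summary metrics. The ConversationRecord fields and the summarize helper are hypothetical names chosen for this example, not part of any particular framework.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class ConversationRecord:
    """One finished conversation. All field names are illustrative assumptions."""
    latency_ms: float          # performance: end-to-end latency
    tokens_used: int           # performance: token consumption
    tools_called: list         # agent behaviour: names of tools invoked
    errored: bool = False      # performance: did the conversation hit an error?
    escalated: bool = False    # quality signal: handed off to a human
    abandoned: bool = False    # quality signal: user left before resolution


def summarize(records):
    """Aggregate per-conversation records into the metrics listed above."""
    n = len(records)
    if n == 0:
        return {}
    tool_usage = Counter(tool for r in records for tool in r.tools_called)
    return {
        "conversations": n,
        "avg_latency_ms": mean(r.latency_ms for r in records),
        "total_tokens": sum(r.tokens_used for r in records),
        "error_rate": sum(r.errored for r in records) / n,
        "escalation_rate": sum(r.escalated for r in records) / n,
        "abandonment_rate": sum(r.abandoned for r in records) / n,
        "tool_usage": dict(tool_usage),
    }


records = [
    ConversationRecord(100.0, 50, ["search"]),
    ConversationRecord(300.0, 150, ["search", "calendar"], errored=True),
]
summary = summarize(records)
print(summary)
```

In practice these records would probably be written to a metrics store rather than held in memory, but the same fields tend to be the ones worth keeping.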

Tracing

Tracing lets you see exactly what happened in a conversation: the full sequence of reasoning, tool calls, and outputs. It is essential for debugging, because it shows which step went wrong rather than only that the final answer was bad.

A good trace shows:

The user's input
The agent's reasoning at each step
Tool calls and their results
The final output
Timing for each step
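
The elements above can be sketched as a small data structure. This is a minimal, standard-library-only example; the Trace and Step classes and the get_weather tool call are hypothetical, and a real deployment would more likely rely on a dedicated tracing library.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    kind: str                      # "reasoning" or "tool_call" (illustrative labels)
    detail: str                    # the reasoning text or the tool invocation
    result: Optional[str] = None   # tool result, if any
    duration_ms: float = 0.0       # filled in when the step finishes


@dataclass
class Trace:
    user_input: str
    steps: list = field(default_factory=list)
    final_output: Optional[str] = None

    @contextmanager
    def step(self, kind, detail):
        """Record one step and time it, even if the body raises."""
        s = Step(kind=kind, detail=detail)
        start = time.perf_counter()
        try:
            yield s
        finally:
            s.duration_ms = (time.perf_counter() - start) * 1000
            self.steps.append(s)


trace = Trace(user_input="What's the weather in Paris?")
with trace.step("reasoning", "User wants current weather; call the weather tool"):
    pass  # stand-in for the model's reasoning
with trace.step("tool_call", "get_weather(city='Paris')") as s:
    s.result = "18C, cloudy"  # stand-in for a real tool response
trace.final_output = "It's 18C and cloudy in Paris."

for s in trace.steps:
    print(f"{s.kind}: {s.detail} -> {s.result} ({s.duration_ms:.2f} ms)")
```

Recording the duration in a finally block means a step that fails partway through still appears in the trace, which is usually the step you most want to see.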

Alerting

Don't just collect data — act on it. Set alerts for:

Error rate spikes
Latency increases
Unusual patterns (e.g., sudden increase in escalations)
Safety-related triggers
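
One simple way to implement alerts like these is to compare a recent window of metrics against a baseline. The sketch below assumes a hypothetical WindowStats record and an arbitrary spike factor of 2.0; appropriate thresholds depend on your traffic and should be tuned rather than copied.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Metrics over one time window. Field names are illustrative assumptions."""
    error_rate: float
    p95_latency_ms: float
    escalation_rate: float
    safety_flags: int


def check_alerts(current, baseline, spike_factor=2.0):
    """Return the names of alerts that fire for the current window.

    A metric fires when it exceeds spike_factor times its baseline value.
    Any safety flag fires immediately, regardless of the baseline.
    """
    alerts = []
    if current.error_rate > baseline.error_rate * spike_factor:
        alerts.append("error rate spike")
    if current.p95_latency_ms > baseline.p95_latency_ms * spike_factor:
        alerts.append("latency increase")
    if current.escalation_rate > baseline.escalation_rate * spike_factor:
        alerts.append("unusual escalation rate")
    if current.safety_flags > 0:
        alerts.append("safety trigger")
    return alerts


baseline = WindowStats(error_rate=0.02, p95_latency_ms=1200.0,
                       escalation_rate=0.05, safety_flags=0)
current = WindowStats(error_rate=0.09, p95_latency_ms=1300.0,
                      escalation_rate=0.06, safety_flags=1)
print(check_alerts(current, baseline))
```

Comparing against a baseline rather than a fixed number keeps the alert meaningful as normal traffic levels change, while the safety check is deliberately absolute.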

Key Principle

Observability is how you turn "it's not working" into "here's exactly why it's not working." Invest in good observability early — it pays dividends when debugging production issues.