Part 3: Production
Chapter 16: Observability and Monitoring
Once your agent is in production, you need visibility into what it's doing.
What to Monitor
- Conversations: What are users asking? Where do conversations succeed or fail?
- Agent behaviour: Which tools are being used? What reasoning paths is the agent taking?
- Performance: Latency, token consumption, error rates
- Quality signals: User feedback, escalation rates, conversation abandonment
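To make the performance and quality signals above concrete, here is a minimal sketch that turns per-conversation records into a handful of summary metrics. The `ConversationRecord` fields and the `summarize` function are hypothetical names chosen for illustration, not part of any particular framework, and the 95th-percentile latency uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class ConversationRecord:
    # Hypothetical per-conversation record; adapt the fields to your own logs.
    latency_ms: float
    tokens: int
    errored: bool = False
    escalated: bool = False
    abandoned: bool = False


def summarize(records: list[ConversationRecord]) -> dict[str, float]:
    """Aggregate raw records into the metrics listed above."""
    if not records:
        return {}
    n = len(records)
    latencies = sorted(r.latency_ms for r in records)
    # Nearest-rank percentile: the smallest value covering 95% of records.
    p95_index = math.ceil(0.95 * n) - 1
    return {
        "conversations": n,
        "p95_latency_ms": latencies[p95_index],
        "avg_tokens": sum(r.tokens for r in records) / n,
        "error_rate": sum(r.errored for r in records) / n,
        "escalation_rate": sum(r.escalated for r in records) / n,
        "abandonment_rate": sum(r.abandoned for r in records) / n,
    }


records = [
    ConversationRecord(latency_ms=100, tokens=50),
    ConversationRecord(latency_ms=200, tokens=150, errored=True),
    ConversationRecord(latency_ms=300, tokens=100, escalated=True),
    ConversationRecord(latency_ms=400, tokens=100, abandoned=True),
]
print(summarize(records))
```

Rates are reported as fractions of conversations rather than raw counts, so they stay comparable as traffic grows.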
Tracing
Tracing lets you see exactly what happened in a conversation: the full sequence of reasoning, tool calls, and outputs. It is essential for debugging, because it shows you the specific step where a conversation went wrong.
A good trace shows:
- The user's input
- The agent's reasoning at each step
- Tool calls and their results
- The final output
- Timing for each step
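The elements above map naturally onto a small data structure. The following is a minimal sketch, assuming you record traces yourself rather than through a tracing library; `Trace`, `Step`, and the `lookup_order` tool are hypothetical names used only for illustration.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    kind: str                 # e.g. "reasoning" or "tool_call"
    detail: dict              # free-form payload: thought, tool name, result
    duration_ms: float = 0.0  # filled in when the step finishes


@dataclass
class Trace:
    user_input: str
    steps: list[Step] = field(default_factory=list)
    final_output: Optional[str] = None

    @contextmanager
    def step(self, kind: str, **detail):
        """Record one step and time it, even if the body raises."""
        record = Step(kind=kind, detail=dict(detail))
        start = time.perf_counter()
        try:
            yield record
        finally:
            record.duration_ms = (time.perf_counter() - start) * 1000
            self.steps.append(record)


trace = Trace(user_input="Where is my order?")
with trace.step("reasoning", thought="Need the order status"):
    pass  # the agent's reasoning would run here
with trace.step("tool_call", tool="lookup_order", args={"order_id": "A123"}) as s:
    s.detail["result"] = {"status": "shipped"}  # stand-in for a real tool result
trace.final_output = "Your order has shipped."

for s in trace.steps:
    print(s.kind, s.detail, f"{s.duration_ms:.2f} ms")
```

Appending the step in a `finally` block means a failed tool call still appears in the trace with its timing, which is usually the step you most want to inspect.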
Alerting
Don't just collect data — act on it. Set alerts for:
- Error rate spikes
- Latency increases
- Unusual patterns (e.g., a sudden increase in escalations)
- Safety-related triggers
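One simple way to implement alerts like these is to compare the current window of metrics against a baseline window. The sketch below assumes that approach; `AlertRule`, `evaluate`, the metric names, and the thresholds are all hypothetical and would need tuning for a real system.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    metric: str
    max_ratio: float        # fire when current / baseline reaches this factor
    min_value: float = 0.0  # ignore values below this floor to reduce noise


def evaluate(rules, baseline, current):
    """Return the names of metrics whose rules fire for the current window."""
    fired = []
    for rule in rules:
        now = current.get(rule.metric, 0.0)
        before = baseline.get(rule.metric, 0.0)
        if now < rule.min_value:
            continue  # too small to matter, regardless of ratio
        if before == 0:
            # No baseline to compare against: any nonzero value fires.
            if now > 0:
                fired.append(rule.metric)
        elif now / before >= rule.max_ratio:
            fired.append(rule.metric)
    return fired


rules = [
    AlertRule("error_rate", max_ratio=2.0, min_value=0.01),
    AlertRule("p95_latency_ms", max_ratio=1.5),
    AlertRule("escalation_rate", max_ratio=2.0, min_value=0.02),
    AlertRule("safety_triggers", max_ratio=1.0, min_value=1),
]
baseline = {"error_rate": 0.02, "p95_latency_ms": 1000,
            "escalation_rate": 0.05, "safety_triggers": 0}
current = {"error_rate": 0.05, "p95_latency_ms": 1200,
           "escalation_rate": 0.06, "safety_triggers": 1}

print(evaluate(rules, baseline, current))
```

In this example the error rate has more than doubled and a safety trigger has appeared, so those two rules fire, while the smaller latency and escalation changes stay below their thresholds. Setting `max_ratio=1.0` with `min_value=1` on `safety_triggers` makes that rule fire on any occurrence, since safety events generally warrant attention even without a spike.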
Key Principle
Observability is how you turn "it's not working" into "here's exactly why it's not working." Invest in good observability early — it pays dividends when debugging production issues.
