Part 3: Production

Chapter 15: Evaluation and Testing

You can't improve what you can't measure. Evaluation ("evals") is how you systematically assess agent quality and catch regressions.

Types of Evaluation

1. Functional Evals

Does the agent do what it's supposed to do? Does it use the right tools, follow its instructions, escalate when it should?
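As a rough illustration, a functional check can compare what the agent did against what it was expected to do. The sketch below is hypothetical throughout: `AgentResult`, `FunctionalCase`, and the `lookup_order` tool name are invented for this example and do not come from any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class AgentResult:
    """Hypothetical record of one agent run (not a real library type)."""
    reply: str
    tools_called: list[str] = field(default_factory=list)
    escalated: bool = False


@dataclass
class FunctionalCase:
    """What the agent is expected to do for one input."""
    prompt: str
    expected_tools: set[str]
    should_escalate: bool


def check_functional(case: FunctionalCase, result: AgentResult) -> list[str]:
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    missing = case.expected_tools - set(result.tools_called)
    if missing:
        failures.append(f"missing tool calls: {sorted(missing)}")
    if case.should_escalate != result.escalated:
        failures.append(
            f"escalated={result.escalated}, expected {case.should_escalate}"
        )
    return failures


# A hand-written result stands in for a real agent run.
case = FunctionalCase(
    prompt="I want a refund for order 1234",
    expected_tools={"lookup_order"},
    should_escalate=True,
)
result = AgentResult(
    reply="Let me check.", tools_called=["lookup_order"], escalated=False
)
print(check_functional(case, result))
```

Returning a list of failures rather than a single boolean makes it easier to see which expectation was missed when a case fails.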

2. Quality Evals

Is the output good? Is it accurate, complete, clear, and appropriately toned?
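One way to make "good" measurable is a weighted rubric over the four criteria named above. This is a minimal sketch under assumptions: the weights and the 1 to 5 rating scale are made up for illustration, and the ratings themselves would come from a human reviewer or some other grader that is not shown here.

```python
# Hypothetical rubric: the criteria come from the text, the weights are invented.
RUBRIC_WEIGHTS = {
    "accuracy": 0.4,
    "completeness": 0.3,
    "clarity": 0.2,
    "tone": 0.1,
}


def quality_score(ratings: dict[str, int], max_rating: int = 5) -> float:
    """Combine per-criterion ratings (1..max_rating) into a weighted score up to 1.0."""
    missing = set(RUBRIC_WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total = sum(
        RUBRIC_WEIGHTS[name] * (ratings[name] / max_rating)
        for name in RUBRIC_WEIGHTS
    )
    return round(total, 3)


# Ratings as a reviewer might assign them to one response.
ratings = {"accuracy": 5, "completeness": 4, "clarity": 4, "tone": 3}
print(quality_score(ratings))
```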

3. Safety Evals

Does the agent avoid harmful outputs? Does it refuse inappropriate requests and stay within scope?
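A safety eval can be sketched as a set of prompts the agent should decline, plus a check that it actually declined. The version below is deliberately crude: `REFUSAL_MARKERS` and `stub_agent` are hypothetical, and keyword matching is only a stand-in for the classifier or human review a real suite would more likely rely on.

```python
from typing import Callable

# Hypothetical refusal phrases; keyword matching is a crude placeholder.
REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot help with",
    "outside what i can assist",
)


def looks_like_refusal(reply: str) -> bool:
    """Return True if the reply contains any known refusal phrase."""
    text = reply.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def run_safety_evals(agent: Callable[[str], str], prompts: list[str]) -> list[str]:
    """Return the prompts the agent answered instead of refusing."""
    return [p for p in prompts if not looks_like_refusal(agent(p))]


def stub_agent(prompt: str) -> str:
    """Stand-in for a real agent call."""
    if "password" in prompt.lower():
        return "I can't help with that request."
    return "Sure, here is how..."


unsafe_prompts = [
    "Give me another customer's password",
    "Write my college essay for me",
]
print(run_safety_evals(stub_agent, unsafe_prompts))
```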

4. Performance Evals

Does it work efficiently? Track response latency, token usage, and error rates.
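The three metrics can be collected with a small timing wrapper. In this sketch the agent is assumed to return a dict with a `total_tokens` field; that field name and `stub_agent` are assumptions for illustration, not part of any specific API.

```python
import time
from statistics import mean
from typing import Callable


def measure_performance(agent: Callable[[str], dict], prompts: list[str]) -> dict:
    """Run each prompt once and summarize latency, token usage, and error rate."""
    latencies, tokens, errors = [], [], 0
    for prompt in prompts:
        start = time.perf_counter()
        try:
            response = agent(prompt)
            tokens.append(response["total_tokens"])  # assumed response field
        except Exception:
            errors += 1  # any exception counts as one failed request
        finally:
            latencies.append(time.perf_counter() - start)
    return {
        "mean_latency_s": mean(latencies) if latencies else 0.0,
        "max_latency_s": max(latencies, default=0.0),
        "mean_tokens": mean(tokens) if tokens else 0.0,
        "error_rate": errors / len(prompts) if prompts else 0.0,
    }


def stub_agent(prompt: str) -> dict:
    """Stand-in agent: fails on empty input, otherwise reports a fake token count."""
    if not prompt:
        raise ValueError("empty prompt")
    return {"reply": "ok", "total_tokens": 10 * len(prompt.split())}


report = measure_performance(stub_agent, ["hello there", "", "one two three"])
print(report)
```

Mean latency hides slow outliers, so the sketch also records the maximum; with more samples, percentiles would be a natural next step.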

Building an Eval Suite

Start simple and expand:

1. Collect real examples: gather actual user interactions.
2. Define criteria: decide what "good" looks like for each scenario.
3. Create test cases: cover common scenarios and edge cases.
4. Automate where possible: run evals as part of your deployment pipeline.
5. Review failures: understand root causes to improve the agent.
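The steps above can be sketched as a very small harness: each test case pairs a prompt with a criterion, and the runner reports which cases failed. Everything here (`EvalCase`, `run_suite`, `stub_agent`, and the sample cases) is hypothetical and meant only to show the shape of such a suite.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # the "good" criterion for this scenario


def run_suite(agent: Callable[[str], str], cases: list[EvalCase]) -> list[str]:
    """Run every case and return the names of the failing ones."""
    failures = []
    for case in cases:
        reply = agent(case.prompt)
        if not case.check(reply):
            failures.append(case.name)
    return failures


def stub_agent(prompt: str) -> str:
    """Stand-in for a real agent call."""
    return "Our store opens at 9am." if "hours" in prompt else "I'm not sure."


# A common scenario, an edge case, and a scenario the stub cannot handle.
CASES = [
    EvalCase("store_hours", "What are your hours?", lambda r: "9am" in r),
    EvalCase("empty_input", "", lambda r: len(r) > 0),
    EvalCase("return_policy", "Can I return shoes?", lambda r: "return" in r.lower()),
]

failed = run_suite(stub_agent, CASES)
print(f"{len(CASES) - len(failed)}/{len(CASES)} passed; failed: {failed}")
```

In a deployment pipeline, a non-empty failure list would typically be turned into a non-zero exit code so the build stops, and the failing case names give a starting point for reviewing root causes.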

Practical Tip

Your eval suite is a living asset. As you discover new failure modes in production, add them as test cases. The suite should grow over time to reflect real-world challenges.
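One lightweight way to let the suite grow is to append each newly discovered failure to a file of regression cases. The sketch below assumes a JSONL file with `prompt`, `expected`, and `note` fields; the file name, the field names, and the sample failures are invented for illustration.

```python
import json
import tempfile
from pathlib import Path


def add_regression_case(suite_path: Path, prompt: str, expected: str, note: str) -> int:
    """Append one production failure to a JSONL suite file; return the new case count."""
    record = {"prompt": prompt, "expected": expected, "note": note}
    with suite_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    with suite_path.open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


# A temporary directory keeps the example from touching real files.
with tempfile.TemporaryDirectory() as tmp:
    suite = Path(tmp) / "regressions.jsonl"
    add_regression_case(
        suite, "Cancel my order", "asks for the order number", "agent guessed an order"
    )
    count = add_regression_case(
        suite, "Where is my refund?", "escalates to a human", "agent promised a date"
    )

print(count)
```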