Agent evaluation uses three grader types: code-based, model-based, and human graders
Insight: Agent evaluation requires three grader types, each with trade-offs:
- Code-based graders: fast and objective, but brittle to valid variations.
- Model-based graders: flexible, but non-deterministic and in need of calibration.
- Human graders: the gold standard, but expensive and slow to scale.

Evaluation strategy varies by agent type: coding agents use deterministic tests and code quality rubrics; conversational agents combine state verification with interaction quality checks; research agents use groundedness and coverage checks; computer use agents verify outcomes in sandboxed GUI environments. Start with 20-50 real tasks, design unambiguous problems with reference solutions, and balance positive and negative cases.
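The code-based versus model-based distinction can be sketched in a few lines. This is a minimal illustration, not from the source: both grader functions and the `judge` callable are hypothetical stand-ins (a real model-based grader would call an LLM with a calibrated rubric prompt).

```python
# Hypothetical graders illustrating the trade-off between grader types.

def code_based_grader(output: str, expected: str) -> bool:
    """Fast and objective, but brittle: an equally valid phrasing
    that differs from the reference fails this check."""
    return output.strip().lower() == expected.strip().lower()

def model_based_grader(output: str, rubric: str, judge) -> float:
    """Flexible but non-deterministic. `judge` is a hypothetical
    LLM call that returns a numeric score as text; in practice it
    needs calibration against human-graded examples."""
    prompt = f"Rubric: {rubric}\n\nCandidate output: {output}\n\nScore 0-1:"
    return float(judge(prompt))
```

A code-based grader would reject "The answer is 42." against the reference "42", while a well-calibrated model-based grader can accept it; that flexibility is exactly what costs determinism.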
Detail: Run evaluations in isolated environments and review transcripts regularly to confirm graders are working as intended. Choose grader types based on task requirements rather than treating evaluation as all-or-nothing.
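A small task set with balanced positive and negative cases, plus transcript capture for review, can be sketched as follows. All names here (`Task`, `run_eval`, the stub agent and grader) are illustrative assumptions, not an API from the source.

```python
# Minimal sketch of an eval loop over a small task set (names hypothetical).
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    reference: str
    should_pass: bool  # balance positive and negative cases

def run_eval(tasks, agent, grader):
    """Run each task, grade the output, and keep transcripts for review."""
    transcripts = []
    correct = 0
    for task in tasks:
        output = agent(task.prompt)
        passed = grader(output, task.reference)
        # Save the full interaction so graders can be audited later.
        transcripts.append({"prompt": task.prompt, "output": output, "passed": passed})
        # A grade counts as correct when it matches the expected outcome,
        # so negative cases verify the grader rejects bad outputs.
        if passed == task.should_pass:
            correct += 1
    return correct / len(tasks), transcripts
```

Starting with 20-50 such tasks keeps the set small enough that every transcript can still be read by hand while grader reliability is being established.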
Sources
- Anthropic — "Demystifying evals for AI agents" (2026-01-09)