Agent evaluation uses three grader types: code-based, model-based, and human graders
Insight: Agent evaluation requires three grader types, each with trade-offs:
- Code-based graders: fast and objective, but brittle to valid variations.
- Model-based graders: flexible, but non-deterministic and in need of calibration.
- Human graders: the gold standard, but expensive and slow to scale.

Evaluation strategy varies by agent type: coding agents use deterministic tests and code quality rubrics; conversational agents combine state verification with interaction quality checks; research agents use groundedness and coverage checks; computer use agents verify outcomes in sandboxed GUI environments. Start with 20-50 real tasks, design unambiguous problems with reference solutions, and balance positive and negative cases.
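The code-based versus model-based distinction can be sketched in a few lines. This is a minimal illustration, not from the source: both grader functions and the `judge` callable are hypothetical stand-ins (a real model-based grader would call an LLM with a calibrated rubric prompt).

```python
# Hypothetical graders illustrating the trade-off between grader types.

def code_based_grader(output: str, expected: str) -> bool:
    """Fast and objective, but brittle: an equally valid phrasing
    that differs from the reference fails this check."""
    return output.strip().lower() == expected.strip().lower()

def model_based_grader(output: str, rubric: str, judge) -> float:
    """Flexible but non-deterministic. `judge` is a hypothetical
    LLM call that returns a numeric score as text; in practice it
    needs calibration against human-graded examples."""
    prompt = f"Rubric: {rubric}\n\nCandidate output: {output}\n\nScore 0-1:"
    return float(judge(prompt))
```

A code-based grader would reject "The answer is 42." against the reference "42", while a well-calibrated model-based grader can accept it; that flexibility is exactly what costs determinism.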
Detail: Run evaluations in isolated environments and review transcripts regularly to confirm graders are working as intended. Choose grader types based on task requirements rather than treating evaluation as all-or-nothing.
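A small task set with balanced positive and negative cases, plus transcript capture for review, can be sketched as follows. All names here (`Task`, `run_eval`, the stub agent and grader) are illustrative assumptions, not an API from the source.

```python
# Minimal sketch of an eval loop over a small task set (names hypothetical).
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    reference: str
    should_pass: bool  # balance positive and negative cases

def run_eval(tasks, agent, grader):
    """Run each task, grade the output, and keep transcripts for review."""
    transcripts = []
    correct = 0
    for task in tasks:
        output = agent(task.prompt)
        passed = grader(output, task.reference)
        # Save the full interaction so graders can be audited later.
        transcripts.append({"prompt": task.prompt, "output": output, "passed": passed})
        # A grade counts as correct when it matches the expected outcome,
        # so negative cases verify the grader rejects bad outputs.
        if passed == task.should_pass:
            correct += 1
    return correct / len(tasks), transcripts
```

Starting with 20-50 such tasks keeps the set small enough that every transcript can still be read by hand while grader reliability is being established.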
Sources
- Anthropic — "Demystifying evals for AI agents" (2026-01-09)