Deterministic agent evals
Agent Eval Bench
Run repeatable YAML suites against Claude, Codex, and OpenAI agents. Keep local CLI control today, then add hosted history and regression checks when the workflow needs a shared dashboard.
Suite status
dd-agent-regression.yaml
8 cases across prompt, JSON, and tone checks
PR summary word cap (regex plus contains scorer): PASS
Messy email extraction (json-shape scorer): PASS
Launch post rewrite (LLM judge threshold): WATCH
YAML
Portable suites that live next to product prompts, fixtures, and release checks.
4 scorers
Contains, regex, JSON shape, and judge prompts cover most agent regression checks.
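To make the four scorer types concrete, here is a minimal Python sketch of what each check might look like. The function names, the shape format (a mapping of key to expected type), and the `judge` callable are hypothetical and chosen for illustration; they are not Agent Eval Bench's actual API, and a real judge scorer would call an LLM rather than a local function.

```python
import json
import re
from typing import Callable


def contains_scorer(output: str, needle: str) -> bool:
    """Pass when the expected substring appears in the output."""
    return needle in output


def regex_scorer(output: str, pattern: str) -> bool:
    """Pass when the pattern matches somewhere in the output."""
    return re.search(pattern, output) is not None


def json_shape_scorer(output: str, shape: dict[str, type]) -> bool:
    """Pass when the output is a JSON object with the expected keys and value types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected)
        for key, expected in shape.items()
    )


def judge_scorer(output: str, judge: Callable[[str], float], threshold: float) -> bool:
    """Pass when the judge's score meets the threshold.

    `judge` is a stand-in for an LLM call that returns a numeric score.
    """
    return judge(output) >= threshold


# Hypothetical pattern for a three-word cap (at most 3 whitespace-separated words).
WORD_CAP_3 = r"^\s*(?:\S+\s+){0,2}\S+\s*$"
```

A case such as a word-capped summary could combine two of these, for example requiring both `regex_scorer(output, WORD_CAP_3)` and `contains_scorer(output, "retry")` to pass.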
History
Hosted runs preserve pass rates, failures, and case details for team review.
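As one way to picture a regression check over run history, the sketch below compares two runs stored as plain mappings of case name to pass/fail. The data layout and the function names `pass_rate` and `regressions` are assumptions made for this example, not the hosted product's schema.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


def regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Cases that passed in the previous run and fail in the current one.

    Cases missing from the current run are ignored rather than counted as failures.
    """
    return sorted(
        case
        for case, passed in previous.items()
        if passed and current.get(case) is False
    )


# Hypothetical run data keyed by case name.
previous_run = {"pr-summary": True, "email-extraction": True, "launch-post": False}
current_run = {"pr-summary": True, "email-extraction": False, "launch-post": True}

regressed = regressions(previous_run, current_run)
```

Tracking regressions per case, rather than only the aggregate pass rate, catches the situation above where the rate is unchanged but a previously passing case now fails.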