Deterministic agent evals

Agent Eval Bench

Run repeatable YAML suites against Claude, Codex, and OpenAI agents. Keep local CLI control today, then add hosted history and regression checks when the workflow needs a shared dashboard.

Suite status

dd-agent-regression.yaml

8 cases across prompt, JSON, and tone checks

PR summary word cap (regex plus contains scorer): PASS
Messy email extraction (json-shape scorer): PASS
Launch post rewrite (LLM judge threshold): WATCH

YAML

Portable suites that live next to product prompts, fixtures, and release checks.

4 scorers

Contains, regex, JSON shape, and judge prompts cover most agent regression checks.
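
As a rough illustration of how the three deterministic scorers might work, here is a minimal Python sketch. The function names, signatures, and the "required keys with types" notion of JSON shape are assumptions made for this example, not Agent Eval Bench's actual API. The judge scorer is left out because it depends on a model call.

```python
import json
import re


def contains_scorer(output: str, needle: str) -> bool:
    """Pass when the expected substring appears in the agent output."""
    return needle in output


def regex_scorer(output: str, pattern: str) -> bool:
    """Pass when the pattern matches anywhere in the agent output."""
    return re.search(pattern, output) is not None


def json_shape_scorer(output: str, required: dict) -> bool:
    """Pass when the output parses as a JSON object that has every
    required key with a value of the expected Python type.

    `required` maps key names to types, e.g. {"sender": str}.
    This definition of "shape" is an assumption for the sketch.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected)
        for key, expected in required.items()
    )
```

Because each scorer returns a plain boolean for a given output, the same output always yields the same verdict, which is what keeps these checks repeatable between runs.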

History

Hosted runs preserve pass rates, failures, and case details for team review.