Deterministic agent evals

Agent Eval Bench

Run repeatable YAML suites against Claude, Codex, and OpenAI agents. Keep local CLI control today, then add hosted history and regression checks when the workflow needs a shared dashboard.

See pricing | CLI on GitHub

Suite status

dd-agent-regression.yaml

8 cases across prompt, JSON, and tone checks

PR summary word cap (regex plus contains scorer): PASS

Messy email extraction (json-shape scorer): PASS

Launch post rewrite (LLM judge threshold): WATCH

YAML

Portable suites that live next to product prompts, fixtures, and release checks.

4 scorers

Contains, regex, JSON shape, and judge prompts cover most agent regression checks.
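As a rough illustration of what the three deterministic scorers check, here is a minimal Python sketch. The function names, signatures, and boolean pass/fail return values are assumptions made for this example and are not Agent Eval Bench's actual API. The judge scorer is left out because it depends on a model call.

```python
import json
import re


def score_contains(output: str, needle: str) -> bool:
    """Pass when the expected substring appears in the output."""
    return needle in output


def score_regex(output: str, pattern: str) -> bool:
    """Pass when the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def score_json_shape(output: str, required: dict[str, type]) -> bool:
    """Pass when the output parses as a JSON object and every required
    key is present with the expected Python type."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected)
        for key, expected in required.items()
    )


if __name__ == "__main__":
    # Hypothetical agent outputs, loosely modelled on the cases above.
    print(score_contains("Fixes the login redirect bug.", "login"))
    print(score_regex("Summary: 12 words", r"^Summary: \d+ words$"))
    print(score_json_shape(
        '{"sender": "a@b.co", "urgent": true}',
        {"sender": str, "urgent": bool},
    ))
```

Because each of these scorers is a pure function of the output text, the same output always produces the same verdict, which is what makes this part of a suite repeatable.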

History

Hosted runs preserve pass rates, failures, and case details for team review.
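To make the pass-rate idea concrete, the sketch below compares two runs of the same suite and flags a drop. The helper names and the `tolerance` parameter are hypothetical and chosen for illustration; the hosted product may compute or store these values differently.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of cases that passed; an empty run counts as 0.0."""
    if not results:
        return 0.0
    return sum(results) / len(results)


def has_regressed(
    previous: list[bool], current: list[bool], tolerance: float = 0.0
) -> bool:
    """True when the current pass rate falls below the previous one
    by more than the allowed tolerance."""
    return pass_rate(current) < pass_rate(previous) - tolerance


if __name__ == "__main__":
    last_run = [True] * 8             # hypothetical: all 8 cases passed
    this_run = [True] * 7 + [False]   # hypothetical: one case now fails
    print(pass_rate(this_run))
    print(has_regressed(last_run, this_run))
```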