agentix.evals
Evaluation harness.
Run an agent over a dataset of golden cases, score each, and get a report you can assert on in CI — so a prompt or model change that regresses quality fails the build instead of shipping.
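from agentix.evals import Case, contains, evaluate

cases = [
    Case("What is 2+2?", expected="4"),
    Case("Capital of France?", expected="Paris"),
]
# `agent` is an existing Agent; run this inside an async function.
report = await evaluate(cases, agent, scorer=contains())
report.assert_pass_rate(0.9)  # raises if quality dropped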
Scorers are callables (outcome, case) -> bool | Score, either sync or async.
The module ships exact-match, contains, regex, predicate, and LLM-as-judge
scorers; write your own for anything deterministic. MockModel (deterministic)
and Store (transcript snapshots) make eval runs reproducible.
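A custom deterministic scorer could look like the sketch below. The `Outcome` and `Case` classes are hypothetical stand-ins so the snippet runs on its own; in particular, the `output` attribute on the outcome is an assumption, not something this page documents.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical stand-ins so the sketch is self-contained; the real
# outcome object may expose the answer under a different attribute.
@dataclass
class Outcome:
    output: str


@dataclass
class Case:
    input: str
    expected: Any = None


def numeric_close(tolerance: float = 1e-6):
    """Build a scorer that passes when the answer parses to a number near case.expected."""

    def score(outcome: Outcome, case: Case) -> bool:
        try:
            return abs(float(outcome.output) - float(case.expected)) <= tolerance
        except (TypeError, ValueError):
            # Unparseable answers (or a missing expected value) count as failures.
            return False

    return score


scorer = numeric_close(tolerance=0.01)
print(scorer(Outcome("4.0"), Case("What is 2+2?", expected="4")))   # True
print(scorer(Outcome("five"), Case("What is 2+2?", expected="4")))  # False
```

Returning a plain bool keeps the scorer within the documented `(outcome, case) -> bool | Score` shape without depending on how `Score` is constructed.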
Case dataclass
Case(
    input: str,
    expected: Any = None,
    scorer: Scorer | None = None,
    id: str | None = None,
    metadata: dict[str, Any] = dict(),
)
One eval case: an input, an optional expected answer, an optional per-case scorer (overrides the default), and metadata.
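The override rule can be sketched as follows. The `Case` class is a stand-in mirroring the documented fields (with `field(default_factory=dict)` assumed for `metadata`), and `resolve_scorer` is a hypothetical helper, not part of the documented API. Raising when neither scorer is set is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

# Simplified: the documented Scorer type also allows Score results and async callables.
Scorer = Callable[[Any, "Case"], bool]


@dataclass
class Case:  # stand-in mirroring the documented fields
    input: str
    expected: Any = None
    scorer: Optional[Scorer] = None
    id: Optional[str] = None
    metadata: dict[str, Any] = field(default_factory=dict)  # assumed default_factory


def resolve_scorer(case: Case, default: Optional[Scorer]) -> Scorer:
    """Hypothetical helper: a case's own scorer wins over the default."""
    chosen = case.scorer if case.scorer is not None else default
    if chosen is None:
        raise ValueError(f"no scorer for case {case.id or case.input!r}")
    return chosen


def always_pass(outcome: Any, case: Case) -> bool:
    return True


def always_fail(outcome: Any, case: Case) -> bool:
    return False


plain = Case("Capital of France?", expected="Paris")
special = Case("What is 2+2?", expected="4", scorer=always_pass)
print(resolve_scorer(plain, always_fail).__name__)    # always_fail
print(resolve_scorer(special, always_fail).__name__)  # always_pass
```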
EvalReport dataclass
load_cases
Load eval Case objects from a dataset file. The format is chosen by extension:
- .jsonl: one JSON object per line
- .json: a JSON array of objects
- .csv: a header row with at least an input column
Recognized fields are input (required), expected, id, and
metadata; any other keys/columns are folded into metadata. Per-case
scorers aren't loadable from data — set them in code after loading.
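To illustrate the folding rule, here is a minimal sketch for the `.jsonl` case. `row_to_fields` and `parse_jsonl` are hypothetical names, and two behaviors are assumptions: blank lines are skipped, and an extra key overwrites a same-named key inside an explicit `metadata` object.

```python
import json
from typing import Any

RECOGNIZED = ("input", "expected", "id", "metadata")


def row_to_fields(row: dict[str, Any]) -> dict[str, Any]:
    """Split one dataset row into Case keyword arguments, folding extras into metadata."""
    if "input" not in row:
        raise ValueError("every row needs an 'input' field")
    metadata = dict(row.get("metadata") or {})
    # Any key that is not a recognized field becomes a metadata entry.
    metadata.update({k: v for k, v in row.items() if k not in RECOGNIZED})
    return {
        "input": row["input"],
        "expected": row.get("expected"),
        "id": row.get("id"),
        "metadata": metadata,
    }


def parse_jsonl(text: str) -> list[dict[str, Any]]:
    """Parse JSONL text: one JSON object per non-blank line."""
    return [row_to_fields(json.loads(line)) for line in text.splitlines() if line.strip()]


sample = '{"input": "Capital of France?", "expected": "Paris", "topic": "geo"}\n'
print(parse_jsonl(sample))
```

Each returned dict could then be passed as keyword arguments to `Case`, with per-case scorers attached afterwards in code.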
evaluate async
evaluate(
dataset: Sequence[Case],
agent: Agent | AgentFactory,
*,
scorer: Scorer | None = None,
concurrency: int = 1,
) -> EvalReport
Run agent over dataset and score each case.
agent may be a single Agent (reused across cases, which is fine for stateless
models) or a factory Callable[[Case], Agent] (a fresh agent per case, needed
for stateful models like a scripted MockModel). scorer is the default;
a case's own scorer overrides it. concurrency sets how many cases run in
parallel (default 1).
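The agent-or-factory and concurrency behavior described above could be implemented along these lines. This is a sketch under stated assumptions, not the library's code: `run_cases`, `EchoAgent`, and its `run` method are hypothetical, the scorer receives the answer string directly instead of an outcome object, and the result is a plain list of booleans where the real function returns an EvalReport.

```python
import asyncio
import inspect
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Case:  # stand-in with only the fields this sketch needs
    input: str
    expected: Any = None
    scorer: Optional[Callable] = None


class EchoAgent:  # hypothetical agent; `run` is an assumed method name
    async def run(self, prompt: str) -> str:
        await asyncio.sleep(0)
        return f"answer to {prompt}: 4"


async def run_cases(dataset, agent, *, scorer=None, concurrency=1):
    """Run every case with at most `concurrency` in flight; return pass/fail per case."""
    gate = asyncio.Semaphore(concurrency)

    async def run_one(case: Case) -> bool:
        async with gate:
            # Treat objects without `run` as factories that build a fresh agent per case.
            current = agent if hasattr(agent, "run") else agent(case)
            answer = await current.run(case.input)
            verdict = (case.scorer or scorer)(answer, case)
            # Scorers may be sync or async, so await only when needed.
            return bool(await verdict if inspect.isawaitable(verdict) else verdict)

    # gather returns results in dataset order, regardless of completion order.
    return await asyncio.gather(*(run_one(c) for c in dataset))


def contains_expected(answer: str, case: Case) -> bool:
    return str(case.expected) in answer


cases = [Case("What is 2+2?", expected="4"), Case("Capital of France?", expected="Paris")]
results = asyncio.run(
    run_cases(cases, lambda case: EchoAgent(), scorer=contains_expected, concurrency=2)
)
print(results)  # [True, False]
```

A semaphore bounds the number of in-flight cases while still scheduling all of them up front, which keeps the ordering of results stable.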
exact_match
Pass if the answer equals case.expected exactly.
contains
Pass if case.expected appears in the answer.
regex_match
Pass if the answer matches pattern (or case.expected if omitted).
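The fallback to case.expected can be pictured with the stand-in below. `regex_scorer` is a hypothetical equivalent, not the library's implementation, and it assumes `re.search` semantics (a match anywhere in the answer); the real scorer may anchor the match differently. The scorer here takes the answer as a plain string.

```python
import re
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Case:  # stand-in
    input: str
    expected: Any = None


def regex_scorer(pattern: Optional[str] = None):
    """Hypothetical equivalent: use `pattern`, or fall back to case.expected."""

    def score(answer: str, case: Case) -> bool:
        source = pattern if pattern is not None else str(case.expected)
        # re.search is an assumption; the real scorer may anchor the match.
        return re.search(source, answer) is not None

    return score


case = Case("Capital of France?", expected=r"\bParis\b")
print(regex_scorer()("It is Paris.", case))          # True
print(regex_scorer(r"^\d+$")("It is Paris.", case))  # False
```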
predicate
Wrap an arbitrary (outcome, case) -> bool check.
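As an illustration of wrapping a check, the sketch below uses a hypothetical `wrap_check` helper in place of the real `predicate`, whose exact signature this page does not show. Plain strings and a dict stand in for the outcome and case objects.

```python
from typing import Any, Callable


def wrap_check(check: Callable[[Any, Any], bool], name: str = "predicate"):
    """Hypothetical wrapper: turn a plain check into a named scorer returning a strict bool."""

    def score(outcome: Any, case: Any) -> bool:
        return bool(check(outcome, case))

    score.__name__ = name
    return score


# A case-insensitive comparison, using a plain string and a dict as stand-ins.
same_ignoring_case = wrap_check(
    lambda outcome, case: outcome.strip().lower() == str(case["expected"]).lower(),
    name="same_ignoring_case",
)
print(same_ignoring_case(" PARIS ", {"expected": "Paris"}))  # True
```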
llm_judge
LLM-as-judge: a model scores the answer's correctness/faithfulness.