agentix.evals

evals

Evaluation harness.

Run an agent over a dataset of golden cases, score each, and get a report you can assert on in CI — so a prompt or model change that regresses quality fails the build instead of shipping.

cases = [
    Case("What is 2+2?", expected="4"),
    Case("Capital of France?", expected="Paris"),
]
report = await evaluate(cases, agent, scorer=contains())
report.assert_pass_rate(0.9)   # raises if quality dropped

Scorers are callables (outcome, case) -> bool | Score, either sync or async. The module ships exact-match, contains, regex, predicate, and LLM-as-judge scorers; write your own for any other deterministic check. MockModel (deterministic) and Store (transcript snapshots) make eval runs reproducible.
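A custom scorer only needs that call shape. The sketch below uses stand-in Outcome and Case classes so it runs without the library; the output attribute on the outcome is an assumption, since this page does not list the fields of AgentOutcome.

```python
from dataclasses import dataclass
from typing import Any


# Stand-ins so the sketch runs on its own; the real types come from agentix.evals.
@dataclass
class Case:
    input: str
    expected: Any = None


@dataclass
class Outcome:
    output: str  # hypothetical field name for the agent's final answer


def numeric_close(tolerance: float = 1e-6):
    """Build a scorer that passes when the answer is numerically close to expected."""

    def score(outcome: Outcome, case: Case) -> bool:
        try:
            return abs(float(outcome.output) - float(case.expected)) <= tolerance
        except (TypeError, ValueError):
            # Non-numeric answer or missing expected value counts as a failure.
            return False

    return score


scorer = numeric_close(tolerance=0.01)
print(scorer(Outcome("4.0"), Case("What is 2+2?", expected="4")))   # True
print(scorer(Outcome("four"), Case("What is 2+2?", expected="4")))  # False
```

Returning a plain bool keeps the scorer deterministic, so the same transcript always yields the same verdict.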

Case dataclass

Case(
    input: str,
    expected: Any = None,
    scorer: Scorer | None = None,
    id: str | None = None,
    metadata: dict[str, Any] = dict(),
)

One eval case: an input, an optional expected answer, an optional per-case scorer (overrides the default), and metadata.

EvalReport dataclass

EvalReport(results: list[CaseResult])

format_success_rate property

format_success_rate: float

Fraction of runs that completed (not aborted / errored) — a proxy for output-format success when an output_validator is configured.

assert_pass_rate

assert_pass_rate(minimum: float) -> None

Raise AssertionError if the pass rate is below minimum (for CI).
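The check amounts to a threshold on the fraction of passing cases. A minimal stand-alone sketch of that logic follows; the Result class, its passed field, and the treatment of an empty result list as a 0.0 pass rate are assumptions, not the library's implementation.

```python
from dataclasses import dataclass


@dataclass
class Result:
    passed: bool  # hypothetical field; CaseResult's real fields are not shown here


def assert_pass_rate(results: list[Result], minimum: float) -> None:
    """Raise AssertionError when the fraction of passing results is below minimum."""
    # Assumption: an empty run counts as a 0.0 pass rate rather than an error.
    rate = sum(r.passed for r in results) / len(results) if results else 0.0
    if rate < minimum:
        raise AssertionError(f"pass rate {rate:.2%} is below required {minimum:.2%}")


results = [Result(True), Result(True), Result(True), Result(False)]

assert_pass_rate(results, 0.75)  # 3 of 4 passed, which is not below 0.75

try:
    assert_pass_rate(results, 0.9)  # 0.75 < 0.9, so this raises
except AssertionError as exc:
    message = str(exc)
    print(message)
```

Because the failure is an AssertionError, a test runner such as pytest reports it as an ordinary failed test.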

load_cases

load_cases(path: str | PathLike[str]) -> list[Case]

Load eval Case objects from a dataset file, by extension:

  • .jsonl — one JSON object per line,
  • .json — a JSON array of objects,
  • .csv — a header row with at least an input column.

Recognized fields are input (required), expected, id, and metadata; any other keys/columns are folded into metadata. Per-case scorers aren't loadable from data — set them in code after loading.
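To make the field handling concrete, here is a sketch of a .jsonl loader that follows the rules above. It is an illustrative reimplementation with a stand-in Case class, not the library's code; skipping blank lines and letting extra keys overwrite same-named entries in an explicit metadata object are assumptions.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Optional

RECOGNIZED = {"input", "expected", "id", "metadata"}


@dataclass
class Case:  # stand-in for agentix.evals.Case, without the scorer field
    input: str
    expected: Any = None
    id: Optional[str] = None
    metadata: dict[str, Any] = field(default_factory=dict)


def load_jsonl(path: Path) -> list[Case]:
    """Read one JSON object per line and fold unrecognized keys into metadata."""
    cases = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored
        row = json.loads(line)
        metadata = dict(row.get("metadata") or {})
        metadata.update({k: v for k, v in row.items() if k not in RECOGNIZED})
        cases.append(Case(row["input"], row.get("expected"), row.get("id"), metadata))
    return cases


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "golden.jsonl"
    path.write_text(
        '{"input": "What is 2+2?", "expected": "4", "id": "math-1"}\n'
        '{"input": "Capital of France?", "expected": "Paris", "topic": "geo"}\n',
        encoding="utf-8",
    )
    cases = load_jsonl(path)

print(cases[1].metadata)  # {'topic': 'geo'}
```

The second row has no id and carries an extra topic key, which ends up in metadata.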

evaluate async

evaluate(
    dataset: Sequence[Case],
    agent: Agent | AgentFactory,
    *,
    scorer: Scorer | None = None,
    concurrency: int = 1,
) -> EvalReport

Run agent over dataset and score each case.

agent may be a single Agent (reused across cases, which is fine for stateless models) or a factory Callable[[Case], Agent] (a fresh agent per case, which is needed for stateful models such as a scripted MockModel). scorer is the default; a case's own scorer overrides it. concurrency sets how many cases run in parallel.
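The agent-versus-factory choice and the concurrency limit can be sketched with plain asyncio. Everything here is a stand-in: EchoAgent, its run method, and run_all are hypothetical names, and the scorer takes the raw answer string rather than an outcome object to keep the example short. It illustrates the pattern, not the library's internals.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class Case:  # stand-in for agentix.evals.Case
    input: str
    expected: Any = None


class EchoAgent:
    """Stand-in agent that looks answers up in a dict; `run` is a hypothetical method."""

    def __init__(self, answers: dict[str, str]) -> None:
        self.answers = answers

    async def run(self, prompt: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real model call would
        return self.answers.get(prompt, "")


async def run_all(cases, agent_or_factory, scorer, concurrency=1):
    gate = asyncio.Semaphore(concurrency)  # at most `concurrency` cases in flight

    async def run_one(case):
        async with gate:
            # A callable is treated as a factory (fresh agent per case);
            # anything else is treated as a shared agent instance.
            agent = agent_or_factory(case) if callable(agent_or_factory) else agent_or_factory
            answer = await agent.run(case.input)
            return scorer(answer, case)

    # gather returns results in the same order as the input cases.
    return await asyncio.gather(*(run_one(c) for c in cases))


answers = {"What is 2+2?": "4", "Capital of France?": "The capital is Paris."}
cases = [
    Case("What is 2+2?", expected="4"),
    Case("Capital of France?", expected="Paris"),
]


def contains(answer, case):
    return str(case.expected).lower() in answer.lower()


passed = asyncio.run(run_all(cases, lambda case: EchoAgent(answers), contains, concurrency=2))
print(passed)  # [True, True]
```

A factory matters when the agent carries state between calls: with a scripted model, reusing one instance would let the script position from one case leak into the next.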

exact_match

exact_match(
    *, strip: bool = True, case_sensitive: bool = False
) -> Scorer

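Pass if the answer equals case.expected. With the defaults shown in the signature, surrounding whitespace is stripped (strip=True) and letter case is ignored (case_sensitive=False).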

contains

contains(*, case_sensitive: bool = False) -> Scorer

Pass if case.expected appears in the answer (case-insensitive by default).

regex_match

regex_match(pattern: str | None = None) -> Scorer

Pass if the answer matches pattern (or case.expected if omitted).
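This page does not say whether "matches" means a match anywhere in the answer, at its start, or over the whole string. The three readings give different verdicts, as the standard re module shows, so it is worth anchoring patterns explicitly (for example with ^ and $) when the difference matters.

```python
import re

answer = "The answer is 42."
pattern = r"\d+"

anywhere = bool(re.search(pattern, answer))     # digits appear somewhere in the answer
at_start = bool(re.match(pattern, answer))      # answer does not start with digits
whole = bool(re.fullmatch(pattern, answer))     # answer is more than just digits

print(anywhere, at_start, whole)  # True False False
```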

predicate

predicate(
    fn: Callable[[AgentOutcome, Case], bool],
) -> Scorer

Wrap an arbitrary (outcome, case) -> bool check.

llm_judge

llm_judge(
    model: ModelFn, *, rubric: str | None = None
) -> Scorer

LLM-as-judge: a model scores the answer's correctness/faithfulness.