Skip to content

Measuring quality (evals)

How do you know your agent is good — and that a change to the prompt or model didn't quietly make it worse? You evaluate it: run it against a set of example questions with known-good answers, and score how it does. This is often called "evals," and it's how you catch quality regressions before your users do.

A quick eval

Write some cases (an input and what you expect), run the agent over them, and get a report:

from agentix import Case, evaluate, contains

cases = [
    Case("What is 2+2?", expected="4"),
    Case("Capital of France?", expected="Paris"),
]

report = await evaluate(cases, agent, scorer=contains())
print(f"{report.passed}/{report.total} passed ({report.pass_rate:.0%})")
report.assert_pass_rate(0.9)     # raises an error if fewer than 90% pass

That last line is the trick: drop it into your test suite, and a change that drops quality below your bar fails the build — just like a normal failing test.

Scoring

A scorer decides whether one answer is good enough. Pick one that fits:

Scorer Passes when…
exact_match() the answer equals the expected text
contains() the answer contains the expected text
regex_match(...) the answer matches a pattern
predicate(fn) your own function returns True
llm_judge(...) another model judges it against a rubric

→ Runnable example: examples/17_eval.py

Loading cases from a file

Keep your test cases in a data file instead of in code — load_cases reads .jsonl, .json, or .csv:

from agentix import load_cases

cases = load_cases("tests/cases.jsonl")

Each row needs at least an input; expected, id, and any extra columns are picked up too.

Double-checking answers

Two more tools for trustworthy results, covered in Reliability: ask the model the same question several times and take the majority answer (SelfConsistencyModel), or have a second model review the final answer before it goes out (JudgeGuard).

→ Runnable example: examples/18_verification.py