Reading results

After calling results.as_table(), you'll see up to three tables. Here's what each one measures and why they're separate.


EvalOrchestrator runs two independent evaluation loops per scenario:

```text
EvalOrchestrator.run()
├── For each scenario:
│   ├── agent runs
│   │   └── sub-agents emit EvalRecords        ← Workflow Summary evaluates these
│   └── agent returns a final response string  ← Scenario Results evaluates this
└── Aggregate Metrics = rolled-up numbers from both loops
```

Scenario evaluation is the passenger’s view: did the car get from A to B? Tasks defined in EvalScenario(tasks=[...]) run against the agent’s final response string.

Workflow evaluation is the mechanic’s view: are the internals healthy? Tasks defined in AgentEvalProfile(tasks=[...]) run against the EvalRecord each sub-agent emitted during execution.

They measure different things. A scenario can pass (good final output) while workflow tasks fail — a sub-agent produced low-quality intermediate results but the final answer happened to be correct. You got lucky. Tracking both is the point.


Aggregate Metrics

Printed by results.as_table(): rolled-up numbers from both evaluation loops.

| Metric | What it measures |
| --- | --- |
| Overall Pass Rate | Mean pass rate across both the scenario and workflow loops |
| Workflow Pass Rate | Mean pass rate across all sub-agent profile evaluations |
| Scenario Pass Rate | Fraction of scenarios where every scenario task passed |
| Total Scenarios | Number of scenarios run |
| Passed Scenarios | Scenarios where every scenario task passed |

If you haven’t defined any AgentEvalProfile tasks, Workflow Pass Rate is omitted and Overall Pass Rate reflects only the scenario loop.
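
To ground those definitions, here is some back-of-the-envelope arithmetic with made-up numbers. The values, and the exact way Overall Pass Rate is aggregated, are illustrative assumptions, not framework output:

```python
# Illustrative only: hand-computing the metrics above from made-up outcomes.
total_scenarios = 3
passed_scenarios = 2                       # scenarios where every scenario task passed
scenario_pass_rate = passed_scenarios / total_scenarios               # ~0.67

# One boolean per profile-task evaluation across all sub-agent records.
workflow_outcomes = [True, True, False, True]
workflow_pass_rate = sum(workflow_outcomes) / len(workflow_outcomes)  # 0.75

# One plausible reading of "mean across both loops" (an assumption, not confirmed):
overall_pass_rate = (scenario_pass_rate + workflow_pass_rate) / 2     # ~0.71
```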


Scenario Results

One row per scenario task. This is the black-box view: output correctness only.

Tasks come from EvalScenario(tasks=[...]). They run against the string execute_agent() returned for that scenario.

```python
EvalScenario(
    id="capital_question",
    initial_query="What is the capital of France?",
    tasks=[
        AssertionTask(
            id="response_not_empty",
            context_path="response",
            operator=ComparisonOperator.IsString,
            expected_value=True,
        )
    ],
)
```

The response context key is populated automatically from the string your agent returns. No manual wiring needed.
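
Conceptually, the scenario loop builds an evaluation context from the returned string and resolves each task's context_path against it. The snippet below is a plain-Python illustration of that resolution, not framework code:

```python
# What the framework does for you, spelled out by hand (illustrative only).
context = {"response": "The capital of France is Paris."}  # built from the agent's return value

value = context["response"]        # context_path="response"
passed = isinstance(value, str)    # ComparisonOperator.IsString with expected_value=True
print(passed)                      # True
```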


Workflow Summary

Printed when you call results.as_table(show_workflow=True). One row per profile task evaluated on each EvalRecord emitted by a sub-agent, across all scenarios.

Tasks come from AgentEvalProfile(tasks=[...]), the profile attached to your ScouterQueue. Each sub-agent that calls span.add_queue_item(alias, record) during execution produces rows here.

| Column | What it shows |
| --- | --- |
| Scenario ID | Which scenario produced this record |
| Record UID | Last 8 chars of the record UUID (enough to distinguish records within a scenario) |
| Alias | The sub-agent name (matches what you passed to span.add_queue_item) |
| Task | Which profile task was evaluated |
| Passed | Whether that task passed for this record |
| Pass Rate | Pass rate across all tasks for this record |

```python
# During agent execution, a sub-agent emits a record:
span.add_queue_item(
    "retriever",
    EvalRecord(context={"results": {"count": 5, "source": "arxiv"}}),
)

# AgentEvalProfile defines what to check on that record:
AgentEvalProfile(
    alias="retriever",
    tasks=[
        AssertionTask(
            id="has_results",
            context_path="results.count",
            operator=ComparisonOperator.GreaterThanOrEqual,
            expected_value=1,
        )
    ],
)
```
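
The dotted context_path walks into the record's nested context. Here is a plain-Python sketch of how "results.count" resolves for the record above; it is illustrative, not the framework's actual resolver:

```python
# Resolving context_path="results.count" against the record's context by hand.
record_context = {"results": {"count": 5, "source": "arxiv"}}

value = record_context
for key in "results.count".split("."):
    value = value[key]             # -> 5

passed = value >= 1                # GreaterThanOrEqual with expected_value=1 -> True
```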

With 3 scenarios and 2 sub-agents each emitting 1 record with 2 tasks, you get 12 Workflow Summary rows.
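
That row count is just the product of the moving parts; a quick sanity check using the numbers from the sentence above:

```python
# Workflow Summary rows = scenarios x sub-agents x records per sub-agent x profile tasks.
scenarios, sub_agents, records_each, profile_tasks = 3, 2, 1, 2
rows = scenarios * sub_agents * records_each * profile_tasks   # 12
```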


ScenarioEvalResults can be saved, loaded, and compared across runs:

```python
# Save a baseline after a known-good run
results.save("baseline_v1.json")

# Later — after a model update, prompt change, etc.
baseline = ScenarioEvalResults.load("baseline_v1.json")
new_results = orch.run()

comparison = new_results.compare_to(baseline, regression_threshold=0.05)
comparison.as_table()

if comparison.regressed:
    print(f"Regressed on: {comparison.regressed_aliases}")
    raise SystemExit(1)
```

regression_threshold is the minimum pass-rate drop that counts as a regression; the default of 0.05 corresponds to a five-percentage-point drop. ScenarioComparisonResults also serializes to JSON.
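
As a quick illustration of how that threshold behaves, with made-up numbers and assuming the drop is measured as a simple difference in pass rates:

```python
# Illustrative only: what counts as a regression under the default threshold.
baseline_rate, new_rate = 0.90, 0.84
drop = baseline_rate - new_rate            # ~0.06, a six-point drop
regressed = drop >= 0.05                   # True: at or above the 0.05 threshold
```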