Agent evaluation spec

The Agent evaluation component provides two complementary workflows for assessing the quality, correctness, and safety of LLM-powered applications:

  • Online evaluation — Real-time sampling and asynchronous evaluation of production inference traffic
  • Offline evaluation — Batch evaluation of datasets for pre-deployment testing, regression analysis, and benchmarking

Both modes share a common task model (AssertionTask, LLMJudgeTask) and a common context system for extracting and passing data between evaluation steps.


Online evaluation integrates directly into the production inference path. An AgentEvalProfile is registered with the Scouter server, defining what to evaluate and how. At inference time, the application inserts EvalRecord objects into a ScouterQueue. The server samples records based on a configurable sample_ratio, then runs evaluation tasks asynchronously via the Agent Poller background worker.

1. Inference request
2. Application inserts EvalRecord into ScouterQueue
3. Scouter server samples based on sample_ratio
4. Agent Poller worker picks up sampled records
5. Evaluation tasks run (AssertionTask / LLMJudgeTask)
6. Alert conditions checked → dispatch if threshold crossed

An EvalRecord captures the context of a single inference call. It accepts a Python dict or a Pydantic BaseModel as context, and an optional Prompt.

from scouter import EvalRecord

record = EvalRecord(
    context={
        "input": user_message,
        "response": model_response,
        "model": "gpt-4o",
    }
)
queue["my-genai-profile"].insert(record)
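Since context also accepts a Pydantic BaseModel, the same record can be built from a typed model. A minimal sketch, assuming an application-defined InferenceContext model (not part of the library):

from pydantic import BaseModel
from scouter import EvalRecord

class InferenceContext(BaseModel):
    # Illustrative, application-defined schema for a single inference call
    input: str
    response: str
    model: str

record = EvalRecord(
    context=InferenceContext(
        input=user_message,
        response=model_response,
        model="gpt-4o",
    )
)
queue["my-genai-profile"].insert(record)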

The profile defines the evaluation configuration, including tasks, alert thresholds, and sampling ratio.

from scouter.drift import AgentEvalProfile
from scouter.evaluate import AgentEvalConfig
from scouter.alert import ConsoleDispatchConfig
from scouter.types import CommonCrons

profile = AgentEvalProfile(
    config=AgentEvalConfig(
        space="my-space",
        name="my-model",
        version="1.0.0",
        sample_ratio=0.1,  # Evaluate 10% of production traffic
        alert_config=ConsoleDispatchConfig(
            schedule=CommonCrons.EveryHour,
        ),
    ),
    tasks=[assertion_task, judge_task],
)

Offline evaluation runs batch evaluation against an EvalDataset. This is useful for pre-deployment testing, regression analysis, and comparing model versions.

from scouter.evaluate import EvalDataset, AgentEvalConfig

dataset = EvalDataset(
    config=AgentEvalConfig(
        space="my-space",
        name="my-model",
        version="1.0.0",
    ),
    tasks=[assertion_task, judge_task],
    records=[record_1, record_2, ...],
)
results = dataset.evaluate()
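
As a sketch of how a dataset might be assembled, records can be built from a held-out test set with the same EvalRecord constructor used on the online path; test_cases below is a placeholder for your own data, not a library type:

from scouter import EvalRecord

# Placeholder test set; in practice this comes from your own evaluation data
test_cases = [
    {"input": "Summarize the order status.", "response": '{"status": "shipped"}'},
    {"input": "Reset my password.", "response": '{"status": "reset_link_sent"}'},
]

records = [
    EvalRecord(context={"input": case["input"], "response": case["response"]})
    for case in test_cases
]

The resulting records list is passed as the records argument to EvalDataset above.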

AssertionTask provides deterministic, rule-based evaluation. It supports 50+ ComparisonOperator values covering string, numeric, collection, and JSON checks.

from scouter.evaluate import AssertionTask, ComparisonOperator

task = AssertionTask(
    task_id="check_json_response",
    context_path="response",  # dot-notation field extraction
    operator=ComparisonOperator.IsJson,
    condition=True,  # acts as a gate: skip downstream tasks on failure
)

Key features:

  • context_path: Dot-notation path to extract a field from the record context (e.g. "response.choices.0.message.content")
  • condition=True: When set, a failed assertion skips all downstream tasks that depend on this one
  • Template variable substitution: use ${field_name} in expected values to reference context fields (see the example below)
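
A sketch combining template substitution with a dot-notation path; ComparisonOperator.Equals is assumed to be among the available string operators, and the response.metadata.model path is illustrative:

from scouter.evaluate import AssertionTask, ComparisonOperator

task = AssertionTask(
    task_id="check_model_echo",
    context_path="response.metadata.model",  # illustrative dot-notation path
    operator=ComparisonOperator.Equals,  # assumed string-equality operator
    expected="${model}",  # resolved against the record context at runtime
)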

LLMJudgeTask provides LLM-powered semantic evaluation. It sends the record context and a prompt to an LLM and extracts a structured score using Pydantic models.

from scouter.evaluate import LLMJudgeTask
from scouter.agent import Prompt

task = LLMJudgeTask(
    task_id="judge_helpfulness",
    prompt=Prompt(
        message="Rate the helpfulness of the following response on a scale of 1–5.\n\nUser: ${input}\nResponse: ${response}",
    ),
    context_path="score",  # extract 'score' field from LLM response
    depends_on=["check_json_response"],  # only run if assertion gate passed
)
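
Since scores are extracted as structured output via Pydantic models, the judge's response schema might look like the sketch below; how the schema is attached to the Prompt is an assumption, not a documented parameter:

from pydantic import BaseModel, Field

class HelpfulnessScore(BaseModel):
    # Illustrative structured output the judge LLM is asked to return;
    # context_path="score" then extracts the score field from the parsed response
    score: int = Field(ge=1, le=5, description="Helpfulness rating from 1 to 5")
    reasoning: str = Field(description="Short justification for the score")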

Supported LLM providers:

  • OpenAI (via OPENAI_API_KEY)
  • Anthropic (via ANTHROPIC_API_KEY)
  • Google Gemini (via GOOGLE_API_KEY / service account)
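
Provider credentials are read from the environment. A minimal sketch, set here via os.environ for illustration (in practice they are usually exported in the deployment environment):

import os

# Set before running evaluation; only the provider you use is required
os.environ["OPENAI_API_KEY"] = "sk-..."
# os.environ["ANTHROPIC_API_KEY"] = "..."
# os.environ["GOOGLE_API_KEY"] = "..."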

Tasks can declare depends_on to form a dependency graph. Each task receives the base record context plus the outputs of all declared dependencies.

tasks = [
    AssertionTask(task_id="validate_format", context_path="response", operator=ComparisonOperator.IsJson, condition=True),
    LLMJudgeTask(task_id="judge_quality", depends_on=["validate_format"], ...),
    AssertionTask(task_id="check_score", context_path="judge_quality.score", operator=ComparisonOperator.GreaterThan, expected=3, depends_on=["judge_quality"]),
]
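
As an illustration (not library output), the context seen by check_score is the base record context merged with the output of its dependency, keyed by task_id, which is why its context_path can be "judge_quality.score":

# Illustrative shape of the merged context passed to "check_score"
downstream_context = {
    "input": "...",                  # from the original EvalRecord
    "response": "{...}",             # from the original EvalRecord
    "judge_quality": {"score": 4},   # output of the judge task it depends on
}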

AgentPollerSettings controls the server-side Agent Poller worker that processes queued evaluation records.

| Setting | Env Var | Default | Description |
| --- | --- | --- | --- |
| Worker count | GENAI_WORKER_COUNT | 2 | Number of concurrent evaluation workers |
| Max retries | GENAI_MAX_RETRIES | 3 | Retries on task failure |
| Trace wait timeout | GENAI_TRACE_WAIT_TIMEOUT_SECS | 10 | Seconds to wait for an associated trace before giving up |
| Trace backoff | GENAI_TRACE_BACKOFF_MILLIS | 100 | Delay between trace polling attempts (milliseconds) |
| Reschedule delay | GENAI_TRACE_RESCHEDULE_DELAY_SECS | 30 | Delay before rescheduling a failed evaluation task (seconds) |
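
A sketch of how these values resolve from the environment, mirroring the defaults in the table; the actual server-side parsing of AgentPollerSettings may differ:

import os

# Illustrative env-var resolution mirroring the table defaults
worker_count = int(os.getenv("GENAI_WORKER_COUNT", "2"))
max_retries = int(os.getenv("GENAI_MAX_RETRIES", "3"))
trace_wait_timeout_secs = int(os.getenv("GENAI_TRACE_WAIT_TIMEOUT_SECS", "10"))
trace_backoff_millis = int(os.getenv("GENAI_TRACE_BACKOFF_MILLIS", "100"))
reschedule_delay_secs = int(os.getenv("GENAI_TRACE_RESCHEDULE_DELAY_SECS", "30"))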

Version: 1.0 | Last Updated: 2026-03-01 | Component Owner: Steven Forrester