Agent evaluation spec
Overview
The Agent evaluation component provides two complementary workflows for assessing the quality, correctness, and safety of LLM-powered applications:
- Online evaluation — Real-time sampling and asynchronous evaluation of production inference traffic
- Offline evaluation — Batch evaluation of datasets for pre-deployment testing, regression analysis, and benchmarking
Both modes share a common task model (AssertionTask, LLMJudgeTask) and a common context system for extracting and passing data between evaluation steps.
Online Evaluation
How it works
Online evaluation integrates directly into the production inference path. An AgentEvalProfile is registered with the Scouter server, defining what to evaluate and how. At inference time, the application inserts EvalRecord objects into a ScouterQueue. The server samples records based on a configurable sample_ratio, then runs evaluation tasks asynchronously via the Agent Poller background worker.
```
Inference request
      │
      ▼
Application inserts EvalRecord into ScouterQueue
      │
      ▼
Scouter server samples based on sample_ratio
      │
      ▼
Agent Poller worker picks up sampled records
      │
      ▼
Evaluation tasks run (AssertionTask / LLMJudgeTask)
      │
      ▼
Alert conditions checked → dispatch if threshold crossed
```

EvalRecord
An EvalRecord captures the context of a single inference call. It accepts a Python dict or Pydantic BaseModel as context, and an optional Prompt.
```python
from scouter import EvalRecord

record = EvalRecord(
    context={
        "input": user_message,
        "response": model_response,
        "model": "gpt-4o",
    }
)

queue["my-genai-profile"].insert(record)
```
AgentEvalProfile

The profile defines the evaluation configuration, including tasks, alert thresholds, and the sampling ratio.
```python
from scouter.drift import AgentEvalProfile
from scouter.evaluate import AgentEvalConfig
from scouter.alert import ConsoleDispatchConfig
from scouter.types import CommonCrons

profile = AgentEvalProfile(
    config=AgentEvalConfig(
        space="my-space",
        name="my-model",
        version="1.0.0",
        sample_ratio=0.1,  # Evaluate 10% of production traffic
        alert_config=ConsoleDispatchConfig(
            schedule=CommonCrons.EveryHour,
        ),
    ),
    tasks=[assertion_task, judge_task],
)
```
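To make the effect of sample_ratio concrete, here is a back-of-envelope sketch; the traffic figure is an illustrative assumption, not part of the spec:

```python
# With sample_ratio=0.1, roughly one in ten production requests is evaluated.
requests_per_hour = 1_000                                 # illustrative traffic volume
sample_ratio = 0.1                                        # from the profile above
expected_evaluations = requests_per_hour * sample_ratio   # ~100 evaluated records per hour
```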
Offline Evaluation

Offline evaluation runs batch evaluation against an EvalDataset. This is useful for pre-deployment testing, regression analysis, and comparing model versions.
Workflow
```python
from scouter.evaluate import EvalDataset, AgentEvalConfig

dataset = EvalDataset(
    config=AgentEvalConfig(
        space="my-space",
        name="my-model",
        version="1.0.0",
    ),
    tasks=[assertion_task, judge_task],
    records=[record_1, record_2, ...],
)

results = dataset.evaluate()
```
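Offline records are typically built from previously logged interactions. A minimal sketch, assuming the log rows are plain dicts; logged_interactions and its field names are illustrative:

```python
from scouter import EvalRecord

# One EvalRecord per logged interaction; the context keys should match
# whatever the tasks reference via context_path and ${...} templates.
records = [
    EvalRecord(context={"input": row["input"], "response": row["response"]})
    for row in logged_interactions
]
```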
Evaluation Tasks

AssertionTask
AssertionTask performs deterministic, rule-based evaluation. It supports 50+ ComparisonOperator values covering string, numeric, collection, and JSON checks.
```python
from scouter.evaluate import AssertionTask, ComparisonOperator

task = AssertionTask(
    task_id="check_json_response",
    context_path="response",             # dot-notation field extraction
    operator=ComparisonOperator.IsJson,
    condition=True,                      # acts as a gate: skip downstream tasks on failure
)
```

Key features:
- context_path: Dot-notation path to extract a field from the record context (e.g. "response.choices.0.message.content")
- condition=True: When set, a failed assertion skips all downstream tasks that depend on this one
- Template variable substitution: use ${field_name} in expected values to reference context fields (see the sketch below)
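A sketch of template substitution, comparing one context field against another. The token-usage field names are illustrative assumptions; GreaterThan and expected are the operator and parameter used elsewhere in this spec:

```python
from scouter.evaluate import AssertionTask, ComparisonOperator

# Passes when limits.max_tokens is greater than usage.completion_tokens
# in the record context.
task = AssertionTask(
    task_id="check_token_budget",
    context_path="limits.max_tokens",
    operator=ComparisonOperator.GreaterThan,
    expected="${usage.completion_tokens}",  # resolved from the record context at evaluation time
)
```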
LLMJudgeTask
LLMJudgeTask performs LLM-powered semantic evaluation. It sends the context and a prompt to an LLM and extracts a structured score using Pydantic models.
```python
from scouter.evaluate import LLMJudgeTask
from scouter.agent import Prompt

task = LLMJudgeTask(
    task_id="judge_helpfulness",
    prompt=Prompt(
        message=(
            "Rate the helpfulness of the following response on a scale of 1–5.\n\n"
            "User: ${input}\nResponse: ${response}"
        ),
    ),
    context_path="score",                # extract 'score' field from the LLM response
    depends_on=["check_json_response"],  # only run if the assertion gate passed
)
```
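The structured score mentioned above is modelled with Pydantic. A sketch of what such a response model might look like; the HelpfulnessScore name and fields are illustrative, and this spec does not show how the model is attached to the Prompt:

```python
from pydantic import BaseModel, Field

class HelpfulnessScore(BaseModel):
    score: int = Field(ge=1, le=5)  # matches the 1–5 scale requested in the prompt
    reasoning: str                  # brief justification from the judge
```

With a response model shaped like this, context_path="score" extracts the numeric field from the judge's structured output.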
Supported LLM providers:

- OpenAI (via OPENAI_API_KEY)
- Anthropic (via ANTHROPIC_API_KEY)
- Google Gemini (via GOOGLE_API_KEY / service account)
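For example, an offline run with an OpenAI-backed judge only needs the standard credential in the environment before dataset.evaluate() is called; this assumes evaluation runs in the local process, and the key value is a placeholder:

```python
import os

# The judge authenticates with the provider via its standard environment variable.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # placeholder; prefer a secret manager in practice
```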
Task Dependencies
Tasks can declare depends_on to form a dependency graph. Each task receives the base record context plus the outputs of all declared dependencies.
```python
tasks = [
    AssertionTask(
        task_id="validate_format",
        context_path="response",
        operator=ComparisonOperator.IsJson,
        condition=True,
    ),
    LLMJudgeTask(task_id="judge_quality", depends_on=["validate_format"], ...),
    AssertionTask(
        task_id="check_score",
        context_path="judge_quality.score",
        operator=ComparisonOperator.GreaterThan,
        expected=3,
        depends_on=["judge_quality"],
    ),
]
```
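As an illustration of the merged context, the check_score task above would see something like the following; the shape is inferred from the context_path values in this spec rather than from a documented payload:

```python
# Illustrative context visible to "check_score": base record fields plus the
# output of its dependency, keyed by that dependency's task_id.
merged_context = {
    "input": "...",                 # from the original EvalRecord
    "response": "...",
    "judge_quality": {"score": 4},  # output of the judge_quality dependency
}
```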
Background Worker: Agent Poller

AgentPollerSettings controls the server-side worker that processes queued evaluation records.
| Setting | Env Var | Default | Description |
|---|---|---|---|
| Worker count | GENAI_WORKER_COUNT | 2 | Number of concurrent evaluation workers |
| Max retries | GENAI_MAX_RETRIES | 3 | Retries on task failure |
| Trace wait timeout | GENAI_TRACE_WAIT_TIMEOUT_SECS | 10 | Seconds to wait for an associated trace before giving up |
| Trace backoff | GENAI_TRACE_BACKOFF_MILLIS | 100 | Milliseconds to wait between trace polling attempts |
| Reschedule delay | GENAI_TRACE_RESCHEDULE_DELAY_SECS | 30 | Seconds to wait before rescheduling a failed evaluation task |
Version: 1.0 Last Updated: 2026-03-01 Component Owner: Steven Forrester