Technical Component Specification: GenAI Evaluation

Overview

The GenAI evaluation component provides two complementary workflows for assessing the quality, correctness, and safety of LLM-powered applications:

  • Online evaluation — Real-time sampling and asynchronous evaluation of production inference traffic
  • Offline evaluation — Batch evaluation of datasets for pre-deployment testing, regression analysis, and benchmarking

Both modes share a common task model (AssertionTask, LLMJudgeTask) and a common context system for extracting and passing data between evaluation steps.


Online Evaluation

How it works

Online evaluation integrates directly into the production inference path. A GenAIEvalProfile is registered with the Scouter server, defining what to evaluate and how. At inference time, the application inserts GenAIEvalRecord objects into a ScouterQueue. The server samples records based on a configurable sample_ratio, then runs evaluation tasks asynchronously via the GenAI poller background worker.

  1. Inference request arrives
  2. Application inserts a GenAIEvalRecord into the ScouterQueue
  3. Scouter server samples records based on sample_ratio
  4. GenAI poller worker picks up sampled records
  5. Evaluation tasks run (AssertionTask / LLMJudgeTask)
  6. Alert conditions are checked → dispatch if a threshold is crossed
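
The sampling step (step 3) can be sketched as a simple Bernoulli draw against sample_ratio. This is an illustration only; the server's actual sampling logic is not specified here.

```python
import random

def should_sample(sample_ratio: float) -> bool:
    """Keep a record with probability `sample_ratio` (Bernoulli sampling)."""
    return random.random() < sample_ratio

# With sample_ratio=0.1, roughly 10% of records are evaluated
random.seed(0)
kept = sum(should_sample(0.1) for _ in range(10_000))
print(f"{kept} of 10,000 records sampled")
```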

GenAIEvalRecord

A GenAIEvalRecord captures the context of a single inference call. It accepts a Python dict or Pydantic BaseModel as context, and an optional Prompt.

from scouter import GenAIEvalRecord

record = GenAIEvalRecord(
    context={
        "input": user_message,
        "response": model_response,
        "model": "gpt-4o",
    }
)

queue["my-genai-profile"].insert(record)

GenAIEvalProfile

The profile defines the evaluation configuration, including tasks, alert thresholds, and sampling ratio.

from scouter.evaluate import GenAIEvalProfile, GenAIEvalConfig
from scouter.alert import ConsoleDispatchConfig
from scouter.types import CommonCrons

profile = GenAIEvalProfile(
    config=GenAIEvalConfig(
        space="my-space",
        name="my-model",
        version="1.0.0",
        sample_ratio=0.1,           # Evaluate 10% of production traffic
        alert_config=ConsoleDispatchConfig(
            schedule=CommonCrons.EveryHour,
        ),
    ),
    tasks=[assertion_task, judge_task],
)

Offline Evaluation

Offline evaluation runs batch evaluation against a GenAIEvalDataset. This is useful for pre-deployment testing, regression analysis, and comparing model versions.

Workflow

from scouter.evaluate import GenAIEvalDataset, GenAIEvalConfig

dataset = GenAIEvalDataset(
    config=GenAIEvalConfig(
        space="my-space",
        name="my-model",
        version="1.0.0",
    ),
    tasks=[assertion_task, judge_task],
    records=[record_1, record_2, ...],
)

results = dataset.evaluate()

Evaluation Tasks

AssertionTask

Deterministic, rule-based evaluation. Supports 50+ ComparisonOperator values covering string, numeric, collection, and JSON checks.

from scouter.evaluate import AssertionTask, ComparisonOperator

task = AssertionTask(
    task_id="check_json_response",
    context_path="response",                  # dot-notation field extraction
    operator=ComparisonOperator.IsJson,
    condition=True,                           # acts as a gate: skip downstream tasks on failure
)
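
To illustrate what a deterministic check such as IsJson does, the hypothetical helper below (not part of the scouter API) tests whether a value is a string containing valid JSON:

```python
import json

def is_json(value: object) -> bool:
    """Return True if `value` is a string containing valid JSON."""
    if not isinstance(value, str):
        return False
    try:
        json.loads(value)
        return True
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False

print(is_json('{"ok": true}'))  # True
print(is_json("not json"))      # False
```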

Key features:

  • context_path: Dot-notation path to extract a field from the record context (e.g. "response.choices.0.message.content")
  • condition=True: When set, a failed assertion skips all downstream tasks that depend on this one
  • Template variable substitution: Use ${field_name} in expected values to reference context fields
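
The dot-notation extraction can be sketched as follows. This is an illustrative re-implementation, assuming numeric segments index into lists:

```python
from typing import Any

def extract(context: dict, path: str) -> Any:
    """Walk a dot-notation path; numeric segments index into lists."""
    value: Any = context
    for segment in path.split("."):
        if isinstance(value, list):
            value = value[int(segment)]
        else:
            value = value[segment]
    return value

ctx = {"response": {"choices": [{"message": {"content": "hello"}}]}}
print(extract(ctx, "response.choices.0.message.content"))  # hello
```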

LLMJudgeTask

LLM-powered semantic evaluation. Sends context and a prompt to an LLM and extracts a structured score using Pydantic models.

from scouter.evaluate import LLMJudgeTask
from scouter.genai import Prompt

task = LLMJudgeTask(
    task_id="judge_helpfulness",
    prompt=Prompt(
        message="Rate the helpfulness of the following response on a scale of 1–5.\n\nUser: ${input}\nResponse: ${response}",
    ),
    context_path="score",                    # extract 'score' field from LLM response
    depends_on=["check_json_response"],      # only run if assertion gate passed
)
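
The ${input} and ${response} placeholders in the prompt are presumably filled from the record context. Python's string.Template uses the same ${name} syntax and can illustrate the substitution (a sketch, not the library's actual renderer):

```python
from string import Template

def render(message: str, context: dict) -> str:
    """Fill ${field} placeholders with values from the record context."""
    return Template(message).safe_substitute(context)

message = "User: ${input}\nResponse: ${response}"
print(render(message, {"input": "How do I reset my password?",
                       "response": "Click 'Forgot password' on the login page."}))
```

safe_substitute leaves unknown placeholders untouched rather than raising, which is the forgiving behavior you generally want when a context field is missing.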

Supported LLM providers:

  • OpenAI (via OPENAI_API_KEY)
  • Anthropic (via ANTHROPIC_API_KEY)
  • Google Gemini (via GOOGLE_API_KEY / service account)


Task Dependencies

Tasks can declare depends_on to form a dependency graph. Each task receives the base record context plus the outputs of all declared dependencies.

tasks = [
    AssertionTask(task_id="validate_format", context_path="response", operator=ComparisonOperator.IsJson, condition=True),
    LLMJudgeTask(task_id="judge_quality", depends_on=["validate_format"], ...),
    AssertionTask(task_id="check_score", context_path="judge_quality.score", operator=ComparisonOperator.GreaterThan, expected=3, depends_on=["judge_quality"]),
]
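
The gating behavior can be sketched with a small scheduler. This is a hypothetical model for illustration only: tasks are plain dicts with a `run` callable, whereas the real AssertionTask/LLMJudgeTask objects are opaque here.

```python
def run_tasks(tasks: list[dict], context: dict):
    """Run tasks in dependency order; when a gating task (condition=True)
    fails, every task depending on it, directly or transitively, is skipped."""
    results: dict[str, object] = {}   # outputs of tasks that ran
    resolved: set[str] = set()        # tasks that ran or were skipped
    blocked: set[str] = set()         # failed gates and their dependents
    pending = list(tasks)
    while pending:
        # Pick any task whose declared dependencies are all resolved
        task = next(t for t in pending
                    if all(d in resolved for d in t.get("depends_on", ())))
        pending.remove(task)
        tid = task["task_id"]
        resolved.add(tid)
        if any(d in blocked for d in task.get("depends_on", ())):
            blocked.add(tid)          # skip: an upstream gate failed
            continue
        passed, output = task["run"](context, results)
        results[tid] = output
        if task.get("condition") and not passed:
            blocked.add(tid)          # gate failed: block dependents
    return results, blocked

tasks = [
    {"task_id": "validate_format", "condition": True,
     "run": lambda ctx, deps: (isinstance(ctx["response"], str), None)},
    {"task_id": "judge_quality", "depends_on": ["validate_format"],
     "run": lambda ctx, deps: (True, {"score": 4})},
]
results, skipped = run_tasks(tasks, {"response": '{"ok": true}'})
print(results.get("judge_quality"))  # {'score': 4}
```

Note that each task's `run` receives both the base context and the accumulated results of earlier tasks, mirroring how dependency outputs are passed in the real system.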

Background Worker: GenAI Poller

The GenAIPollerSettings controls the server-side worker that processes queued evaluation records.

| Setting             | Env Var                           | Default | Description                                                |
|---------------------|-----------------------------------|---------|------------------------------------------------------------|
| Worker count        | GENAI_WORKER_COUNT                | 2       | Number of concurrent evaluation workers                    |
| Max retries         | GENAI_MAX_RETRIES                 | 3       | Retries on task failure                                    |
| Trace wait timeout  | GENAI_TRACE_WAIT_TIMEOUT_SECS     | 10      | Seconds to wait for an associated trace before giving up   |
| Trace backoff       | GENAI_TRACE_BACKOFF_MILLIS        | 100     | Delay (in milliseconds) between trace polling attempts     |
| Reschedule delay    | GENAI_TRACE_RESCHEDULE_DELAY_SECS | 30      | Delay (in seconds) before rescheduling a failed evaluation task |
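
These settings follow a common env-var-with-default pattern, which might be read like this (a sketch, not the actual GenAIPollerSettings implementation):

```python
import os

def int_env(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    return int(os.environ.get(name, default))

worker_count = int_env("GENAI_WORKER_COUNT", 2)
max_retries = int_env("GENAI_MAX_RETRIES", 3)
trace_wait_timeout_secs = int_env("GENAI_TRACE_WAIT_TIMEOUT_SECS", 10)
print(worker_count, max_retries, trace_wait_timeout_secs)
```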

Version: 1.0
Last Updated: 2026-03-01
Component Owner: Steven Forrester