Observability and traces

Agents are multi-step pipelines. The final output tells you what happened. Traces tell you how. Without them, you can’t answer: did the retriever call the right API? Did the agent retry too many times? Was the right model used? Did latency exceed the SLA? Did total token usage blow the budget?

Scouter does not make every evaluation depend on tracing. span.attach_eval() creates an EvalRecord and records correlation metadata. The record still follows the normal online eval path unless its registered profile includes TraceAssertionTask.

When a profile does include trace assertions, spans become part of the eval input. TraceAssertionTask runs assertions directly on OpenTelemetry span properties, so Scouter uses the inbox pattern below to wait for the specific anchor span before the poller runs that workflow.

For tracing setup and instrumentation, see Distributed tracing. For the full task type reference, see Building blocks.

For system-wide ingestion pipeline details, see Architecture.


OTEL compatibility and the Rust/Python boundary

Scouter’s tracing is a standard OpenTelemetry implementation, not a proprietary format. ScouterInstrumentor implements the OTEL BaseInstrumentor interface and registers as the global TracerProvider. Any OTEL auto-instrumentation library (httpx, FastAPI, Starlette) works automatically once the instrumentor is active.

The architecture splits across two languages. Python owns the API surface: ScouterInstrumentor, get_tracer(), span context managers, attribute setting. Rust owns the performance path: span lifecycle, batching, export, and serialization. The boundary is PyO3. When you call tracer.start_as_current_span(), Python creates a thin wrapper around a Rust ActiveSpan. Attribute writes, event recording, and status updates cross the FFI boundary into Rust. The GIL is released during export.

graph LR
    subgraph "Python (your application)"
        SI["ScouterInstrumentor"] --> ST["ScouterTracer"]
        ST --> SP["ScouterSpan (wrapper)"]
    end

    subgraph "Rust (PyO3)"
        SP --> RS["ActiveSpan"]
        RS --> BP["BatchSpanProcessor"]
        BP --> SE["ScouterSpanExporter"]
        BP --> OE["OtlpSpanExporter (optional)"]
    end

    subgraph "Export"
        SE -->|"gRPC / HTTP / Kafka / RabbitMQ / Redis"| SV["Scouter server"]
        OE -->|"OTLP"| EXT["External collector"]
    end

Every span goes to the Scouter server. Spans can optionally also be exported to an external OTEL collector (Jaeger, Datadog, Grafana Tempo) via HttpSpanExporter or GrpcSpanExporter. This dual-export design means you don’t have to choose between Scouter’s evaluation pipeline and your existing observability stack.

W3C traceparent and baggage propagators are registered globally during instrumentation. Distributed traces across services work without additional configuration. ScouterTracingMiddleware (ASGI middleware) extracts incoming traceparent headers and starts server spans automatically for FastAPI/Starlette applications.

Outbound propagation is manual (get_tracing_headers_from_current_span()) or automatic via httpx auto-instrumentation.
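To make the propagated context concrete: a version-00 W3C traceparent header has four hyphen-separated hex fields (version, trace ID, parent span ID, flags). The sketch below builds and parses one using only the standard library. The helper names are hypothetical and are not part of Scouter's API; in practice the registered propagators do this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex (W3C Trace Context, version 00)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    span_id: str
    sampled: bool


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[TraceParent]:
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the spec.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


def new_child_header(incoming: str) -> Optional[str]:
    """Keep the caller's trace_id and mint a fresh span_id for the outbound hop."""
    parent = parse_traceparent(incoming)
    if parent is None:
        return None
    return build_traceparent(parent.trace_id, secrets.token_hex(8), parent.sampled)
```

The trace ID stays constant across services while each hop contributes its own span ID, which is what lets spans from different processes be stitched into one trace.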

Beyond standard OTEL attributes, Scouter adds:

  • set_input(value) / set_output(value): capture function arguments and return values (truncated to configurable max length)
  • set_entity(entity_id): link the span to an evaluation entity (profile UID)
  • set_tag(key, value): indexed tags stored in PostgreSQL for fast lookup
  • attach_eval(profile_uid, context, *, record_id, session_id, media, tags): insert an EvalRecord with trace correlation, setting trace_id and span_id from the active span

For the full instrumentor API, see Instrumentor guide.

Instrumentation approaches: where Scouter differs

Most LLM observability platforms auto-instrument by patching library internals at runtime. Datadog’s ddtrace wraps methods like openai.resources.chat.Completions.create directly. MLflow’s autologging uses the gorilla library to replace SDK methods in-place. Arize Phoenix’s OpenInference project uses wrapt.wrap_function_wrapper on OpenAI, Anthropic, and LangChain internals (it outputs OTEL spans, but the capture mechanism is still patching). Langfuse started with drop-in client wrappers (from langfuse.openai import OpenAI), though their SDK v3 has rebuilt on OTEL internally and now also accepts raw OTLP.

Patching works. It gives you zero-config tracing for supported libraries, and for prototyping that’s a real advantage. But it couples your observability to specific library versions. MLflow’s Anthropic support, for example, is explicitly documented as compatible with versions 0.43.1 through 0.76.0. When Anthropic restructures their SDK or LangChain refactors its chain execution path (both have happened), the patches break and you wait for the vendor to update. The agent ecosystem moves fast enough that this isn’t a theoretical concern.

Scouter doesn’t auto-instrument LLM calls. It provides a standard OTEL TracerProvider and expects your application (or your framework’s native OTEL support) to emit spans with gen_ai.* attributes. The contract is the OTEL spec, not any library’s internal API. Framework upgrades don’t break tracing. A new framework works on day one if it emits standard spans.

| | Auto-instrumentation (patching) | OTEL-native (Scouter’s approach) |
| --- | --- | --- |
| Setup effort | Zero-config for supported libraries | You set span attributes, or rely on framework OTEL support |
| Library coupling | Patches target specific SDK methods and versions | None. OTEL spec is the only contract. |
| Version resilience | Breaks when patched internals change | Survives library upgrades |
| Coverage | Captures all calls to supported libraries automatically | Only captures what you (or your framework) instruments |
| Multi-framework | Each framework needs its own patch set | Works across any OTEL-compliant framework |
| Maintenance burden | Vendor must track every SDK release | Zero per-framework maintenance |

The trade-off is real. If your framework doesn’t emit OTEL spans natively yet, you write a few lines of attribute-setting per LLM call. For production systems where you control the instrumentation, we think that’s the right trade. For quick prototyping against a single provider, auto-instrumentation is faster to start with. Both are valid choices depending on where you are in the lifecycle.
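As a rough idea of what "a few lines of attribute-setting" can look like, the sketch below collects the GenAI attributes for one chat call into a plain dict. The helper name and the example values are hypothetical; only the gen_ai.* keys come from the OTEL GenAI conventions. With a span in scope, the dict would be applied through the standard OTEL span API (for example, one set_attribute call per entry).

```python
from typing import Any, Dict, Optional


def gen_ai_chat_attributes(
    provider: str,
    request_model: str,
    input_tokens: int,
    output_tokens: int,
    response_model: Optional[str] = None,
    temperature: Optional[float] = None,
) -> Dict[str, Any]:
    """Build a flat dict of OTEL GenAI attributes for one chat call."""
    attrs: Dict[str, Any] = {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": request_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    # Optional attributes are omitted entirely rather than set to None.
    if response_model is not None:
        attrs["gen_ai.response.model"] = response_model
    if temperature is not None:
        attrs["gen_ai.request.temperature"] = temperature
    return attrs


# Hypothetical values, for illustration only.
attrs = gen_ai_chat_attributes("openai", "gpt-4o", input_tokens=120, output_tokens=48)
```

Keeping the mapping in one helper means a provider or SDK change touches a single function instead of every call site.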


Scouter implements the OpenTelemetry GenAI semantic conventions across its entire storage and query layer. The GenAI span schema stores ~45 columns that map directly to the OTEL GenAI specification, so spans from any agentic framework that emits OTEL-compatible GenAI attributes are stored without translation.

If your framework (LangChain, CrewAI, Google ADK, or anything else) produces standard gen_ai.* span attributes, Scouter stores them natively in a columnar format optimized for analytical queries. No attribute flattening, no JSON blob storage, no loss of structure.

The attributes stored as dedicated columns:

| Category | Attributes |
| --- | --- |
| Operation | gen_ai.operation.name, gen_ai.provider.name |
| Model | gen_ai.request.model, gen_ai.response.model, gen_ai.response.id |
| Token usage | gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cache_creation.input_tokens, gen_ai.usage.cache_read.input_tokens, gen_ai.usage.reasoning.output_tokens |
| Request params | gen_ai.request.temperature, gen_ai.request.max_tokens, gen_ai.request.top_p, gen_ai.request.top_k, gen_ai.request.stream, gen_ai.request.encoding_formats, gen_ai.request.seed, gen_ai.request.frequency_penalty, gen_ai.request.presence_penalty, gen_ai.request.stop_sequences, gen_ai.request.choice.count |
| Response | gen_ai.response.finish_reasons, gen_ai.response.time_to_first_chunk, gen_ai.output.type |
| Agent | gen_ai.agent.name, gen_ai.agent.id, gen_ai.agent.description, gen_ai.agent.version |
| Conversation | gen_ai.conversation.id, gen_ai.data_source.id |
| Tool calls | gen_ai.tool.name, gen_ai.tool.type, gen_ai.tool.description, gen_ai.tool.call.id, gen_ai.tool.call.arguments, gen_ai.tool.call.result |
| Prompt / Workflow | gen_ai.prompt.name, gen_ai.workflow.name |
| Embeddings | gen_ai.embeddings.dimension.count |
| Retrieval | gen_ai.retrieval.documents, gen_ai.retrieval.query.text |
| Content (opt-in) | gen_ai.input.messages, gen_ai.output.messages, gen_ai.system_instructions, gen_ai.tool.definitions |
| Server | server.address, server.port |
| Error | error.type |
| Evaluation | gen_ai.evaluation.result events with name, score.label, score.value, explanation |

See the GenAI semantic conventions page for Google ADK / Vertex vendor fallback details and sensitive-content redaction rules.

These attributes are set by your application code on spans. Scouter doesn’t auto-instrument LLM calls; your framework or manual instrumentation provides the attributes. What Scouter does is store them in a columnar layout where each attribute is a typed column with dictionary encoding on high-repetition fields (provider_name, request_model, response_model, operation_name, output_type). Token columns use delta encoding. The result: queries like “average input tokens by model over the last 24 hours” scan compressed, sorted columns instead of parsing JSON blobs.
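To illustrate the query shape only, the sketch below runs that aggregation with Python's built-in sqlite3 standing in for DataFusion. The column names request_model and input_tokens follow the gen_ai_spans columns named on this page; the presence of a start_time column on that table, the rows, and the model names are assumptions, and the real engine's SQL dialect may differ.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gen_ai_spans (request_model TEXT, input_tokens INTEGER, start_time TEXT)"
)

# Fixed clock so the example is repeatable.
now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    ("model-a", 100, now - timedelta(hours=1)),
    ("model-a", 300, now - timedelta(hours=3)),
    ("model-b", 50, now - timedelta(hours=2)),
    ("model-b", 900, now - timedelta(hours=30)),  # outside the 24-hour window
]
conn.executemany(
    "INSERT INTO gen_ai_spans VALUES (?, ?, ?)",
    [(model, tokens, ts.isoformat()) for model, tokens, ts in rows],
)

# ISO-8601 strings with the same UTC offset compare correctly as text.
cutoff = (now - timedelta(hours=24)).isoformat()
avg_by_model = dict(
    conn.execute(
        "SELECT request_model, AVG(input_tokens) FROM gen_ai_spans "
        "WHERE start_time >= ? GROUP BY request_model ORDER BY request_model",
        (cutoff,),
    ).fetchall()
)
```

The query touches only three typed columns, which is the access pattern a columnar layout is built for.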

GenAI spans live in a separate Delta Lake table (gen_ai_spans) alongside the base trace span table (trace_spans). Both tables share trace_id and span_id for correlation. The separation keeps the base trace schema lean while giving the GenAI table freedom to evolve with the OTEL specification without schema conflicts.


Spans flow through five stages from your application to Delta Lake storage.

graph LR
    subgraph "Application"
        A["span.end()"]
    end

    subgraph "Rust exporter"
        A --> B["BatchSpanProcessor<br/>512 spans / 5s flush"]
        B --> C["ScouterSpanExporter<br/>→ ExportTraceServiceRequest"]
    end

    subgraph "Transport"
        C --> D["gRPC / HTTP / Kafka /<br/>RabbitMQ / Redis"]
    end

    subgraph "Server ingestion"
        D --> R["Message router"]
        R --> E["Trace channel<br/>flume (bounded: 500)"]
        R --> E2["Server records channel<br/>flume (bounded: 1000)"]
        R --> E3["Tag channel<br/>flume (bounded: 200)"]
    end

    subgraph "Trace workers"
        E --> F["MessageHandler"]
        F --> G["TraceSpanService<br/>(buffer actor)"]
        F --> H["GenAiSpanService<br/>(if gen_ai attrs present)"]
        F --> I["PostgreSQL<br/>(baggage)"]
        G --> J["TraceSpanDBEngine<br/>(Delta Lake writer)"]
        H --> K["GenAiSpanDBEngine<br/>(Delta Lake writer)"]
    end

  1. Your span ends. The Rust BatchSpanProcessor queues it. When 512 spans accumulate or 5 seconds pass, the batch exports.

  2. ScouterSpanExporter converts the OTEL SpanData batch into a single ExportTraceServiceRequest (gRPC protobuf format). This is the standard OTLP wire format.

  3. The serialized request hits the server through whatever transport you configured. The server’s message router dispatches it to one of three dedicated flume channels based on record type: trace records (bounded at 500), server records like drift profiles (bounded at 1000), and tag records (bounded at 200). Each channel has its own worker pool, so trace ingestion never competes with drift record processing for channel capacity.

  4. Trace workers parse the OTLP request into internal types: TraceSpanRecord (span data), TraceBaggageRecord (W3C baggage), and TagRecord (indexed tags). Span data goes to Delta Lake. Baggage goes to PostgreSQL.

  5. If the span has gen_ai.* attributes, a GenAiSpanRecord is extracted and sent to the GenAI storage engine in parallel.
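The size-or-time flush rule from step 1 can be sketched in a few lines. This is a simplified Python model, not the Rust BatchSpanProcessor: the class name is hypothetical, the age is measured from the oldest queued item, and the clock is injectable so the behaviour can be exercised without sleeping.

```python
import time
from typing import Callable, List, Optional


class BatchBuffer:
    """Flush when max_size items are queued or the oldest item is max_age seconds old."""

    def __init__(
        self,
        export: Callable[[List[dict]], None],
        max_size: int = 512,
        max_age: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._export = export
        self._max_size = max_size
        self._max_age = max_age
        self._clock = clock
        self._items: List[dict] = []
        self._first_at: Optional[float] = None

    def add(self, span: dict) -> None:
        if not self._items:
            self._first_at = self._clock()  # start the age timer on the first item
        self._items.append(span)
        if len(self._items) >= self._max_size:
            self.flush()

    def tick(self) -> None:
        """Call periodically; flushes a partial batch once it is old enough."""
        if self._items and self._clock() - self._first_at >= self._max_age:
            self.flush()

    def flush(self) -> None:
        if not self._items:
            return
        batch, self._items, self._first_at = self._items, [], None
        self._export(batch)
```

The time trigger is what keeps a low-traffic service from holding spans indefinitely while waiting for a full batch.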

Spans don’t write to Delta Lake one at a time. A buffer actor accumulates spans and flushes on a size threshold or a 5-second timer. The flushed batch goes to the engine actor, which converts spans to an Arrow RecordBatch and commits to Delta Lake.

Each pod runs a single engine actor that serializes all local writes through an mpsc channel. Across pods, Delta Lake’s optimistic concurrency control handles concurrent appends to the same table on shared object storage. Multiple pods can write spans simultaneously without coordination; the Delta log resolves conflicts via OCC.

Compaction is a different story. Only one pod should compact at a time, and the ControlTableEngine (a Delta-backed coordination table) enforces this via distributed task claiming. Post-compaction vacuum runs with retention_hours=0, which assumes no concurrent reader is referencing an older table version. This aggressive vacuum is safe under the current architecture because every pod refreshes its snapshot on a short interval (SCOUTER_TRACE_REFRESH_INTERVAL_SECS, default 10 seconds). The caveat is that a pod which has not refreshed since the vacuum can still reference deleted files, so the refresh interval bounds how long that exposure lasts.

After each write, the engine does an atomic catalog.swap() on the DataFusion SessionContext. This replaces the TableProvider in a single operation with no deregister/register gap. Queries in flight see the old provider (consistent); new queries see the fresh data.

The search_blob column is pre-computed at ingest: service name, span name, and flattened attributes concatenated into a single string. Attribute filter queries use LIKE '%key:value%' with a custom match_attr UDF. No JSON parsing at query time.
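A minimal sketch of the idea, assuming a space-separated "key:value" layout and dotted keys for nested attributes. The actual blob format and the match_attr UDF are not specified on this page, so the function names and layout here are illustrative only.

```python
from typing import Any, Dict, Iterator, Tuple


def flatten(attrs: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted_key, value) pairs from a possibly nested attribute map."""
    for key in sorted(attrs):
        value = attrs[key]
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, full_key)
        else:
            yield full_key, str(value)


def build_search_blob(service_name: str, span_name: str, attrs: Dict[str, Any]) -> str:
    # Assumed layout: "service span key:value key:value ..."
    parts = [service_name, span_name]
    parts.extend(f"{k}:{v}" for k, v in flatten(attrs))
    return " ".join(parts)


def blob_matches(blob: str, key: str, value: str) -> bool:
    """Substring check, equivalent in spirit to LIKE '%key:value%'."""
    return f"{key}:{value}" in blob
```

Note that a plain substring test also matches longer values that share a prefix (retries:2 is found inside retries:20); this sketch does not attempt to handle that case.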

Span hierarchy fields (depth, span_order, path, root_span_id) are not stored. They’re computed at query time via a Rust DFS traversal. This matches how Jaeger and Zipkin handle it, and avoids ingest ordering dependencies (child spans can arrive before parents).
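The query-time computation can be sketched as an iterative depth-first traversal. This is a Python illustration, not Scouter's Rust code: the dict-based span shape, the treatment of orphaned spans as roots, and sibling ordering by start_time are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def build_span_tree(spans: List[dict]) -> List[dict]:
    """Return spans in DFS order with depth, span_order, path, root_span_id added.

    Each input dict needs span_id, parent_span_id (None for roots), start_time.
    Input order does not matter, so children may be listed before parents.
    """
    known = {s["span_id"] for s in spans}
    children: Dict[Optional[str], List[dict]] = defaultdict(list)
    for span in spans:
        parent = span["parent_span_id"]
        # Spans whose parent is missing from the result set are treated as roots.
        children[parent if parent in known else None].append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s["start_time"])

    ordered: List[dict] = []
    # An explicit stack avoids recursion limits on very deep traces.
    stack = [(root, 0, [], root["span_id"]) for root in reversed(children[None])]
    while stack:
        span, depth, parent_path, root_id = stack.pop()
        path = parent_path + [span["span_id"]]
        ordered.append(
            {
                **span,
                "depth": depth,
                "span_order": len(ordered),
                "path": path,
                "root_span_id": root_id,
            }
        )
        for child in reversed(children[span["span_id"]]):
            stack.append((child, depth + 1, path, root_id))
    return ordered
```

Because the hierarchy is derived from parent_span_id after all matching rows are loaded, a child span stored before its parent causes no problem.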


Queries go through DataFusion, which handles partition pruning, bloom filter evaluation, and predicate pushdown against the Delta Lake table.

sequenceDiagram
    participant C as Client
    participant API as API route
    participant DF as DataFusion SessionContext
    participant DL as Delta Lake (object store)
    participant CS as CachingStore

    C->>API: GET /scouter/v1/traces/{trace_id}
    API->>DF: SELECT * FROM trace_spans WHERE trace_id = ? AND start_time BETWEEN ? AND ?
    DF->>DL: Partition pruning (partition_date)
    DF->>CS: Row group pruning (bloom filter on trace_id)
    CS-->>DF: Cached or fetched Parquet ranges
    DF-->>API: RecordBatch results
    API->>API: build_span_tree() (Rust DFS)
    API-->>C: Vec of Span with hierarchy

Three pruning layers fire in sequence:

  1. Partition pruning. The partition_date column (Hive-style partitioning) eliminates entire date directories. A query for “last 2 hours” skips all other days.

  2. Bloom filter pruning. Each row group carries a bloom filter on trace_id (FPP 0.01). For single-trace lookups, roughly 99% of the row groups that do not contain the trace are skipped within the surviving partitions; the remaining ~1% are false positives that still get read. DataFusion’s reorder_filters=true ensures the bloom filter evaluates before more expensive predicates.

  3. Column statistics. Page-level min/max statistics on start_time and status_code further narrow the scan within surviving row groups.
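The first layer is easy to reason about in isolation. The sketch below lists which partition_date values a time-bounded query has to touch, assuming partitions are keyed by the UTC calendar date of the span (an assumption; the page does not state the timezone). The function name is hypothetical.

```python
from datetime import date, datetime, timedelta, timezone
from typing import List


def partitions_for_range(start: datetime, end: datetime) -> List[date]:
    """List the partition_date values a [start, end] query must touch.

    Both datetimes must be timezone-aware; dates are taken in UTC.
    """
    if end < start:
        raise ValueError("end must not be before start")
    first = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]
```

A two-hour window normally resolves to a single date directory, or two when it straddles midnight; every other directory is never opened.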

After DataFusion returns the matching RecordBatch, a Rust DFS traversal builds the span tree: assigning depth, ordering, paths, and root span IDs. Deep traces (hundreds of spans) will be slower to assemble, but typical agent traces have tens of spans.

A CachingStore wraps the object store (S3, GCS, Azure, or local) with two caches:

  • HEAD cache: file metadata lookups. 10,000 entries, 1-hour TTL.
  • Range cache: Parquet byte ranges up to 2 MB. Sized by configurable memory budget, 1-hour TTL. Covers footer reads and column index lookups.
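A bounded cache with a TTL, similar in spirit to the HEAD cache above, can be sketched as follows. The class name is hypothetical and the least-recently-used eviction policy is an assumption; the page only states the entry limit and TTL.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional, Tuple


class TtlLruCache:
    """Bounded cache: entries expire after ttl seconds; least recently used goes first."""

    def __init__(
        self,
        max_entries: int = 10_000,
        ttl: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_entries = max_entries
        self._ttl = ttl
        self._clock = clock
        self._data: "OrderedDict[Hashable, Tuple[float, Any]]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._data[key]  # expired
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = (self._clock(), value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_entries:
            self._data.popitem(last=False)  # evict least recently used

    def invalidate(self, key: Hashable) -> None:
        """Drop an entry explicitly, for example after a delete operation."""
        self._data.pop(key, None)
```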

After Z-ORDER compaction, Parquet files are immutable, so caching is safe. Delete operations invalidate the relevant cache entries.


Trace data lives in Delta Lake on any supported object store. Two tables, same storage path, same optimization strategy.

| Backend | Config | Notes |
| --- | --- | --- |
| Local filesystem | SCOUTER_STORAGE_URI=./scouter_storage | Default. Good for development. |
| Amazon S3 | SCOUTER_STORAGE_URI=s3://bucket/path | Uses AmazonS3Builder::from_env(). Region via settings. |
| Google Cloud Storage | SCOUTER_STORAGE_URI=gs://bucket/path | Credentials via GOOGLE_ACCOUNT_JSON_BASE64. |
| Microsoft Azure | SCOUTER_STORAGE_URI=az://container/path | Dual env var naming supported (AZURE_STORAGE_ACCOUNT_NAME or AZURE_STORAGE_ACCOUNT). |

Cloud connections use pooled HTTP clients: 64 max idle connections per host, 120-second idle timeout, 30-second request timeout.

| Table | Purpose | Key columns | Partition |
| --- | --- | --- | --- |
| trace_spans | All spans from all services | trace_id, span_id, parent_span_id, service_name, span_name, start_time, duration_ms, attributes, events, links, input, output, search_blob | partition_date |
| gen_ai_spans | GenAI-specific span data | trace_id, span_id, provider_name, request_model, response_model, input_tokens, output_tokens, operation_name, agent_name, tool_name, conversation_id, + ~30 more | partition_date |

Both tables correlate on (trace_id, span_id). The base table stores everything; the GenAI table stores the typed, columnar representation of gen_ai.* attributes that would otherwise be buried in a JSON attributes map.

| Setting | Value | Why |
| --- | --- | --- |
| Row group size | 32,768 rows | ~4 groups per 128 MB file. Enough granularity for bloom filter and statistics pruning. |
| Bloom filters | trace_id (FPP 0.01, NDV 32,768), service_name, conversation_id | Single-trace lookups skip 99% of row groups. |
| Column statistics | Page-level on start_time, status_code | Finest-grained time pruning within row groups. |
| Delta encoding | start_time, duration_ms, input_tokens, output_tokens | 4-8x compression on sorted numeric sequences. |
| Dictionary encoding | service_name, span_name, provider_name, request_model, response_model, operation_name, output_type | Low-cardinality string columns compress to integer indices. |
| Compression | ZSTD level 3 | ~40% better than Snappy at marginal decompression cost. |

Z-ORDER compaction rewrites Parquet files sorted by a space-filling curve on (start_time, service_name). This co-locates data from the same service and time window, which makes time-bounded queries per service scan fewer files.

Compaction runs on a schedule controlled by the ControlTableEngine, a Delta Lake table that coordinates background tasks across pods. Each task (compaction, vacuum, retention) uses Delta’s optimistic concurrency to claim ownership:

  1. Pod calls try_claim_task(). Delta OCC ensures only one pod wins.
  2. Winner runs compaction (target: 128 MB files). Stale claims (older than 30 minutes) are auto-reclaimed.
  3. Immediate vacuum follows (retention_hours=0). This is aggressive but safe because all pods refresh their Delta snapshot on a short interval, so no pod holds references to stale files for long.
  4. Pod releases the task and schedules the next run.
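The claim protocol above can be modelled in memory. In this sketch a lock stands in for Delta's optimistic concurrency, so only one caller can win a given claim. try_claim_task is the name used above; the ControlTable class, release_task, and the explicit now parameter are hypothetical.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional

STALE_AFTER_SECS = 30 * 60  # claims older than 30 minutes can be taken over


@dataclass
class Claim:
    owner: str
    claimed_at: float


class ControlTable:
    """In-memory stand-in for the coordination table; the lock plays the role of OCC."""

    def __init__(self) -> None:
        self._claims: Dict[str, Claim] = {}
        self._lock = threading.Lock()

    def try_claim_task(self, task: str, pod: str, now: float) -> bool:
        with self._lock:
            current: Optional[Claim] = self._claims.get(task)
            if current is not None and now - current.claimed_at <= STALE_AFTER_SECS:
                return False  # another pod holds a fresh claim
            self._claims[task] = Claim(owner=pod, claimed_at=now)
            return True

    def release_task(self, task: str, pod: str) -> bool:
        with self._lock:
            current = self._claims.get(task)
            if current is None or current.owner != pod:
                return False  # the claim was reclaimed by someone else
            del self._claims[task]
            return True
```

The ownership check on release matters: a pod whose stale claim was taken over must not clear the new owner's claim when it eventually finishes.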

Data retention uses a logical delete (DELETE WHERE partition_date < cutoff) followed by a vacuum to reclaim disk space. Configurable via DATA_RETENTION_PERIOD (default 30 days).

In a multi-pod deployment, every pod runs its own engine actor and can write spans. Delta Lake OCC handles concurrent appends. Each pod also refreshes its in-memory snapshot on an interval to pick up commits from other pods:

| Variable | Default | Effect |
| --- | --- | --- |
| SCOUTER_TRACE_REFRESH_INTERVAL_SECS | 10 | How often each pod calls update_incremental() to pick up new commits from shared object storage. |

Lower values mean faster cross-pod visibility. Higher values reduce object-store LIST calls. The refresh updates the DataFusion TableProvider via an atomic catalog.swap(). This interval also sets the floor for vacuum safety: if a pod hasn’t refreshed before vacuum runs, it could hold references to deleted files. At the default 10-second interval, this is not a practical concern.

The DataFusion SessionContext is tuned for trace workloads:

| Setting | Value | Purpose |
| --- | --- | --- |
| Target partitions | CPU cores (min 4) | Parallel execution across row groups |
| Batch size | 8,192 rows | In-memory record batch buffer |
| Parquet pruning | enabled | Row group filtering via statistics |
| Pushdown filters | enabled | Push predicates into Parquet reader |
| Reorder filters | enabled | Evaluate bloom filters before expensive predicates |
| Bloom filter on read | enabled | Consult bloom filters before decoding row groups |
| Schema force view types | enabled | Keep Utf8View/BinaryView zero-copy |
| Metadata size hint | 1 MB | Reduce cloud round-trips for Parquet footers |
| Meta fetch concurrency | 64 | Parallel file stats fetching (matches connection pool) |

Evaluation does not depend on tracing by default. span.attach_eval() creates an EvalRecord, fills in trace_id and span_id from the active span, and stamps the span with scouter.eval.record_uid and scouter.eval.profile_uid. That is correlation metadata. It does not, by itself, make the record wait for trace storage.

The server decides whether trace availability gates the workflow by reading the registered AgentEvalProfile. If the profile has no TraceAssertionTask, the record is inserted as pending immediately and evaluated like any other online record. The trace IDs remain useful for later lookup, but they do not slow down evaluation.

The inbox pattern exists for one case: profiles that include TraceAssertionTask. Those tasks read spans from Delta Lake, so Scouter must make sure the specific anchor span is committed before the poller runs the workflow. For those profiles, readiness means the exact (record_uid, trace_id, span_id) anchor has committed to Delta Lake, not merely that some span with the same trace_id exists.

The important split for trace-assertion workflows is this: Delta Lake is the source of truth, PostgreSQL is the eval workflow state machine, and the in-memory channel is only a latency path. If the channel drops an anchor event, correctness still comes from reconciliation scanning Delta and recreating the queue row.

flowchart TD
    subgraph APP["Application process"]
        SP["Active span"]
        AE["span.attach_eval()<br/>creates EvalRecord<br/>stamps span metadata"]
        ER["EvalRecord<br/>context + trace_id + span_id"]
        END["span.end()"]
        SP --> AE --> ER
        SP --> END
    end

    subgraph EVAL_INGEST["Eval record ingestion"]
        MR["Message router"]
        INSERT["insert agent_eval_record"]
        NEEDS_TRACE{"workflow includes<br/>TraceAssertionTask?"}
        PROBE["probe trace_commit_event<br/>by anchor tuple"]
        FAIL_TRACE["failed(EvalRequiresTrace)"]
        FAIL_ANCHOR["failed(EvalRequiresAnchorSpan)"]
        FAIL_TIMEOUT["failed(TraceArrivalTimeout)"]
        AWAIT["agent_eval_record<br/>status = awaiting_trace"]
        PENDING_FAST["agent_eval_record<br/>status = pending"]
        PENDING_TRACE["agent_eval_record<br/>status = pending<br/>ready_at = now + visibility buffer"]
    end

    subgraph TRACE_INGEST["Trace ingestion"]
        EXPORT["BatchSpanProcessor / exporter"]
        TRACE_CH["trace channel"]
        BUFFER["TraceSpanService<br/>buffer actor"]
        ENGINE["TraceSpanDBEngine<br/>Arrow + Delta append"]
        EXTRACT["extract committed anchors<br/>only spans with record_uid + profile_uid"]
    end

    subgraph STORES["Durable stores"]
        DELTA["Delta Lake trace_spans<br/>source of truth"]
        QUEUE["PostgreSQL trace_commit_event<br/>durable queue + audit log"]
    end

    subgraph LIVE["Live latency path"]
        COMMIT_TX["bounded commit_tx<br/>try_send, may drop"]
        WRITER["inbox writer<br/>ON CONFLICT DO NOTHING"]
    end

    subgraph WORKERS["Background workers"]
        DRAIN["inbox drain<br/>claim pending rows<br/>FOR UPDATE SKIP LOCKED"]
        COMPLETE["complete owned rows<br/>claim_token fenced"]
        RECON["reconciliation sweep<br/>old awaiting_trace rows"]
        LEASE["lease recovery<br/>stale processing rows"]
        TIMEOUT["timeout sweep<br/>TraceArrivalTimeout"]
        POLLER["AgentPoller<br/>SELECT pending rows"]
        EVAL["task DAG + TraceAssertionTask"]
        RESULTS["workflow results + alerts"]
    end

    ER --> MR --> INSERT --> NEEDS_TRACE
    NEEDS_TRACE -->|"no"| PENDING_FAST
    NEEDS_TRACE -->|"yes, missing trace_id"| FAIL_TRACE
    NEEDS_TRACE -->|"yes, trace_id present,<br/>missing span_id"| FAIL_ANCHOR
    NEEDS_TRACE -->|"yes, complete anchor"| PROBE
    PROBE -->|"exact anchor queued"| PENDING_TRACE
    PROBE -->|"not seen yet"| AWAIT

    END --> EXPORT --> TRACE_CH --> BUFFER --> ENGINE --> DELTA
    ENGINE --> EXTRACT --> COMMIT_TX
    COMMIT_TX -->|"delivered"| WRITER --> QUEUE
    COMMIT_TX -.->|"dropped / pod died"| RECON

    AWAIT --> RECON
    RECON -->|"query Delta by record_uid,<br/>trace_id, span_id, time window"| DELTA
    RECON -->|"insert missing anchors"| QUEUE

    QUEUE --> DRAIN --> COMPLETE
    COMPLETE -->|"flip matching anchor tuple only"| PENDING_TRACE
    QUEUE --> LEASE -->|"retry or dead_lettered"| QUEUE
    AWAIT --> TIMEOUT -->|"older than timeout"| FAIL_TIMEOUT

    PENDING_FAST --> POLLER
    PENDING_TRACE -->|"after ready_at"| POLLER --> EVAL
    EVAL -->|"fetch spans by trace_id"| DELTA
    EVAL --> RESULTS

Read the diagram by concern, one subgraph at a time:

  • Eval record ingestion writes agent_eval_record. If the workflow has no TraceAssertionTask, the record goes straight to pending. If trace assertions are present, ingestion never guesses that a trace is ready because some span in the trace exists; it either finds the exact anchor tuple or leaves the record in awaiting_trace.
  • Trace ingestion writes spans to Delta first. Only after the Delta commit succeeds does the engine extract anchor spans and send them toward the inbox writer.
  • The live path is deliberately best-effort. commit_tx is bounded and uses try_send; dropping an event is acceptable because the event can be rebuilt from Delta.
  • The durable queue is trace_commit_event. The inbox worker claims queue rows, completes the rows it owns, and flips only eval rows that match the returned (record_uid, trace_id, span_id) anchor.
  • Reconciliation is the correctness path. It scans older awaiting_trace rows, queries Delta for their anchor spans, and inserts any missing queue rows. It does not update agent_eval_record directly.
  • The poller evaluates pending records. Only records that were gated for trace assertions receive the ready_at visibility buffer; normal eval records do not wait on tracing.
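The ingestion decision in the diagram reduces to a small pure function. The sketch below mirrors the branches described above; the Record type, the function name, and the set-based anchor lookup are hypothetical stand-ins for the server's profile check and its trace_commit_event probe.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple

Anchor = Tuple[str, str, str]  # (record_uid, trace_id, span_id)


@dataclass
class Record:
    record_uid: str
    trace_id: Optional[str] = None
    span_id: Optional[str] = None


def route_record(
    record: Record, has_trace_assertions: bool, queued_anchors: Set[Anchor]
) -> str:
    """Return the initial status for a newly ingested eval record."""
    if not has_trace_assertions:
        return "pending"  # trace IDs, if any, are correlation metadata only
    if record.trace_id is None:
        return "failed(EvalRequiresTrace)"
    if record.span_id is None:
        return "failed(EvalRequiresAnchorSpan)"
    anchor = (record.record_uid, record.trace_id, record.span_id)
    if anchor in queued_anchors:
        return "pending"  # gated path: polled only after ready_at
    return "awaiting_trace"
```

The lookup uses the full anchor tuple, so another span from the same trace being present never marks a record as ready.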

Both examples below create eval records for the same online evaluation system. The difference is whether the record includes trace correlation metadata.

Trace-correlated record:

# tracer, profile, query, and response come from your application setup.
with tracer.start_as_current_span("agent.callback") as span:
    span.attach_eval(
        profile_uid=profile.config.uid,
        context={"query": query, "response": response},
        record_id="turn-7",      # optional: scenario/turn/step ID
        session_id="sess-123",   # optional: session/conversation ID
        media=None,              # optional: multimodal evidence
        tags=["run=abc"],        # optional: key=value tags
    )

  • attach_eval() creates the EvalRecord; the server still owns workflow routing.
  • If the registered profile has no TraceAssertionTask, the record is inserted as pending.
  • If the profile has TraceAssertionTask, the server checks trace_commit_event for the exact (record_uid, trace_id, span_id, profile_uid) anchor.
  • If the anchor queue row already exists, the record is inserted as pending with a visibility delay.
  • If the anchor has not been delivered yet, the record is inserted as awaiting_trace.

Content-only record:

# queue, profile, query, and response come from your application setup.
queue.insert(EvalRecord(
    profile_uid=profile.config.uid,
    context={"query": query, "response": response},
    record_id="turn-7",
    session_id="sess-123",
))

  • No trace_id or span_id is attached.
  • If the profile has no TraceAssertionTask, the record is inserted as pending.
  • If the profile has TraceAssertionTask, the record fails with failed(EvalRequiresTrace) because the workflow asked for span data that this record cannot provide.

The trace_commit_event table is a durable queue and audit log for anchor spans. It is only used to unblock eval records waiting on trace assertions. Rows move through pending → processing → processed | dead_lettered and carry attempt_count, claimed_at, claimed_by, claim_token, and last_error. The inbox worker claims rows with FOR UPDATE SKIP LOCKED, completes owned queue rows, then flips only eval rows that match the returned anchor tuple.
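The row lifecycle and the role of claim_token can be sketched in memory. This is an illustration of token fencing, not the PostgreSQL implementation: the class and method names are hypothetical, the default max_attempts value is an assumption, and row locking (FOR UPDATE SKIP LOCKED) is not modelled.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class QueueRow:
    status: str = "pending"  # pending -> processing -> processed | dead_lettered
    attempt_count: int = 0
    claimed_by: Optional[str] = None
    claim_token: Optional[str] = None


class CommitEventQueue:
    def __init__(self, max_attempts: int = 3) -> None:
        self.rows: Dict[str, QueueRow] = {}
        self.max_attempts = max_attempts

    def claim(self, row_id: str, worker: str) -> Optional[str]:
        row = self.rows[row_id]
        if row.status != "pending":
            return None
        row.status, row.claimed_by = "processing", worker
        row.attempt_count += 1
        row.claim_token = uuid.uuid4().hex  # fresh token per claim
        return row.claim_token

    def complete(self, row_id: str, token: str) -> bool:
        row = self.rows[row_id]
        if row.status != "processing" or row.claim_token != token:
            return False  # fenced: this worker no longer owns the row
        row.status = "processed"
        return True

    def recover_lease(self, row_id: str) -> None:
        """Called for rows stuck in processing past the lease TTL."""
        row = self.rows[row_id]
        if row.status != "processing":
            return
        exhausted = row.attempt_count >= self.max_attempts
        row.status = "dead_lettered" if exhausted else "pending"
        row.claimed_by = row.claim_token = None
```

Fencing is what protects against a slow worker: if its lease expired and another worker re-claimed the row, the slow worker's completion carries an outdated token and is rejected.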

Delta Lake remains the source of truth. The live channel from the engine actor is a latency optimization: it uses bounded try_send, may drop when full or closed, and increments scouter_trace_commit_event_channel_drop_total. A reconciliation sweep recovers correctness by scanning old awaiting_trace eval rows, querying Delta with bounded time predicates, and inserting missing queue rows with ON CONFLICT DO NOTHING. Reconciliation never writes agent_eval_record.
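The best-effort send can be sketched with a bounded standard-library queue. This is a Python analogue of a bounded try_send, not the Rust channel itself; the class name and the dropped counter are hypothetical stand-ins for the real channel and its drop metric.

```python
import queue
from typing import List, Tuple

Anchor = Tuple[str, str, str]  # (record_uid, trace_id, span_id)


class CommitChannel:
    """Bounded, non-blocking sender: a full queue drops the event and counts it."""

    def __init__(self, capacity: int) -> None:
        self._queue: "queue.Queue[Anchor]" = queue.Queue(maxsize=capacity)
        self.dropped = 0  # analogous to a channel drop counter metric

    def try_send(self, anchor: Anchor) -> bool:
        try:
            self._queue.put_nowait(anchor)
            return True
        except queue.Full:
            self.dropped += 1  # never block the writer; reconciliation recovers it
            return False

    def drain(self) -> List[Anchor]:
        items: List[Anchor] = []
        while True:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                return items
```

Never blocking keeps the Delta write path independent of inbox throughput; the cost is that a dropped anchor is only picked up on a later reconciliation tick.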

Failure modes:

| Scenario | Behavior |
| --- | --- |
| Record has trace IDs but workflow has no trace assertions | Record inserts as pending; tracing is correlation metadata only |
| Anchor commits before record arrives | Short-circuit: record inserted as pending by probing the exact anchor |
| Anchor commits after record inserted as awaiting_trace | Inbox worker completes the owned queue row, then flips the eval row that matches (record_uid, trace_id, span_id) to pending |
| Live anchor channel drops | Reconciliation finds the anchor in Delta and inserts the missing queue row |
| Worker dies while processing | Lease recovery returns the row to pending or moves it to dead_lettered after max attempts |
| Trace never commits (client crashed, network partition) | Record remains awaiting_trace for 5 minutes, then marked failed(TraceArrivalTimeout) |
| Profile requires trace assertions but record has no trace_id | Record fails immediately with failed(EvalRequiresTrace) |
| Profile requires trace assertions and record has trace_id but no span_id | Record fails immediately with failed(EvalRequiresAnchorSpan) |

Inbox worker: Claims pending queue rows, completes rows fenced by claim_token, and flips only eval rows matching the returned (record_uid, trace_id, span_id) anchors to pending.

Reconciliation sweep: Scans old awaiting_trace rows and queries Delta for their anchor spans. It inserts missing trace_commit_event rows only.

Lease recovery sweep: Recovers stuck processing rows after the lease TTL. Rows are retried until max_attempts, then kept as dead_lettered for operator triage.

Timeout sweep: Background task that scans awaiting_trace rows older than 5 minutes (configurable via TRACE_ARRIVAL_TIMEOUT_SECS). Marks them as failed(TraceArrivalTimeout) and emits a metric for alerting.

Prune sweep: Removes old processed queue rows after 24 hours. Pending, processing, and dead-lettered rows are not pruned automatically.

Operators should alert on sustained channel drops, non-zero dead-lettered rows, rising pending count, and oldest pending age above the reconciliation interval. The main metrics are scouter_trace_commit_event_pending_count, scouter_trace_commit_event_oldest_pending_age_seconds, scouter_trace_commit_event_channel_drop_total, scouter_trace_commit_event_reconciled_total, scouter_trace_commit_event_dead_lettered_total, scouter_trace_commit_event_lease_recovered_total, and the claim/flip histograms.

Migration note: 20260512000000_trace_commit_anchor_queue.sql is forward-only. Stop the server during the migration; the old trace-level inbox rows are intentionally dropped and any remaining awaiting_trace rows are recovered by reconciliation or failed by the timeout sweep.

| Variable | Default | Effect |
| --- | --- | --- |
| TRACE_ARRIVAL_TIMEOUT_SECS | 300 | How long to wait for a trace to commit before failing the record |
| INBOX_POLL_INTERVAL_SECS | 10 | How often the inbox worker checks for new commit events |
| TRACE_COMMIT_EVENT_RETENTION_HOURS | 24 | How long to keep processed commit event rows before pruning |
| SCOUTER_TRACE_REFRESH_INTERVAL_SECS | 10 | How often each pod refreshes its Delta snapshot |
| SCOUTER_TRACE_VISIBILITY_BUFFER_SECS | refresh + 2 | Delay before pending trace evals are polled; startup fails if below refresh + 2 |
| SCOUTER_INBOX_RECONCILE_AFTER_SECS | 15 | How old an awaiting_trace row must be before reconciliation scans for it |
| SCOUTER_INBOX_RECONCILE_LOOKBACK_SECS | 86400 | Maximum supported anchor span start lookback used by reconciliation Delta queries |
| SCOUTER_INBOX_RECONCILE_INTERVAL_SECS | 60 | Reconciliation tick interval |
| SCOUTER_INBOX_RECONCILE_BATCH | 200 | Awaiting rows scanned per reconciliation tick |
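The relationship between the refresh interval and the visibility buffer can be sketched as a startup check. This is a Python illustration of the documented rule (the buffer defaults to refresh + 2 and must not be lower); the function name and the returned dict shape are hypothetical, and the real server performs this validation in Rust.

```python
import os
from typing import Mapping


def load_trace_timing(env: Mapping[str, str] = os.environ) -> dict:
    """Read trace timing settings and enforce buffer >= refresh interval + 2."""
    refresh = int(env.get("SCOUTER_TRACE_REFRESH_INTERVAL_SECS", "10"))
    min_buffer = refresh + 2
    buffer_secs = int(env.get("SCOUTER_TRACE_VISIBILITY_BUFFER_SECS", str(min_buffer)))
    if buffer_secs < min_buffer:
        raise ValueError(
            f"SCOUTER_TRACE_VISIBILITY_BUFFER_SECS={buffer_secs} is below "
            f"refresh interval + 2 ({min_buffer})"
        )
    return {
        "refresh_interval_secs": refresh,
        "visibility_buffer_secs": buffer_secs,
        "arrival_timeout_secs": int(env.get("TRACE_ARRIVAL_TIMEOUT_SECS", "300")),
    }
```

The buffer has to exceed the refresh interval because the pod that evaluates a record may not be the pod that wrote the anchor span, and it only sees the other pod's commit after its next snapshot refresh.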

Scouter’s observability engine and evaluation pipeline share the same trace data, but only trace-assertion workflows are gated on it. Spans in Delta Lake feed TraceAssertionTask evaluations. Records for profiles without trace assertions can still carry trace IDs for lookup, but they are evaluated without waiting for trace storage.

For trace-assertion workflows, the trace storage, evaluation engine, and alert pipeline become one data path: spans arrive, anchor events unblock matching records, evaluations run, results get stored, and alerts fire.

For Delta Lake schema details and compaction internals, see Storage architecture.