Observability and traces
Agents are multi-step pipelines. The final output tells you what happened. Traces tell you how. Without them, you can’t answer: did the retriever call the right API? Did the agent retry too many times? Was the right model used? Did latency exceed the SLA? Did total token usage blow the budget?
Scouter does not make every evaluation depend on tracing. span.attach_eval() creates an EvalRecord and records correlation metadata. The record still follows the normal online eval path unless its registered profile includes TraceAssertionTask.
When a profile does include trace assertions, spans become part of the eval input. TraceAssertionTask runs assertions directly on OpenTelemetry span properties, so Scouter uses the inbox pattern below to wait for the specific anchor span before the poller runs that workflow.
For tracing setup and instrumentation, see Distributed tracing. For the full task type reference, see Building blocks.
For system-wide ingestion pipeline details, see Architecture.
OTEL compatibility and the Rust/Python boundary
Scouter’s tracing is a standard OpenTelemetry implementation, not a proprietary format. ScouterInstrumentor implements the OTEL BaseInstrumentor interface and registers as the global TracerProvider. Any OTEL auto-instrumentation library (httpx, FastAPI, Starlette) works automatically once the instrumentor is active.
The architecture splits across two languages. Python owns the API surface: ScouterInstrumentor, get_tracer(), span context managers, attribute setting. Rust owns the performance path: span lifecycle, batching, export, and serialization. The boundary is PyO3. When you call tracer.start_as_current_span(), Python creates a thin wrapper around a Rust ActiveSpan. Attribute writes, event recording, and status updates cross the FFI boundary into Rust. The GIL is released during export.
graph LR
subgraph "Python (your application)"
SI["ScouterInstrumentor"] --> ST["ScouterTracer"]
ST --> SP["ScouterSpan (wrapper)"]
end
subgraph "Rust (PyO3)"
SP --> RS["ActiveSpan"]
RS --> BP["BatchSpanProcessor"]
BP --> SE["ScouterSpanExporter"]
BP --> OE["OtlpSpanExporter (optional)"]
end
subgraph "Export"
SE -->|"gRPC / HTTP / Kafka / RabbitMQ / Redis"| SV["Scouter server"]
OE -->|"OTLP"| EXT["External collector"]
end
Every span always goes to the Scouter server. Optionally, spans also export to an external OTEL collector (Jaeger, Datadog, Grafana Tempo) via HttpSpanExporter or GrpcSpanExporter. This dual-export design means you don’t have to choose between Scouter’s evaluation pipeline and your existing observability stack.
Span context propagation
W3C traceparent and baggage propagators are registered globally during instrumentation. Distributed traces across services work without additional configuration. ScouterTracingMiddleware (ASGI middleware) extracts incoming traceparent headers and starts server spans automatically for FastAPI/Starlette applications.
Outbound propagation is manual (get_tracing_headers_from_current_span()) or automatic via httpx auto-instrumentation.
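To make the propagated header concrete, here is a minimal sketch of parsing a W3C traceparent value. It is not Scouter's code: `parse_traceparent` and `TraceParent` are hypothetical names, and the sketch only handles the version 00 shape (`version-traceid-parentid-flags`) defined by the W3C Trace Context specification.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace ID, 16 hex parent span ID, 2 hex flags.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


example = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

A downstream service that receives this header continues the same trace_id and uses the span_id as the parent of its server span.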
Span enrichment
Beyond standard OTEL attributes, Scouter adds:
- set_input(value) / set_output(value): capture function arguments and return values (truncated to configurable max length)
- set_entity(entity_id): link the span to an evaluation entity (profile UID)
- set_tag(key, value): indexed tags stored in PostgreSQL for fast lookup
- attach_eval(profile_uid, context, *, record_id, session_id, media, tags): insert an EvalRecord with trace correlation, setting trace_id and span_id from the active span
For the full instrumentor API, see Instrumentor guide.
Instrumentation approaches: where Scouter differs
Most LLM observability platforms auto-instrument by patching library internals at runtime. Datadog’s ddtrace wraps methods like openai.resources.chat.Completions.create directly. MLflow’s autologging uses the gorilla library to replace SDK methods in-place. Arize Phoenix’s OpenInference project uses wrapt.wrap_function_wrapper on OpenAI, Anthropic, and LangChain internals (it outputs OTEL spans, but the capture mechanism is still patching). Langfuse started with drop-in client wrappers (from langfuse.openai import OpenAI), though their SDK v3 has rebuilt on OTEL internally and now also accepts raw OTLP.
Patching works. It gives you zero-config tracing for supported libraries, and for prototyping that’s a real advantage. But it couples your observability to specific library versions. MLflow’s Anthropic support, for example, is explicitly documented as compatible with versions 0.43.1 through 0.76.0. When Anthropic restructures their SDK or LangChain refactors its chain execution path (both have happened), the patches break and you wait for the vendor to update. The agent ecosystem moves fast enough that this isn’t a theoretical concern.
Scouter doesn’t auto-instrument LLM calls. It provides a standard OTEL TracerProvider and expects your application (or your framework’s native OTEL support) to emit spans with gen_ai.* attributes. The contract is the OTEL spec, not any library’s internal API. Framework upgrades don’t break tracing. A new framework works on day one if it emits standard spans.
| | Auto-instrumentation (patching) | OTEL-native (Scouter’s approach) |
|---|---|---|
| Setup effort | Zero-config for supported libraries | You set span attributes, or rely on framework OTEL support |
| Library coupling | Patches target specific SDK methods and versions | None. OTEL spec is the only contract. |
| Version resilience | Breaks when patched internals change | Survives library upgrades |
| Coverage | Captures all calls to supported libraries automatically | Only captures what you (or your framework) instruments |
| Multi-framework | Each framework needs its own patch set | Works across any OTEL-compliant framework |
| Maintenance burden | Vendor must track every SDK release | Zero per-framework maintenance |
The trade-off is real. If your framework doesn’t emit OTEL spans natively yet, you write a few lines of attribute-setting per LLM call. For production systems where you control the instrumentation, we think that’s the right trade. For quick prototyping against a single provider, auto-instrumentation is faster to start with. Both are valid choices depending on where you are in the lifecycle.
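As an illustration of the "few lines of attribute-setting per LLM call", the sketch below sets a handful of gen_ai.* attributes on a span. It assumes only that the span exposes `set_attribute(key, value)`, as OTEL spans do. `record_llm_call` and `RecordingSpan` are hypothetical helpers (the latter is a stand-in so the example runs without any tracing library installed), and the provider and model names are placeholders.

```python
from typing import Any, Dict, Optional


class RecordingSpan:
    """Stand-in for an OTEL span; real spans expose the same set_attribute call."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Any] = {}

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value


def record_llm_call(
    span: Any,
    *,
    provider: str,
    request_model: str,
    response_model: str,
    input_tokens: int,
    output_tokens: int,
    temperature: Optional[float] = None,
) -> None:
    """Attach GenAI semantic-convention attributes to the given span."""
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.provider.name", provider)
    span.set_attribute("gen_ai.request.model", request_model)
    span.set_attribute("gen_ai.response.model", response_model)
    span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
    if temperature is not None:  # only record parameters that were actually sent
        span.set_attribute("gen_ai.request.temperature", temperature)


span = RecordingSpan()
record_llm_call(
    span,
    provider="example-provider",
    request_model="example-model",
    response_model="example-model-2025",
    input_tokens=120,
    output_tokens=45,
)
```

In an application, the same helper would be called with the span returned by the tracer's context manager instead of the stand-in.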
Full OTEL GenAI semantic conventions
Scouter implements the OpenTelemetry GenAI semantic conventions across its entire storage and query layer. The GenAI span schema stores ~45 columns that map directly to the OTEL GenAI specification. This gives Scouter parity with any agentic framework that emits OTEL-compatible GenAI spans.
If your framework (LangChain, CrewAI, Google ADK, or anything else) produces standard gen_ai.* span attributes, Scouter stores them natively in a columnar format optimized for analytical queries. No attribute flattening, no JSON blob storage, no loss of structure.
The attributes stored as dedicated columns:
| Category | Attributes |
|---|---|
| Operation | gen_ai.operation.name, gen_ai.provider.name |
| Model | gen_ai.request.model, gen_ai.response.model, gen_ai.response.id |
| Token usage | gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cache_creation.input_tokens, gen_ai.usage.cache_read.input_tokens, gen_ai.usage.reasoning.output_tokens |
| Request params | gen_ai.request.temperature, gen_ai.request.max_tokens, gen_ai.request.top_p, gen_ai.request.top_k, gen_ai.request.stream, gen_ai.request.encoding_formats, gen_ai.request.seed, gen_ai.request.frequency_penalty, gen_ai.request.presence_penalty, gen_ai.request.stop_sequences, gen_ai.request.choice.count |
| Response | gen_ai.response.finish_reasons, gen_ai.response.time_to_first_chunk, gen_ai.output.type |
| Agent | gen_ai.agent.name, gen_ai.agent.id, gen_ai.agent.description, gen_ai.agent.version |
| Conversation | gen_ai.conversation.id, gen_ai.data_source.id |
| Tool calls | gen_ai.tool.name, gen_ai.tool.type, gen_ai.tool.description, gen_ai.tool.call.id, gen_ai.tool.call.arguments, gen_ai.tool.call.result |
| Prompt / Workflow | gen_ai.prompt.name, gen_ai.workflow.name |
| Embeddings | gen_ai.embeddings.dimension.count |
| Retrieval | gen_ai.retrieval.documents, gen_ai.retrieval.query.text |
| Content (opt-in) | gen_ai.input.messages, gen_ai.output.messages, gen_ai.system_instructions, gen_ai.tool.definitions |
| Server | server.address, server.port |
| Error | error.type |
| Evaluation | gen_ai.evaluation.result events with name, score.label, score.value, explanation |
See the GenAI semantic conventions page for Google ADK / Vertex vendor fallback details and sensitive-content redaction rules.
These attributes are set by your application code on spans. Scouter doesn’t auto-instrument LLM calls; your framework or manual instrumentation provides the attributes. What Scouter does is store them in a columnar layout where each attribute is a typed column with dictionary encoding on high-repetition fields (provider_name, request_model, response_model, operation_name, output_type). Token columns use delta encoding. The result: queries like “average input tokens by model over the last 24 hours” scan compressed, sorted columns instead of parsing JSON blobs.
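The two encodings mentioned above can be illustrated with a toy example. This is only a sketch of the idea, not Parquet's actual bit-packed formats, and the function names and model names are hypothetical.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace repeated strings with small integer codes into a dictionary."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def delta_encode(values: List[int]) -> List[int]:
    """Store the first value, then only the difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode with a running sum."""
    out: List[int] = []
    total = 0
    for delta in deltas:
        total += delta
        out.append(total)
    return out


models = ["model-a", "model-a", "model-b", "model-a"]
dictionary, codes = dictionary_encode(models)
token_deltas = delta_encode([1000, 1005, 1012])
```

High-repetition columns shrink because each row stores a short code, and slowly changing numeric columns shrink because the differences are much smaller than the raw values.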
GenAI spans live in a separate Delta Lake table (gen_ai_spans) alongside the base trace span table (trace_spans). Both tables share trace_id and span_id for correlation. The separation keeps the base trace schema lean while giving the GenAI table freedom to evolve with the OTEL specification without schema conflicts.
Trace write path
Spans flow through five stages from your application to Delta Lake storage.
graph LR
subgraph "Application"
A["span.end()"]
end
subgraph "Rust exporter"
A --> B["BatchSpanProcessor<br/>512 spans / 5s flush"]
B --> C["ScouterSpanExporter<br/>→ ExportTraceServiceRequest"]
end
subgraph "Transport"
C --> D["gRPC / HTTP / Kafka /<br/>RabbitMQ / Redis"]
end
subgraph "Server ingestion"
D --> R["Message router"]
R --> E["Trace channel<br/>flume (bounded: 500)"]
R --> E2["Server records channel<br/>flume (bounded: 1000)"]
R --> E3["Tag channel<br/>flume (bounded: 200)"]
end
subgraph "Trace workers"
E --> F["MessageHandler"]
F --> G["TraceSpanService<br/>(buffer actor)"]
F --> H["GenAiSpanService<br/>(if gen_ai attrs present)"]
F --> I["PostgreSQL<br/>(baggage)"]
G --> J["TraceSpanDBEngine<br/>(Delta Lake writer)"]
H --> K["GenAiSpanDBEngine<br/>(Delta Lake writer)"]
end
- Your span ends. The Rust BatchSpanProcessor queues it. When 512 spans accumulate or 5 seconds pass, the batch exports.
- ScouterSpanExporter converts the OTEL SpanData batch into a single ExportTraceServiceRequest (gRPC protobuf format). This is the standard OTLP wire format.
- The serialized request hits the server through whatever transport you configured. The server’s message router dispatches it to one of three dedicated flume channels based on record type: trace records (bounded at 500), server records like drift profiles (bounded at 1000), and tag records (bounded at 200). Each channel has its own worker pool, so trace ingestion never competes with drift record processing for channel capacity.
- Trace workers parse the OTLP request into internal types: TraceSpanRecord (span data), TraceBaggageRecord (W3C baggage), and TagRecord (indexed tags). Span data goes to Delta Lake. Baggage goes to PostgreSQL.
- If the span has gen_ai.* attributes, a GenAiSpanRecord is extracted and sent to the GenAI storage engine in parallel.
The buffer and engine actors
Spans don’t write to Delta Lake one at a time. A buffer actor accumulates spans and flushes on a size threshold or a 5-second timer. The flushed batch goes to the engine actor, which converts spans to an Arrow RecordBatch and commits to Delta Lake.
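The "flush on size or timer" behavior can be sketched as follows. This is a single-threaded illustration rather than the Rust actor: `SpanBuffer` is a hypothetical name, the size threshold is a parameter because the source does not state its value, and the clock is injectable so the timer path can be exercised without sleeping.

```python
import time
from typing import Callable, List


class SpanBuffer:
    """Accumulate spans and hand them to a sink in batches."""

    def __init__(
        self,
        sink: Callable[[List[dict]], None],
        max_size: int,
        max_age_secs: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._sink = sink
        self._max_size = max_size
        self._max_age_secs = max_age_secs
        self._clock = clock
        self._items: List[dict] = []
        self._last_flush = clock()

    def push(self, span: dict) -> None:
        """Add a span; flush immediately once the size threshold is reached."""
        self._items.append(span)
        if len(self._items) >= self._max_size:
            self._flush()

    def tick(self) -> None:
        """Call periodically; flushes a non-empty buffer once the timer expires."""
        if self._items and self._clock() - self._last_flush >= self._max_age_secs:
            self._flush()

    def _flush(self) -> None:
        batch, self._items = self._items, []
        self._last_flush = self._clock()
        self._sink(batch)  # in Scouter, this is where the engine actor takes over
```

Batching this way trades a few seconds of write latency for far fewer, larger commits.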
Each pod runs a single engine actor that serializes all local writes through an mpsc channel. Across pods, Delta Lake’s optimistic concurrency control handles concurrent appends to the same table on shared object storage. Multiple pods can write spans simultaneously without coordination; the Delta log resolves conflicts via OCC.
Compaction is a different story. Only one pod should compact at a time, and the ControlTableEngine (a Delta-backed coordination table) enforces this via distributed task claiming. Post-compaction vacuum runs with retention_hours=0, which assumes no concurrent reader is referencing an older table version. This aggressive vacuum is safe under the current architecture because every pod refreshes its snapshot on a short interval (SCOUTER_TRACE_REFRESH_INTERVAL_SECS, default 10 seconds), but it does mean the refresh interval sets a lower bound on vacuum safety.
After each write, the engine does an atomic catalog.swap() on the DataFusion SessionContext. This replaces the TableProvider in a single operation with no deregister/register gap. Queries in flight see the old provider (consistent); new queries see the fresh data.
What gets stored at ingest vs. query time
The search_blob column is pre-computed at ingest: service name, span name, and flattened attributes concatenated into a single string. Attribute filter queries use LIKE '%key:value%' with a custom match_attr UDF. No JSON parsing at query time.
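A rough sketch of the idea behind a pre-computed search blob is shown below. The exact separator, ordering, and flattening rules Scouter uses are not specified here, so the space-separated `key:value` layout and the helper names (`flatten`, `build_search_blob`, `matches_attr`) are assumptions for illustration only.

```python
from typing import Any, Dict


def flatten(attrs: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested attribute maps into dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in attrs.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def build_search_blob(service_name: str, span_name: str, attrs: Dict[str, Any]) -> str:
    """Concatenate service name, span name, and key:value pairs into one string."""
    parts = [service_name, span_name]
    parts.extend(f"{k}:{v}" for k, v in sorted(flatten(attrs).items()))
    return " ".join(parts)


def matches_attr(blob: str, key: str, value: Any) -> bool:
    """Substring test, the in-memory analogue of LIKE '%key:value%'."""
    return f"{key}:{value}" in blob


blob = build_search_blob(
    "checkout", "agent.run", {"http": {"method": "GET"}, "retries": 2}
)
```

The work of walking the attribute map happens once at ingest, so a filter at query time is a plain string scan.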
Span hierarchy fields (depth, span_order, path, root_span_id) are not stored. They’re computed at query time via a Rust DFS traversal. This matches how Jaeger and Zipkin handle it, and avoids ingest ordering dependencies (child spans can arrive before parents).
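The query-time traversal can be sketched in a few lines. This is a Python illustration of a depth-first build, not the Rust implementation: the dictionary span shape, the ordering of siblings by start_time, and the treatment of orphans as roots are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def build_span_tree(spans: List[dict]) -> List[dict]:
    """Return spans in DFS order with depth, span_order, path, and root_span_id."""
    known_ids = {s["span_id"] for s in spans}
    children: Dict[str, List[dict]] = defaultdict(list)
    roots: List[dict] = []
    for span in spans:
        parent = span.get("parent_span_id")
        if parent is None or parent not in known_ids:
            roots.append(span)  # true root, or an orphan whose parent is missing
        else:
            children[parent].append(span)

    ordered: List[dict] = []

    def visit(span: dict, depth: int, path: List[str], root_id: str) -> None:
        here = path + [span["span_id"]]
        ordered.append(
            {
                **span,
                "depth": depth,
                "span_order": len(ordered),
                "path": here,
                "root_span_id": root_id,
            }
        )
        for child in sorted(children[span["span_id"]], key=lambda c: c["start_time"]):
            visit(child, depth + 1, here, root_id)

    for root in sorted(roots, key=lambda r: r["start_time"]):
        visit(root, 0, [], root["span_id"])
    return ordered
```

Because the input order does not matter, a child span that was ingested before its parent still lands in the right place.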
Trace read path
Queries go through DataFusion, which handles partition pruning, bloom filter evaluation, and predicate pushdown against the Delta Lake table.
sequenceDiagram
participant C as Client
participant API as API route
participant DF as DataFusion SessionContext
participant DL as Delta Lake (object store)
participant CS as CachingStore
C->>API: GET /scouter/v1/traces/{trace_id}
API->>DF: SELECT * FROM trace_spans WHERE trace_id = ? AND start_time BETWEEN ? AND ?
DF->>DL: Partition pruning (partition_date)
DF->>CS: Row group pruning (bloom filter on trace_id)
CS-->>DF: Cached or fetched Parquet ranges
DF-->>API: RecordBatch results
API->>API: build_span_tree() (Rust DFS)
API-->>C: Vec of Span with hierarchy
Three pruning layers fire in sequence:
- Partition pruning. The partition_date column (Hive-style partitioning) eliminates entire date directories. A query for “last 2 hours” skips all other days.
- Bloom filter pruning. Each row group carries a bloom filter on trace_id (FPP 0.01). For single-trace lookups, this skips ~99% of the row groups that do not contain the trace within the surviving partitions. DataFusion’s reorder_filters=true ensures the bloom filter evaluates before more expensive predicates.
- Column statistics. Page-level min/max statistics on start_time and status_code further narrow the scan within surviving row groups.
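The bloom filter layer can be illustrated with a small, self-contained filter. This is a teaching sketch built on SHA-256, not Parquet's split-block bloom filter, and `BloomFilter` and `row_groups_to_scan` are hypothetical names.

```python
import hashlib
from typing import Iterator, List


class BloomFilter:
    """Probabilistic set: never misses an added key, may report false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive several bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def row_groups_to_scan(filters: List[BloomFilter], trace_id: str) -> List[int]:
    """Indexes of row groups whose filter cannot rule out the trace ID."""
    return [i for i, f in enumerate(filters) if f.might_contain(trace_id)]
```

A row group is only decoded when its filter answers "maybe", which is what lets a single-trace lookup skip most of the surviving files.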
After DataFusion returns the matching RecordBatch, a Rust DFS traversal builds the span tree: assigning depth, ordering, paths, and root span IDs. Deep traces (hundreds of spans) will be slower to assemble, but typical agent traces have tens of spans.
Caching
A CachingStore wraps the object store (S3, GCS, Azure, or local) with two caches:
- HEAD cache: file metadata lookups. 10,000 entries, 1-hour TTL.
- Range cache: Parquet byte ranges up to 2 MB. Sized by configurable memory budget, 1-hour TTL. Covers footer reads and column index lookups.
After Z-ORDER compaction, Parquet files are immutable, so caching is safe. Delete operations invalidate the relevant cache entries.
Storage architecture
Trace data lives in Delta Lake on any supported object store. Two tables, same storage path, same optimization strategy.
Object store backends
| Backend | Config | Notes |
|---|---|---|
| Local filesystem | SCOUTER_STORAGE_URI=./scouter_storage | Default. Good for development. |
| Amazon S3 | SCOUTER_STORAGE_URI=s3://bucket/path | Uses AmazonS3Builder::from_env(). Region via settings. |
| Google Cloud Storage | SCOUTER_STORAGE_URI=gs://bucket/path | Credentials via GOOGLE_ACCOUNT_JSON_BASE64. |
| Microsoft Azure | SCOUTER_STORAGE_URI=az://container/path | Dual env var naming supported (AZURE_STORAGE_ACCOUNT_NAME or AZURE_STORAGE_ACCOUNT). |
Cloud connections use pooled HTTP clients: 64 max idle connections per host, 120-second idle timeout, 30-second request timeout.
Two Delta tables
| Table | Purpose | Key columns | Partition |
|---|---|---|---|
| trace_spans | All spans from all services | trace_id, span_id, parent_span_id, service_name, span_name, start_time, duration_ms, attributes, events, links, input, output, search_blob | partition_date |
| gen_ai_spans | GenAI-specific span data | trace_id, span_id, provider_name, request_model, response_model, input_tokens, output_tokens, operation_name, agent_name, tool_name, conversation_id, + ~30 more | partition_date |
Both tables correlate on (trace_id, span_id). The base table stores everything; the GenAI table stores the typed, columnar representation of gen_ai.* attributes that would otherwise be buried in a JSON attributes map.
Parquet optimization
| Setting | Value | Why |
|---|---|---|
| Row group size | 32,768 rows | ~4 groups per 128 MB file. Enough granularity for bloom filter and statistics pruning. |
| Bloom filters | trace_id (FPP 0.01, NDV 32,768), service_name, conversation_id | Single-trace lookups skip 99% of row groups. |
| Column statistics | Page-level on start_time, status_code | Finest-grained time pruning within row groups. |
| Delta encoding | start_time, duration_ms, input_tokens, output_tokens | 4-8x compression on sorted numeric sequences. |
| Dictionary encoding | service_name, span_name, provider_name, request_model, response_model, operation_name, output_type | Low-cardinality string columns compress to integer indices. |
| Compression | ZSTD level 3 | ~40% better than Snappy at marginal decompression cost. |
Compaction and lifecycle
Z-ORDER compaction rewrites Parquet files sorted by a space-filling curve on (start_time, service_name). This co-locates data from the same service and time window, which makes time-bounded queries per service scan fewer files.
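The space-filling curve idea can be shown with a two-dimensional Morton (Z-order) key, which interleaves the bits of two integers. This is a conceptual sketch only: how Delta Lake derives sortable integers from start_time and service_name is not covered here, so the time bucket and service code inputs are assumed to be precomputed non-negative integers.

```python
from typing import List, Tuple


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton key: bit i of x goes to position 2i, bit i of y to position 2i + 1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z_order_sort(rows: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Sort (time_bucket, service_code) pairs along the Z-order curve."""
    return sorted(rows, key=lambda row: interleave_bits(row[0], row[1]))


grid = [(x, y) for y in range(2) for x in range(2)]
curve = z_order_sort(grid)
```

Rows that are close in both dimensions get nearby keys, so they tend to end up in the same file after the rewrite.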
Compaction runs on a schedule controlled by the ControlTableEngine, a Delta Lake table that coordinates background tasks across pods. Each task (compaction, vacuum, retention) uses Delta’s optimistic concurrency to claim ownership:
- Pod calls try_claim_task(). Delta OCC ensures only one pod wins.
- Winner runs compaction (target: 128 MB files). Stale claims (older than 30 minutes) are auto-reclaimed.
- Immediate vacuum follows (retention_hours=0). This is aggressive but safe because all pods refresh their Delta snapshot on a short interval, so no pod holds references to stale files for long.
- Pod releases the task and schedules the next run.
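The claim, stale-reclaim, and release steps above can be sketched with an in-memory stand-in for the control table. In Scouter the compare-and-set is provided by Delta OCC; here a plain dictionary plays that role. `ControlTable`, `Claim`, and `release_task` are hypothetical names, and only `try_claim_task` and the 30-minute threshold come from the text.

```python
from dataclasses import dataclass
from typing import Dict

STALE_AFTER_SECS = 30 * 60  # claims older than this may be taken over


@dataclass
class Claim:
    owner: str
    claimed_at: float


class ControlTable:
    """In-memory stand-in for the Delta-backed coordination table."""

    def __init__(self) -> None:
        self._claims: Dict[str, Claim] = {}

    def try_claim_task(self, task: str, pod: str, now: float) -> bool:
        """Claim the task unless another live (non-stale) claim exists."""
        current = self._claims.get(task)
        if current is not None and now - current.claimed_at <= STALE_AFTER_SECS:
            return False
        self._claims[task] = Claim(owner=pod, claimed_at=now)
        return True

    def release_task(self, task: str, pod: str) -> None:
        """Release only if the caller still owns the claim."""
        current = self._claims.get(task)
        if current is not None and current.owner == pod:
            del self._claims[task]
```

The ownership check on release matters: a pod whose stale claim was taken over must not remove the new owner's claim.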
Data retention uses a logical delete (DELETE WHERE partition_date < cutoff) followed by a vacuum to reclaim disk space. Configurable via DATA_RETENTION_PERIOD (default 30 days).
Multi-pod deployment
In a multi-pod deployment, every pod runs its own engine actor and can write spans. Delta Lake OCC handles concurrent appends. Each pod also refreshes its in-memory snapshot on an interval to pick up commits from other pods:
| Variable | Default | Effect |
|---|---|---|
| SCOUTER_TRACE_REFRESH_INTERVAL_SECS | 10 | How often each pod calls update_incremental() to pick up new commits from shared object storage. |
Lower values mean faster cross-pod visibility. Higher values reduce object-store LIST calls. The refresh updates the DataFusion TableProvider via an atomic catalog.swap(). This interval also sets the floor for vacuum safety: if a pod hasn’t refreshed before vacuum runs, it could hold references to deleted files. At the default 10-second interval, this is not a practical concern.
DataFusion session configuration
The DataFusion SessionContext is tuned for trace workloads:
| Setting | Value | Purpose |
|---|---|---|
| Target partitions | CPU cores (min 4) | Parallel execution across row groups |
| Batch size | 8,192 rows | In-memory record batch buffer |
| Parquet pruning | enabled | Row group filtering via statistics |
| Pushdown filters | enabled | Push predicates into Parquet reader |
| Reorder filters | enabled | Evaluate bloom filters before expensive predicates |
| Bloom filter on read | enabled | Consult bloom filters before decoding row groups |
| Schema force view types | enabled | Keep Utf8View/BinaryView zero-copy |
| Metadata size hint | 1 MB | Reduce cloud round-trips for Parquet footers |
| Meta fetch concurrency | 64 | Parallel file stats fetching (matches connection pool) |
Trace-aware evaluation architecture
Evaluation does not depend on tracing by default. span.attach_eval() creates an EvalRecord, fills in trace_id and span_id from the active span, and stamps the span with scouter.eval.record_uid and scouter.eval.profile_uid. That is correlation metadata. It does not, by itself, make the record wait for trace storage.
The server decides whether trace availability gates the workflow by reading the registered AgentEvalProfile. If the profile has no TraceAssertionTask, the record is inserted as pending immediately and evaluated like any other online record. The trace IDs remain useful for later lookup, but they do not slow down evaluation.
The inbox pattern exists for one case: profiles that include TraceAssertionTask. Those tasks read spans from Delta Lake, so Scouter must make sure the specific anchor span is committed before the poller runs the workflow. For those profiles, readiness means the exact (record_uid, trace_id, span_id) anchor has committed to Delta Lake, not merely that some span with the same trace_id exists.
The important split for trace-assertion workflows is this: Delta Lake is the source of truth, PostgreSQL is the eval workflow state machine, and the in-memory channel is only a latency path. If the channel drops an anchor event, correctness still comes from reconciliation scanning Delta and recreating the queue row.
End-to-end inbox flow
flowchart TD
subgraph APP["Application process"]
SP["Active span"]
AE["span.attach_eval()<br/>creates EvalRecord<br/>stamps span metadata"]
ER["EvalRecord<br/>context + trace_id + span_id"]
END["span.end()"]
SP --> AE --> ER
SP --> END
end
subgraph EVAL_INGEST["Eval record ingestion"]
MR["Message router"]
INSERT["insert agent_eval_record"]
NEEDS_TRACE{"workflow includes<br/>TraceAssertionTask?"}
PROBE["probe trace_commit_event<br/>by anchor tuple"]
FAIL_TRACE["failed(EvalRequiresTrace)"]
FAIL_ANCHOR["failed(EvalRequiresAnchorSpan)"]
FAIL_TIMEOUT["failed(TraceArrivalTimeout)"]
AWAIT["agent_eval_record<br/>status = awaiting_trace"]
PENDING_FAST["agent_eval_record<br/>status = pending"]
PENDING_TRACE["agent_eval_record<br/>status = pending<br/>ready_at = now + visibility buffer"]
end
subgraph TRACE_INGEST["Trace ingestion"]
EXPORT["BatchSpanProcessor / exporter"]
TRACE_CH["trace channel"]
BUFFER["TraceSpanService<br/>buffer actor"]
ENGINE["TraceSpanDBEngine<br/>Arrow + Delta append"]
EXTRACT["extract committed anchors<br/>only spans with record_uid + profile_uid"]
end
subgraph STORES["Durable stores"]
DELTA["Delta Lake trace_spans<br/>source of truth"]
QUEUE["PostgreSQL trace_commit_event<br/>durable queue + audit log"]
end
subgraph LIVE["Live latency path"]
COMMIT_TX["bounded commit_tx<br/>try_send, may drop"]
WRITER["inbox writer<br/>ON CONFLICT DO NOTHING"]
end
subgraph WORKERS["Background workers"]
DRAIN["inbox drain<br/>claim pending rows<br/>FOR UPDATE SKIP LOCKED"]
COMPLETE["complete owned rows<br/>claim_token fenced"]
RECON["reconciliation sweep<br/>old awaiting_trace rows"]
LEASE["lease recovery<br/>stale processing rows"]
TIMEOUT["timeout sweep<br/>TraceArrivalTimeout"]
POLLER["AgentPoller<br/>SELECT pending rows"]
EVAL["task DAG + TraceAssertionTask"]
RESULTS["workflow results + alerts"]
end
ER --> MR --> INSERT --> NEEDS_TRACE
NEEDS_TRACE -->|"no"| PENDING_FAST
NEEDS_TRACE -->|"yes, missing trace_id"| FAIL_TRACE
NEEDS_TRACE -->|"yes, trace_id present,<br/>missing span_id"| FAIL_ANCHOR
NEEDS_TRACE -->|"yes, complete anchor"| PROBE
PROBE -->|"exact anchor queued"| PENDING_TRACE
PROBE -->|"not seen yet"| AWAIT
END --> EXPORT --> TRACE_CH --> BUFFER --> ENGINE --> DELTA
ENGINE --> EXTRACT --> COMMIT_TX
COMMIT_TX -->|"delivered"| WRITER --> QUEUE
COMMIT_TX -.->|"dropped / pod died"| RECON
AWAIT --> RECON
RECON -->|"query Delta by record_uid,<br/>trace_id, span_id, time window"| DELTA
RECON -->|"insert missing anchors"| QUEUE
QUEUE --> DRAIN --> COMPLETE
COMPLETE -->|"flip matching anchor tuple only"| PENDING_TRACE
QUEUE --> LEASE -->|"retry or dead_lettered"| QUEUE
AWAIT --> TIMEOUT -->|"older than timeout"| FAIL_TIMEOUT
PENDING_FAST --> POLLER
PENDING_TRACE -->|"after ready_at"| POLLER --> EVAL
EVAL -->|"fetch spans by trace_id"| DELTA
EVAL --> RESULTS
Read the diagram by concern:
- Eval record ingestion writes agent_eval_record. If the workflow has no TraceAssertionTask, the record goes straight to pending. If trace assertions are present, ingestion never guesses that a trace is ready because some span in the trace exists; it either finds the exact anchor tuple or leaves the record in awaiting_trace.
- Trace ingestion writes spans to Delta first. Only after the Delta commit succeeds does the engine extract anchor spans and send them toward the inbox writer.
- The live path is deliberately best-effort. commit_tx is bounded and uses try_send; dropping an event is acceptable because the event can be rebuilt from Delta.
- The durable queue is trace_commit_event. The inbox worker claims queue rows, completes the rows it owns, and flips only eval rows that match the returned (record_uid, trace_id, span_id) anchor.
- Reconciliation is the correctness path. It scans older awaiting_trace rows, queries Delta for their anchor spans, and inserts any missing queue rows. It does not update agent_eval_record directly.
- The poller evaluates pending records. Only records that were gated for trace assertions receive the ready_at visibility buffer; normal eval records do not wait on tracing.
Two server-side paths after insertion
Both examples below create eval records for the same online evaluation system. The difference is whether the record includes trace correlation metadata.
Trace-correlated record:
```python
with tracer.start_as_current_span("agent.callback") as span:
    span.attach_eval(
        profile_uid=profile.config.uid,
        context={"query": query, "response": response},
        record_id="turn-7",      # optional: scenario/turn/step ID
        session_id="sess-123",   # optional: session/conversation ID
        media=None,              # optional: multimodal evidence
        tags=["run=abc"],        # optional: key=value tags
    )
```

- attach_eval() creates the EvalRecord; the server still owns workflow routing.
- If the registered profile has no TraceAssertionTask, the record is inserted as pending.
- If the profile has TraceAssertionTask, the server checks trace_commit_event for the exact (record_uid, trace_id, span_id, profile_uid) anchor.
- If the anchor queue row already exists, the record is inserted as pending with a visibility delay.
- If the anchor has not been delivered yet, the record is inserted as awaiting_trace.
Content-only record:
```python
queue.insert(EvalRecord(
    profile_uid=profile.config.uid,
    context={"query": query, "response": response},
    record_id="turn-7",
    session_id="sess-123",
))
```

- No trace_id or span_id is attached.
- If the profile has no TraceAssertionTask, the record is inserted as pending.
- If the profile has TraceAssertionTask, the record fails with failed(EvalRequiresTrace) because the workflow asked for span data that this record cannot provide.
Trace arrival coordination
The trace_commit_event table is a durable queue and audit log for anchor spans. It is only used to unblock eval records waiting on trace assertions. Rows move through pending → processing → processed | dead_lettered and carry attempt_count, claimed_at, claimed_by, claim_token, and last_error. The inbox worker claims rows with FOR UPDATE SKIP LOCKED, completes owned queue rows, then flips only eval rows that match the returned anchor tuple.
Delta Lake remains the source of truth. The live channel from the engine actor is a latency optimization: it uses bounded try_send, may drop when full or closed, and increments scouter_trace_commit_event_channel_drop_total. A reconciliation sweep recovers correctness by scanning old awaiting_trace eval rows, querying Delta with bounded time predicates, and inserting missing queue rows with ON CONFLICT DO NOTHING. Reconciliation never writes agent_eval_record.
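The relationship between the best-effort channel and the reconciliation sweep can be sketched as a toy model. `LiveChannel` and `reconcile` are hypothetical names; Python's queue.Queue stands in for the bounded Rust channel, and plain sets stand in for Delta Lake and the trace_commit_event table.

```python
import queue
from typing import List, Set, Tuple

Anchor = Tuple[str, str, str]  # (record_uid, trace_id, span_id)


class LiveChannel:
    """Bounded, non-blocking channel that counts what it drops."""

    def __init__(self, capacity: int) -> None:
        self._queue: "queue.Queue[Anchor]" = queue.Queue(maxsize=capacity)
        self.dropped = 0  # analogue of the channel drop counter metric

    def try_send(self, anchor: Anchor) -> bool:
        try:
            self._queue.put_nowait(anchor)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self) -> List[Anchor]:
        out: List[Anchor] = []
        while True:
            try:
                out.append(self._queue.get_nowait())
            except queue.Empty:
                return out


def reconcile(awaiting: List[Anchor], delta: Set[Anchor], queue_rows: Set[Anchor]) -> int:
    """Insert queue rows for anchors committed in Delta but missing from the queue."""
    inserted = 0
    for anchor in awaiting:
        if anchor in delta and anchor not in queue_rows:  # ON CONFLICT DO NOTHING
            queue_rows.add(anchor)
            inserted += 1
    return inserted
```

The sender never blocks on the channel, and any anchor it drops is still recoverable because the commit to Delta happened first.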
Failure modes:
| Scenario | Behavior |
|---|---|
| Record has trace IDs but workflow has no trace assertions | Record inserts as pending; tracing is correlation metadata only |
| Anchor commits before record arrives | Short-circuit: record inserted as pending by probing the exact anchor |
| Anchor commits after record inserted as awaiting_trace | Inbox worker completes the owned queue row, then flips the eval row that matches (record_uid, trace_id, span_id) to pending |
| Live anchor channel drops | Reconciliation finds the anchor in Delta and inserts the missing queue row |
| Worker dies while processing | Lease recovery returns the row to pending or moves it to dead_lettered after max attempts |
| Trace never commits (client crashed, network partition) | Record remains awaiting_trace for 5 minutes, then marked failed(TraceArrivalTimeout) |
| Profile requires trace assertions but record has no trace_id | Record fails immediately with failed(EvalRequiresTrace) |
| Profile requires trace assertions and record has trace_id but no span_id | Record fails immediately with failed(EvalRequiresAnchorSpan) |
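The insert-time rows of the table above reduce to a small decision function. This is a sketch of the routing rules as described in this section, not the server's code; `route_eval_record` and its parameters are hypothetical.

```python
from typing import Optional, Tuple


def route_eval_record(
    has_trace_assertion: bool,
    trace_id: Optional[str],
    span_id: Optional[str],
    anchor_queued: bool,
) -> Tuple[str, Optional[str]]:
    """Return (status, detail) for a newly inserted eval record."""
    if not has_trace_assertion:
        # Trace IDs, if present, are correlation metadata only.
        return ("pending", None)
    if not trace_id:
        return ("failed", "EvalRequiresTrace")
    if not span_id:
        return ("failed", "EvalRequiresAnchorSpan")
    if anchor_queued:
        # Exact anchor already seen: pending, but delayed by the visibility buffer.
        return ("pending", "visibility_buffer")
    return ("awaiting_trace", None)
```

Everything that happens after insertion (inbox flips, reconciliation, timeouts) only concerns records that left this function as awaiting_trace.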
Scheduled maintenance
Inbox worker: Claims pending queue rows, completes rows fenced by claim_token, and flips only eval rows matching the returned (record_uid, trace_id, span_id) anchors to pending.
Reconciliation sweep: Scans old awaiting_trace rows and queries Delta for their anchor spans. It inserts missing trace_commit_event rows only.
Lease recovery sweep: Recovers stuck processing rows after the lease TTL. Rows are retried until max_attempts, then kept as dead_lettered for operator triage.
Timeout sweep: Background task that scans awaiting_trace rows older than 5 minutes (configurable via TRACE_ARRIVAL_TIMEOUT_SECS). Marks them as failed(TraceArrivalTimeout) and emits a metric for alerting.
Prune sweep: Removes old processed queue rows after 24 hours. Pending, processing, and dead-lettered rows are not pruned automatically.
Operators should alert on sustained channel drops, non-zero dead-lettered rows, rising pending count, and oldest pending age above the reconciliation interval. The main metrics are scouter_trace_commit_event_pending_count, scouter_trace_commit_event_oldest_pending_age_seconds, scouter_trace_commit_event_channel_drop_total, scouter_trace_commit_event_reconciled_total, scouter_trace_commit_event_dead_lettered_total, scouter_trace_commit_event_lease_recovered_total, and the claim/flip histograms.
Migration note: 20260512000000_trace_commit_anchor_queue.sql is forward-only. Stop the server during the migration; the old trace-level inbox rows are intentionally dropped and any remaining awaiting_trace rows are recovered by reconciliation or failed by the timeout sweep.
Configuration
| Variable | Default | Effect |
|---|---|---|
| TRACE_ARRIVAL_TIMEOUT_SECS | 300 | How long to wait for a trace to commit before failing the record |
| INBOX_POLL_INTERVAL_SECS | 10 | How often the inbox worker checks for new commit events |
| TRACE_COMMIT_EVENT_RETENTION_HOURS | 24 | How long to keep processed commit event rows before pruning |
| SCOUTER_TRACE_REFRESH_INTERVAL_SECS | 10 | How often each pod refreshes its Delta snapshot |
| SCOUTER_TRACE_VISIBILITY_BUFFER_SECS | refresh + 2 | Delay before pending trace evals are polled; startup fails if below refresh + 2 |
| SCOUTER_INBOX_RECONCILE_AFTER_SECS | 15 | How old an awaiting_trace row must be before reconciliation scans for it |
| SCOUTER_INBOX_RECONCILE_LOOKBACK_SECS | 86400 | Maximum supported anchor span start lookback used by reconciliation Delta queries |
| SCOUTER_INBOX_RECONCILE_INTERVAL_SECS | 60 | Reconciliation tick interval |
| SCOUTER_INBOX_RECONCILE_BATCH | 200 | Awaiting rows scanned per reconciliation tick |
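The constraint between the refresh interval and the visibility buffer can be expressed as a startup check. The sketch below is illustrative Python, not the server's configuration code; `load_trace_settings` and the returned key names are hypothetical, while the variable names and defaults come from the table above.

```python
import os
from typing import Dict, Mapping


def load_trace_settings(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Read trace timing settings and enforce buffer >= refresh + 2."""
    refresh = int(env.get("SCOUTER_TRACE_REFRESH_INTERVAL_SECS", "10"))
    min_buffer = refresh + 2
    buffer_secs = int(env.get("SCOUTER_TRACE_VISIBILITY_BUFFER_SECS", str(min_buffer)))
    if buffer_secs < min_buffer:
        # Polling before every pod has refreshed could miss committed spans.
        raise ValueError(
            f"SCOUTER_TRACE_VISIBILITY_BUFFER_SECS={buffer_secs} is below "
            f"refresh + 2 = {min_buffer}"
        )
    return {
        "refresh_interval_secs": refresh,
        "visibility_buffer_secs": buffer_secs,
        "arrival_timeout_secs": int(env.get("TRACE_ARRIVAL_TIMEOUT_SECS", "300")),
    }
```

Raising the refresh interval without setting the buffer keeps the two in step, since the buffer default is derived from the refresh value.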
Same data, unified evaluation
Scouter’s observability engine and evaluation pipeline share the same trace data, but only trace-assertion workflows are gated on it. Spans in Delta Lake feed TraceAssertionTask evaluations. Records for profiles without trace assertions can still carry trace IDs for lookup, but they are evaluated without waiting for trace storage.
For trace-assertion workflows, the trace storage, evaluation engine, and alert pipeline become one data path: spans arrive, anchor events unblock matching records, evaluations run, results get stored, and alerts fire.
For Delta Lake schema details and compaction internals, see Storage architecture.