
Storage architecture

Scouter’s server stores OTel spans in a Delta Lake + Apache DataFusion columnar engine, replacing Postgres-only trace storage. Spans are ingested at high throughput via a dual-actor write pipeline and queried via DataFusion SQL — giving you sub-second reads, efficient compaction, and cloud-native object store support.


```mermaid
flowchart LR
    subgraph CL ["Client"]
        APP["Application code"]
        EX[ScouterSpanExporter]
    end

    subgraph SI ["Server ingestion"]
        TP["Transport layer"]
        WK["Consumer worker"]
        MH[MessageHandler]
    end

    subgraph SL ["Storage layer"]
        BUF["Buffer actor\n10K cap · 5s flush"]
        ENG["Engine actor\nsingle writer"]
        TS[("Delta Lake\ntrace_spans")]
        SUM["TraceSummaryService\nwrite_summaries"]
        TSM[("Delta Lake\ntrace_summaries")]
    end

    APP --> EX
    EX --> TP --> WK --> MH
    MH -->|"write_spans"| BUF
    BUF -->|"TableCommand::Write"| ENG
    ENG -->|"build_batch\nappend"| TS
    MH -->|"write_summaries"| SUM
    SUM --> TSM
```

Every span batch is processed by the MessageHandler, which fans out to two Delta Lake destinations:

  1. Delta Lake trace_spans: full span data, via the dual-actor buffer/engine pipeline
  2. Delta Lake trace_summaries: one row per trace, updated as spans arrive via TraceCache, which accumulates per-trace updates in memory and writes them out on each flush (see the sketch below)
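
The fan-out itself is a small routing step. The sketch below is illustrative only: `TraceSpanRecord`, `write_spans`, and `write_summaries` are names from this document, but the stub types and signatures are simplified stand-ins, not the actual handler in `sql/postgres.rs`.

```rust
// Illustrative stand-ins for the real services and record type.
struct TraceSpanRecord;
struct TraceSpanService;
struct TraceSummaryService;

impl TraceSpanService {
    // Enqueues full span data onto the buffer actor (hypothetical signature).
    async fn write_spans(&self, _spans: Vec<TraceSpanRecord>) {}
}
impl TraceSummaryService {
    // Folds spans into per-trace summaries via TraceCache (hypothetical signature).
    async fn write_summaries(&self, _spans: &[TraceSpanRecord]) {}
}

// The fan-out performed for every consumed span batch.
async fn handle_span_batch(
    spans: Vec<TraceSpanRecord>,
    span_service: &TraceSpanService,
    summary_service: &TraceSummaryService,
) {
    // Destination 2: per-trace rollups accumulated by TraceCache -> trace_summaries.
    summary_service.write_summaries(&spans).await;
    // Destination 1: full span data -> buffer actor -> engine actor -> trace_spans.
    span_service.write_spans(spans).await;
}
```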

| Component | Crate | File | Purpose |
|---|---|---|---|
| TraceSpanService | scouter_dataframe | parquet/tracing/service.rs | Global singleton; owns both actors and the query service |
| TraceSpanDBEngine | scouter_dataframe | parquet/tracing/engine.rs | Delta Lake single-writer actor |
| TraceSpanBatchBuilder | scouter_dataframe | parquet/tracing/engine.rs | Zero-copy Arrow serialization of TraceSpanRecord |
| TraceQueries | scouter_dataframe | parquet/tracing/queries.rs | DataFusion query execution and DFS span tree assembly |
| TraceSummaryService | scouter_dataframe | parquet/tracing/summary.rs | Hour-bucketed summary table with cursor pagination |
| CachingStore | scouter_dataframe | caching_store.rs | ObjectStore wrapper; caches head() and small range reads (≤2 MB) for immutable Parquet files |
| ObjectStore | scouter_dataframe | storage.rs | Constructs the concrete cloud/local store wrapped in CachingStore; builds the tuned SessionContext |
| Transport Layer (gRPC/HTTP/Kafka/RabbitMQ/Redis) | scouter_server / scouter_events | api/grpc/message.rs, api/routes/, events/ | Span ingestion handlers across all supported transports |
| MessageHandler | scouter_sql | sql/postgres.rs | Consumer worker; routes spans to TraceSpanService and TraceSummaryService |

TraceSpanService starts two long-lived Tokio tasks on initialization: a buffer actor and an engine actor.

The buffer actor collects incoming TraceSpanRecord batches in memory and triggers a flush when either condition is met:

  • Capacity: buffer reaches 10,000 spans
  • Time: flush interval elapses (default: 5 seconds, configurable via SCOUTER_TRACE_FLUSH_INTERVAL_SECS)

On flush, the buffer actor drains itself and sends a TableCommand::Write message to the engine actor, then awaits acknowledgment via a oneshot channel.
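
A minimal sketch of that control loop, using plain Tokio primitives and simplified stand-in types rather than the real actor in scouter_dataframe:

```rust
use tokio::sync::{mpsc, oneshot};
use tokio::time::{interval, Duration};

// Illustrative stand-ins; the real types live in scouter_dataframe.
struct TraceSpanRecord;
enum TableCommand {
    Write { spans: Vec<TraceSpanRecord>, respond_to: oneshot::Sender<()> },
}

// Buffer actor loop: flush on capacity (10K spans) or on the flush-interval
// tick (5 s default), whichever comes first.
async fn run_buffer_actor(
    mut rx: mpsc::Receiver<Vec<TraceSpanRecord>>,
    engine_tx: mpsc::Sender<TableCommand>,
    capacity: usize,
    flush_every: Duration,
) {
    let mut buffer: Vec<TraceSpanRecord> = Vec::with_capacity(capacity);
    let mut ticker = interval(flush_every);
    loop {
        tokio::select! {
            Some(batch) = rx.recv() => {
                buffer.extend(batch);
                if buffer.len() >= capacity {
                    flush(&mut buffer, &engine_tx).await; // capacity-triggered flush
                }
            }
            _ = ticker.tick() => {
                if !buffer.is_empty() {
                    flush(&mut buffer, &engine_tx).await; // time-triggered flush
                }
            }
        }
    }
}

// Drain the buffer, hand the batch to the single-writer engine actor,
// and wait for its acknowledgment on a oneshot channel.
async fn flush(buffer: &mut Vec<TraceSpanRecord>, engine_tx: &mpsc::Sender<TableCommand>) {
    let (tx, rx) = oneshot::channel();
    let spans = std::mem::take(buffer);
    let _ = engine_tx.send(TableCommand::Write { spans, respond_to: tx }).await;
    let _ = rx.await;
}
```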

The engine actor is the single writer for the Delta Lake table. It:

  1. Receives TableCommand::Write { spans, respond_to }
  2. Calls build_batch() on TraceSpanBatchBuilder to produce an Arrow RecordBatch
  3. Calls write_spans() — acquires a write lock, refreshes the table state, appends the batch
  4. Re-registers the updated table in the shared SessionContext so queries immediately see new data
  5. Sends Ok(()) back on the oneshot channel

Why this pattern? Delta Lake requires a single writer per table to avoid log conflicts. The two-actor design amortizes object-store I/O (many small span batches → fewer larger Parquet files per flush) while keeping the write lock duration minimal.
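
For orientation, the append path (steps 2–4) can be sketched with the deltalake crate's DeltaOps write API and DataFusion table registration. This is an assumption-laden sketch, not the actual engine code: the real implementation wraps the commit in its write lock, refreshes table state first, and uses its own error types.

```rust
use std::sync::Arc;
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::prelude::SessionContext;
use deltalake::{DeltaOps, DeltaTable};

// Sketch: append one Arrow batch as a Delta commit, then refresh the
// SessionContext registration so queries see the new version immediately.
async fn append_and_reregister(
    ctx: &SessionContext,
    table: DeltaTable,
    batch: RecordBatch,
) -> Result<DeltaTable, Box<dyn std::error::Error>> {
    // Append the batch as a new Delta commit (SaveMode::Append is the default).
    let table = DeltaOps(table).write(vec![batch]).await?;
    // Re-register the updated snapshot under the same table name.
    ctx.deregister_table("trace_spans")?;
    ctx.register_table("trace_spans", Arc::new(table.clone()))?;
    Ok(table)
}
```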

The engine actor runs two background tokio::time::interval tickers alongside command processing:

```rust
loop {
    tokio::select! {
        cmd = rx.recv() => { /* handle Write / Optimize / Vacuum / Shutdown */ }
        _ = compaction_ticker.tick() => { /* run Z-ORDER optimize automatically */ }
        _ = refresh_ticker.tick() => { /* update_incremental(); re-register SessionContext if the version advanced */ }
    }
}
```

The refresh ticker is what keeps reader pods current in multi-pod deployments. See Multi-Pod Deployments below.


The trace_spans table stores one row per span. Hierarchy fields (depth, span_order, path, root_span_id) are not stored; they are computed at query time by build_span_tree() via Rust DFS traversal. This matches the Jaeger/Zipkin model and avoids ordering dependencies during ingest (spans may arrive out of order within a batch).

| Column | Arrow Type | Nullable | Notes |
|---|---|---|---|
| trace_id | FixedSizeBinary(16) | No | W3C 128-bit trace ID |
| span_id | FixedSizeBinary(8) | No | W3C 64-bit span ID |
| parent_span_id | FixedSizeBinary(8) | Yes | Null on root spans |
| flags | Int32 | No | W3C trace flags |
| trace_state | Utf8 | No | W3C trace state header |
| scope_name | Utf8 | No | Instrumentation scope |
| scope_version | Utf8 | Yes | Instrumentation scope version |
| service_name | Dictionary<Int32, Utf8> | No | Dictionary-encoded; high repetition |
| span_name | Utf8 | No | Operation name |
| span_kind | Dictionary<Int8, Utf8> | Yes | Dictionary-encoded; SERVER/CLIENT/etc. |
| start_time | Timestamp(Microsecond, UTC) | No | Sub-millisecond precision |
| end_time | Timestamp(Microsecond, UTC) | No | Sub-millisecond precision |
| duration_ms | Int64 | No | Pre-computed for fast aggregation |
| status_code | Int32 | No | OTel status code (0=Unset, 1=OK, 2=Error) |
| status_message | Utf8 | Yes | |
| label | Utf8 | Yes | Scouter-specific span label |
| attributes | Map<Utf8, Utf8View> | No | Key-value span attributes |
| resource_attributes | Map<Utf8, Utf8View> | Yes | Resource-level attributes |
| events | List<Struct{name, timestamp, attributes, dropped_count}> | No | Span events |
| links | List<Struct{trace_id, span_id, trace_state, attributes, dropped_count}> | No | Span links |
| input | Utf8View | Yes | Captured function input (JSON) |
| output | Utf8View | Yes | Captured function output (JSON) |
| search_blob | Utf8View | No | Pre-computed search string |
| partition_date | Date32 | No | Hive partition column (days since Unix epoch); derived from start_time at ingest |
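
As a reference point, a few of these columns expressed as arrow-rs fields. This is an excerpt only, written for illustration; it is not the project's schema code, and the remaining map, list, and dictionary columns are omitted.

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};

// Excerpt of the trace_spans schema, mirroring the table above.
fn trace_spans_schema_excerpt() -> Schema {
    Schema::new(vec![
        Field::new("trace_id", DataType::FixedSizeBinary(16), false),
        Field::new("span_id", DataType::FixedSizeBinary(8), false),
        Field::new("parent_span_id", DataType::FixedSizeBinary(8), true),
        Field::new(
            "service_name",
            DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
            false,
        ),
        Field::new(
            "start_time",
            DataType::Timestamp(TimeUnit::Microsecond, Some(Arc::from("UTC"))),
            false,
        ),
        Field::new("duration_ms", DataType::Int64, false),
        Field::new("status_code", DataType::Int32, false),
        Field::new("search_blob", DataType::Utf8View, false),
        Field::new("partition_date", DataType::Date32, false),
    ])
}
```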

Key design decisions:

| Decision | Rationale |
|---|---|
| Hierarchy not stored | Avoids ordering dependencies during ingest; DFS is computed in Rust at query time |
| Dictionary on service_name, span_kind | High repetition across spans → significant compression savings |
| Utf8View for input, output, search_blob | Large JSON payloads; Arrow StringView reduces heap copies for long strings |
| search_blob pre-computed at ingest | Concatenates service name, span name, scope name, attributes, and events into a single string; avoids JSON re-parsing on every attribute filter query |
| Timestamp(Microsecond, UTC) | Sub-millisecond precision with explicit timezone; matches the OTel wire format |
| FixedSizeBinary for IDs | Compact binary representation; avoids hex string parsing in the hot path |
| partition_date Hive partition | Enables partition pruning; time-filtered queries skip entire date directories without opening any of their Parquet files |
| DataSkippingStatsColumns restricted | Delta Lake collects min/max stats only for start_time, end_time, service_name, duration_ms, status_code, partition_date; avoids inflating the Delta log with statistics on wide map/list columns |
| Bloom filters on trace_id, service_name, span_name | Written at Parquet row-group level; skip ~99% of row groups for equality lookups in the hot query path |

The trace_summaries table stores one row per trace, hour-bucketed. It is used exclusively for the paginated trace list, which avoids a full trace_spans scan for list views.

| Column | Arrow Type | Nullable |
|---|---|---|
| trace_id | FixedSizeBinary(16) | No |
| service_name | Dictionary<Int32, Utf8> | No |
| scope_name | Utf8 | No |
| scope_version | Utf8 | Yes |
| root_operation | Utf8 | No |
| start_time | Timestamp(Microsecond, UTC) | No |
| end_time | Timestamp(Microsecond, UTC) | Yes |
| duration_ms | Int64 | Yes |
| status_code | Int32 | No |
| status_message | Utf8 | Yes |
| span_count | Int64 | No |
| error_count | Int64 | No |
| resource_attributes | Map<Utf8, Utf8View> | Yes |

The single-trace query retrieves all spans for one trace and assembles them into a hierarchy tree:

  1. Time filter applied first — enables Delta Lake statistics-based file pruning (skips Parquet files whose start_time range doesn’t overlap the query window)
  2. trace_id filter — binary equality pushdown
  3. RecordBatch → FlatSpan: zero-copy column extraction from Arrow arrays
  4. build_span_tree() — Rust DFS traversal assigns depth, span_order, path, and root_span_id in-memory; orphan spans (parent not in batch) are appended at the end
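
A simplified sketch of that DFS pass follows. The FlatSpan/TreeSpan types and field names are illustrative stand-ins, not the real structs; the point is that hierarchy fields are derived in memory, never stored, and orphans are appended last.

```rust
use std::collections::{HashMap, HashSet};

// Illustrative stand-ins for the real flat/tree span types.
struct FlatSpan {
    span_id: [u8; 8],
    parent_span_id: Option<[u8; 8]>,
}
struct TreeSpan {
    span_id: [u8; 8],
    root_span_id: [u8; 8],
    depth: u32,
    span_order: u32,
    path: Vec<[u8; 8]>, // ancestor span IDs, root first
}

fn build_span_tree(spans: &[FlatSpan]) -> Vec<TreeSpan> {
    // Index children by parent so the traversal is linear in the span count.
    let ids: HashSet<[u8; 8]> = spans.iter().map(|s| s.span_id).collect();
    let mut children: HashMap<[u8; 8], Vec<usize>> = HashMap::new();
    let (mut roots, mut orphans) = (Vec::new(), Vec::new());
    for (i, s) in spans.iter().enumerate() {
        match s.parent_span_id {
            None => roots.push(i),
            Some(p) if ids.contains(&p) => children.entry(p).or_default().push(i),
            Some(_) => orphans.push(i), // parent not present in this batch
        }
    }

    let mut out = Vec::with_capacity(spans.len());
    let mut order = 0u32;
    // Iterative depth-first traversal: (span index, depth, ancestor path).
    let mut stack: Vec<(usize, u32, Vec<[u8; 8]>)> =
        roots.into_iter().rev().map(|i| (i, 0, Vec::new())).collect();
    while let Some((i, depth, path)) = stack.pop() {
        let id = spans[i].span_id;
        let root = *path.first().unwrap_or(&id);
        out.push(TreeSpan { span_id: id, root_span_id: root, depth, span_order: order, path: path.clone() });
        order += 1;
        let mut child_path = path;
        child_path.push(id);
        for &c in children.get(&id).into_iter().flatten().rev() {
            stack.push((c, depth + 1, child_path.clone()));
        }
    }
    // Orphan spans go last, each treated as its own root.
    for i in orphans {
        let id = spans[i].span_id;
        out.push(TreeSpan { span_id: id, root_span_id: id, depth: 0, span_order: order, path: Vec::new() });
        order += 1;
    }
    out
}
```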

The trace-metrics query returns time-bucketed aggregates using a DataFusion CTE pipeline:

```sql
WITH
-- Optional: pre-filter by attribute (LIKE on search_blob)
matching_traces AS (
    SELECT DISTINCT trace_id FROM trace_spans
    WHERE start_time >= ? AND start_time < ?
      AND (search_blob LIKE '%key:value%' OR ...)
),
-- Aggregate per trace: duration = MAX(end_time) - MIN(start_time)
trace_level AS (
    SELECT trace_id, MIN(start_time), MAX(end_time),
           MAX(CASE WHEN parent_span_id IS NULL THEN service_name END) AS root_service,
           MAX(status_code)
    FROM trace_spans WHERE ... GROUP BY trace_id
),
service_filtered AS (...),
bucketed AS (SELECT DATE_TRUNC('hour', trace_start), duration_ms, status_code ...)
SELECT
    bucket_start,
    COUNT(*) AS trace_count,
    AVG(duration_ms),
    approx_percentile_cont(duration_ms, 0.50),
    approx_percentile_cont(duration_ms, 0.95),
    approx_percentile_cont(duration_ms, 0.99),
    AVG(CASE WHEN status_code = 2 THEN 1.0 ELSE 0.0 END) AS error_rate
FROM bucketed GROUP BY bucket_start ORDER BY bucket_start
```

Attribute filters use search_blob LIKE '%key:value%' — no JSON parsing at query time.

The trace-list query reads from trace_summaries using cursor-based pagination on (start_time, trace_id); the cursor avoids full table scans on large datasets:

  • Forward: (start_time, trace_id) < (cursor_start, cursor_id) LIMIT n
  • Backward: (start_time, trace_id) > (cursor_start, cursor_id) LIMIT n (then reverse)
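
An illustrative form of the forward page, assuming the trace_summaries columns listed above; the `:named` parameters are placeholders and the real query is assembled programmatically:

```sql
-- Forward page: the next n summaries strictly "before" the cursor
-- in (start_time, trace_id) descending order.
SELECT trace_id, service_name, root_operation, start_time,
       duration_ms, status_code, span_count, error_count
FROM trace_summaries
WHERE start_time < :cursor_start
   OR (start_time = :cursor_start AND trace_id < :cursor_id)
ORDER BY start_time DESC, trace_id DESC
LIMIT :page_size;
```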

Trace data goes through a three-phase lifecycle: small-file ingest, periodic compaction, and retention, with vacuum cleaning up after the latter two.

The buffer actor writes small Parquet files on every flush (every 10K spans or 5 seconds). File sizes depend on span payload sizes but are typically small immediately after ingest.

The engine actor runs compaction automatically on a configurable interval (default: every 24 hours). Compaction uses Delta Lake Z-ORDER on (start_time, service_name):

  • Target file size: 128 MB
  • Z-ORDER columns: start_time (time-range queries), service_name (service filter pushdown)
  • WriterProperties preserved: bloom filters on trace_id, service_name, span_name are re-specified explicitly — without this, every compaction cycle would silently discard them from rewritten files

Z-ORDER co-locates spans with similar start times and service names within each Parquet file, maximizing the effectiveness of DataFusion’s min/max statistics-based file pruning. After Z-ORDER, delta-encoded timestamps compress 4–8x within each row group.

Compaction is coordinated across pods via a control table — an optimistic-concurrency PostgreSQL table that ensures only one pod runs optimize or retention at a time, even in multi-replica deployments. The scheduler ticks every 5 minutes; the actual run schedule is persisted in next_run_at and survives pod restarts.

The engine actor can also run periodic data expiry. When retention_days is configured, it issues a logical delete against the partition_date partition column, then immediately vacuums the freed files. Because the predicate maps directly to a partition directory, Delta Lake skips all unaffected partitions.
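
Conceptually, the retention delete is a partition-only predicate. The engine issues it through Delta Lake's delete operation rather than literal SQL, but an illustrative equivalent (assuming retention_days = 30) would be:

```sql
-- The predicate references only the partition column, so whole date
-- partitions are removed logically and unaffected files are never rewritten.
DELETE FROM trace_spans
WHERE partition_date < current_date - INTERVAL '30' DAY;
```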

Retention is also coordinated via the control table (task name: trace_retention), defaulting to a 24-hour schedule.

Vacuum removes old Parquet file versions that are no longer referenced by the Delta log, freeing object storage space. It runs automatically after both compaction and retention, and can also be triggered on demand via TraceSpanService::vacuum(retention_hours); compaction is likewise available on demand via TraceSpanService::optimize().


In a K8s deployment with separate writer and reader pods (e.g. a dedicated gRPC ingest pod and one or more HTTP query pods), each pod holds its own in-memory Delta table snapshot. After the writer commits a new batch, reader pods do not automatically see the new data — their snapshot is stale until refreshed.

The refresh ticker inside TraceSpanDBEngine solves this. On every tick it:

  1. Clones the current DeltaTable guard (the original is preserved if the operation fails)
  2. Calls update_incremental() on the clone to pull new Delta log entries from shared storage
  3. Re-registers the SessionContext only if the table version advanced — no-ops on idle tables

The ticker interval is controlled by SCOUTER_TRACE_REFRESH_INTERVAL_SECS (default 10). At 10 s, a reader pod will see new data within ~10 seconds of a writer commit, with ~8,640 object-store LIST calls per day — negligible for GCS/S3.

Deployment model:

```text
GCS / S3 / Azure Blob
├── Pod A (writer) gRPC ingest → TraceSpanDBEngine (commits new files)
└── Pod B (reader) HTTP queries ← TraceSpanDBEngine (refresh_ticker pulls commits every 10 s)
```

Both pods point SCOUTER_STORAGE_URI at the same bucket path. The writer’s single-writer invariant is preserved — only one pod writes to the Delta log at a time. Reader pods never write; they only call update_incremental().

Compaction coordination (optimize + retention) is already protected by the control table — only one pod runs it at a time, regardless of how many reader pods are running the refresh ticker.

Tuning:

| Scenario | Recommended SCOUTER_TRACE_REFRESH_INTERVAL_SECS |
|---|---|
| Single pod (writer = reader) | 10 (default); refresh is a no-op when no other writer exists |
| Multi-pod, near-real-time reads | 5; halves visibility latency, doubles the LIST call rate |
| Multi-pod, cost-sensitive storage | 30–60; acceptable for dashboards that poll every minute |

The optimization PR introduced several layers of read and write improvements that apply across all storage backends.

Every storage backend (GCS, S3, Azure, local) is now wrapped in a CachingStore. After Z-ORDER compaction, Parquet files are immutable — the same path always returns the same bytes. CachingStore caches two call types:

  • head() — up to 10,000 entries, 1-hour TTL. Eliminates repeated HEAD requests DataFusion issues to check file size before opening each file.
  • get_range() — byte-addressed cache with a configurable max size (default 64 MB), 1-hour TTL. Only ranges ≤2 MB are cached; larger column-data reads (uncommon for footer-heavy queries) pass through uncached.

Cache size is configurable via SCOUTER_OBJECT_CACHE_MB. On GCS/S3 workloads where footer reads dominate query latency, this eliminates 1–2 round-trips (~30–60 ms each) per file per query.
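
A minimal sketch of the range-cache gating logic, using plain std types rather than the real CachingStore internals; the constants mirror the defaults described above, and the struct and method names are illustrative:

```rust
use std::collections::HashMap;
use std::ops::Range;
use std::time::{Duration, Instant};

const MAX_CACHEABLE_RANGE: u64 = 2 * 1024 * 1024; // only ranges <= 2 MB are cached
const TTL: Duration = Duration::from_secs(3600);  // 1-hour TTL

// Byte-addressed range cache keyed by (path, start, end), with TTL expiry
// and an overall size budget (SCOUTER_OBJECT_CACHE_MB).
struct RangeCache {
    max_bytes: u64,
    used_bytes: u64,
    entries: HashMap<(String, u64, u64), (Instant, Vec<u8>)>,
}

impl RangeCache {
    fn get(&self, path: &str, range: &Range<u64>) -> Option<Vec<u8>> {
        let key = (path.to_string(), range.start, range.end);
        match self.entries.get(&key) {
            Some((at, bytes)) if at.elapsed() < TTL => Some(bytes.clone()),
            _ => None,
        }
    }

    fn insert(&mut self, path: &str, range: Range<u64>, bytes: Vec<u8>) {
        let len = range.end - range.start;
        // Large column-data reads pass through uncached; footer-sized reads are kept.
        if len > MAX_CACHEABLE_RANGE || self.used_bytes + len > self.max_bytes {
            return;
        }
        self.used_bytes += len;
        self.entries
            .insert((path.to_string(), range.start, range.end), (Instant::now(), bytes));
    }
}
```

Cache hits are safe because compacted Parquet files are immutable: a given path never changes content, so stale reads are impossible within the TTL.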

The SessionContext is now built with a tuned SessionConfig applied to all query and compaction operations:

| Setting | Value | Effect |
|---|---|---|
| target_partitions | available_parallelism | Uses all CPU cores for parallel query execution |
| batch_size | 8192 | Controls Arrow RecordBatch size during query scans |
| prefer_existing_sort | true | Avoids re-sorting data that Z-ORDER already sorted |
| parquet_pruning | true | Enables min/max statistics-based file pruning |
| collect_statistics | true | Gathers column statistics to improve query planning |
| pushdown_filters | true | Pushes predicates into the Parquet reader; only matching rows are decoded |
| reorder_filters | true | Reorders predicates by selectivity; bloom filters (trace_id) are evaluated before range checks (start_time) |
| metadata_size_hint | 1 MB | Fetches the Parquet footer in one cloud round-trip; the 512 KB default is insufficient for files with bloom filters and page-level statistics |
| bloom_filter_on_read | true | Consults row-group bloom filters before decoding; set explicitly to guard against DataFusion version changes |
| schema_force_view_types | true | Reads Utf8 columns as Utf8View; matches the schema's StringView type and avoids a downgrade on read |
| meta_fetch_concurrency | 64 | Parallel file stat operations when listing Delta table files; matches the HTTP connection pool size |
| maximum_parallel_row_group_writers | 4 | Encodes multiple row groups concurrently during compaction and flush |
| maximum_buffered_record_batches_per_stream | 8 | Smooths bursty GCS reads by buffering more decoded batches per stream |
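
For reference, a similarly tuned SessionContext might be assembled like this. The option names assume a recent DataFusion release and this sketch is not the actual builder in storage.rs:

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn tuned_session_context() -> SessionContext {
    // Use every available core for parallel query execution.
    let cores = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let mut config = SessionConfig::new()
        .with_target_partitions(cores)
        .with_batch_size(8192)
        .with_parquet_pruning(true)
        .with_collect_statistics(true);

    let opts = config.options_mut();
    opts.optimizer.prefer_existing_sort = true;
    opts.execution.meta_fetch_concurrency = 64;
    opts.execution.parquet.pushdown_filters = true;
    opts.execution.parquet.reorder_filters = true;
    opts.execution.parquet.metadata_size_hint = Some(1024 * 1024); // 1 MB footer fetch
    opts.execution.parquet.bloom_filter_on_read = true;
    opts.execution.parquet.schema_force_view_types = true;
    opts.execution.parquet.maximum_parallel_row_group_writers = 4;
    opts.execution.parquet.maximum_buffered_record_batches_per_stream = 8;

    SessionContext::new_with_config(config)
}
```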

Cloud object stores (GCS, S3, Azure) are built with shared ClientOptions:

  • Pool idle timeout: 120 seconds
  • Max idle connections per host: 64 (matches meta_fetch_concurrency)
  • Request timeout: 30 seconds
  • Connect timeout: 5 seconds
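
A sketch of those shared options, assuming the object_store crate's ClientOptions builder:

```rust
use std::time::Duration;
use object_store::ClientOptions;

// Shared HTTP client tuning applied to GCS, S3, and Azure stores alike.
fn shared_client_options() -> ClientOptions {
    ClientOptions::new()
        .with_pool_idle_timeout(Duration::from_secs(120))
        .with_pool_max_idle_per_host(64) // matches meta_fetch_concurrency
        .with_timeout(Duration::from_secs(30))
        .with_connect_timeout(Duration::from_secs(5))
}
```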

Both flush writes and Z-ORDER compaction use the same WriterProperties:

| Setting | Value | Rationale |
|---|---|---|
| max_row_group_size | 32,768 rows | Creates ~4 row groups per 128 MB file; bloom filters and page statistics prune within files, not just across files |
| Bloom filter: trace_id | FPP 0.01, NDV 32,768 | Skips ~99% of row groups for trace_id equality lookups |
| Bloom filter: service_name | FPP 0.01, NDV 256 | Low cardinality but a hot lookup path |
| Bloom filter: span_name | FPP 0.01, NDV 32,768 | High-cardinality equality queries |
| Page stats: start_time | Page-level | Finest-grained time pruning within row groups |
| Page stats: status_code | Page-level | Prunes pages for error-only queries (no bloom filter; only 3 values) |
| Encoding: start_time, duration_ms | DELTA_BINARY_PACKED | 4–8x compression on near-sorted integers after Z-ORDER; 2–4x on durations within a service |
| Compression | ZSTD level 3 | ~40% better than SNAPPY on text columns; the marginal decompression overhead is offset by reduced I/O |
| Dictionary: span_name | enabled | High repetition, similar to service_name |
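
For reference, equivalent properties might be assembled with the parquet crate's WriterProperties builder. The method names assume a current arrow-rs/parquet release and the values mirror the table above; this is a sketch, not the project's writer configuration code:

```rust
use parquet::basic::{Compression, Encoding, ZstdLevel};
use parquet::file::properties::{EnabledStatistics, WriterProperties};
use parquet::schema::types::ColumnPath;

fn span_writer_properties() -> WriterProperties {
    WriterProperties::builder()
        .set_max_row_group_size(32_768)
        .set_compression(Compression::ZSTD(ZstdLevel::try_new(3).unwrap()))
        // Bloom filters on the hot equality-lookup columns.
        .set_column_bloom_filter_enabled(ColumnPath::from("trace_id"), true)
        .set_column_bloom_filter_fpp(ColumnPath::from("trace_id"), 0.01)
        .set_column_bloom_filter_ndv(ColumnPath::from("trace_id"), 32_768)
        .set_column_bloom_filter_enabled(ColumnPath::from("service_name"), true)
        .set_column_bloom_filter_ndv(ColumnPath::from("service_name"), 256)
        .set_column_bloom_filter_enabled(ColumnPath::from("span_name"), true)
        .set_column_bloom_filter_ndv(ColumnPath::from("span_name"), 32_768)
        // Page-level statistics for fine-grained pruning inside row groups.
        .set_column_statistics_enabled(ColumnPath::from("start_time"), EnabledStatistics::Page)
        .set_column_statistics_enabled(ColumnPath::from("status_code"), EnabledStatistics::Page)
        // Delta encoding on near-sorted integer columns.
        .set_column_encoding(ColumnPath::from("start_time"), Encoding::DELTA_BINARY_PACKED)
        .set_column_encoding(ColumnPath::from("duration_ms"), Encoding::DELTA_BINARY_PACKED)
        .set_column_dictionary_enabled(ColumnPath::from("span_name"), true)
        .build()
}
```

Re-specifying these properties during compaction matters: as noted above, a Z-ORDER rewrite that omitted them would silently drop the bloom filters from the rewritten files.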

| Environment Variable | Default | Description |
|---|---|---|
| SCOUTER_STORAGE_URI | ./scouter_storage | Object store root. Supports s3://, gs://, az://, or a local path |
| SCOUTER_TRACE_COMPACTION_INTERVAL_HOURS | 24 | How often automatic Z-ORDER compaction runs |
| SCOUTER_TRACE_FLUSH_INTERVAL_SECS | 5 | How often the span buffer flushes to Delta Lake |
| SCOUTER_TRACE_BUFFER_SIZE | 10000 | Span buffer capacity before a flush is triggered |
| SCOUTER_TRACE_REFRESH_INTERVAL_SECS | 10 | How often each pod refreshes its Delta table snapshot from shared storage; controls cross-pod visibility latency in multi-pod deployments |
| SCOUTER_OBJECT_CACHE_MB | 64 | Maximum size of the in-process object store range cache (MB) |
| AWS_REGION | us-east-1 | Required when using S3 storage |

The storage layer uses the ObjectStore abstraction, supporting all major cloud providers and local filesystems with no code changes. Every backend is wrapped in CachingStore — the caching layer is transparent to the rest of the system.

| Backend | URI Prefix | Notes |
|---|---|---|
| Local filesystem | ./path or /abs/path | Default; good for development |
| Amazon S3 | s3://bucket/prefix | Requires AWS_REGION and standard AWS credentials |
| Google Cloud Storage | gs://bucket/prefix | Uses Application Default Credentials; also accepts GOOGLE_ACCOUNT_JSON_BASE64 for service account key injection |
| Azure Blob Storage | az://container/prefix | Accepts both AZURE_STORAGE_ACCOUNT_NAME/AZURE_STORAGE_ACCOUNT_KEY (object_store convention) and AZURE_STORAGE_ACCOUNT/AZURE_STORAGE_KEY (az CLI/Terraform convention) |

The same Delta Lake protocol and DataFusion query engine run identically across all backends.