Bifrost overview
Reading and writing high volumes of data in an AI application should be easy, but it’s not. The ecosystem is fragmented, and the engineering burden is high. You can use a managed data warehouse, but that adds latency and cost. You can write to object storage, but then you have to manage schemas, partitions, and query engines yourself.
None of these give you what you actually want: define a schema, push data, query it later — without managing infrastructure.
What Bifrost Does
Section titled “What Bifrost Does”Scouter Bifrost turns a Pydantic model into a Delta Lake table. You define the schema in Python, push records through a high-throughput queue, and query with SQL — getting results back as Arrow tables, Polars DataFrames, Pandas DataFrames, or validated Pydantic model instances.
Write: Pydantic model → Arrow schema → gRPC batches → Delta Lake (you define) (automatic) (background) (server-side)
Read: SQL query → DataFusion → Arrow IPC → Arrow / Polars / Pandas (you write) (server) (zero-copy) (your choice)The write path is designed for production inference loops: sub-microsecond insert() calls, automatic batching, and backpressure handling. No await, no blocking, no GIL contention.
The read path is designed for analytics: full SQL (joins, CTEs, window functions, aggregations) powered by Apache DataFusion, with zero-copy Arrow IPC delivery that converts directly to your preferred DataFrame library.
Architecture
Section titled “Architecture”graph LR
subgraph Write Path
A[Your App] -->|insert record| B[DatasetProducer]
B -->|channel send| C[Event Handler]
C -->|batch full or timer| D[Arrow Batch Builder]
D -->|IPC bytes| E[gRPC]
end
E -->|insert_batch| F[Scouter Server]
F <-->|read/write| G[Delta Lake]
subgraph Read Path
H[DatasetClient] -->|sql / read| I[gRPC]
I -->|query_dataset| F
F -->|DataFusion| G
I -->|QueryResult| H
H -->|.to_arrow / .to_polars / .to_pandas| J[Your Analysis]
end
Write flow
Section titled “Write flow”-
You call
producer.insert(record)— serializes the Pydantic model to JSON and sends it through an unbounded channel. Returns immediately. -
Event handler receives the JSON — pushes it onto an internal
ArrayQueue. If the queue reachesbatch_size, it triggers a flush. -
Background flush task wakes periodically — if
scheduled_delay_secshas elapsed since the last publish, it drains the queue regardless of size. -
Arrow batch is built —
DynamicBatchBuilderconverts accumulated JSON rows into an ArrowRecordBatch. System columns are injected automatically. -
IPC bytes sent via gRPC — the batch is serialized to Arrow IPC format and sent to the server.
-
Server appends to Delta Lake — stored at the path derived from
catalog.schema_name.table, partitioned byscouter_partition_date.
Read flow
Section titled “Read flow”-
You call
client.sql(query)— sends the SQL string to the server via gRPC. -
Server executes via DataFusion — the query runs against the shared
SessionContextwhere all registered tables are visible. Joins across tables, CTEs, window functions, and aggregations all work. -
Results returned as
QueryResult— a wrapper around Arrow IPC bytes. Call.to_arrow(),.to_polars(), or.to_pandas()to convert.
For strict reads, client.read() constructs the SQL internally and returns validated Pydantic model instances.
Clients
Section titled “Clients”| Client | Purpose | Lifecycle |
|---|---|---|
Bifrost | Unified read/write client | Long-lived, call shutdown() on exit |
DatasetProducer | Write records to a Bifrost table | Long-lived, background queue, call shutdown() on exit |
DatasetClient | Read and query Bifrost tables | Stateless queries, bound to a table via TableConfig |
All clients use gRPC transport configured via GrpcConfig. Use Bifrost when you need both read and write access; use DatasetProducer or DatasetClient directly when you need only one.
Schema-on-Write
Section titled “Schema-on-Write”Unlike append-only logging, Bifrost enforces schema at write time. When you create a TableConfig, the Pydantic model’s JSON Schema is converted to an Arrow schema and fingerprinted:
- Schema mismatch caught before data lands — the server verifies the fingerprint on every batch.
- Schema changes produce a different fingerprint — adding, removing, or changing a field type all invalidate the fingerprint.
- System columns injected automatically — you don’t include them in your model.
- Fingerprint validated on read — when you create a
DatasetClientwith aTableConfig, the client verifies the server’s fingerprint matches your code.
System columns
Section titled “System columns”Every Bifrost table includes three columns managed by Scouter:
| Column | Type | Description |
|---|---|---|
scouter_created_at | Timestamp(us, UTC) | When the batch was built |
scouter_partition_date | Date32 | Today’s date — used for Delta Lake partitioning |
scouter_batch_id | Utf8 (UUID v7) | Unique ID per batch — same for all rows in one flush |
Supported types
Section titled “Supported types”| Python type | Arrow type | Notes |
|---|---|---|
str | Utf8View | |
int | Int64 | |
float | Float64 | |
bool | Boolean | |
datetime | Timestamp(us, UTC) | |
date | Date32 | |
Optional[T] | nullable T | |
List[T] | List(T) | Nested lists supported |
Enum | Dictionary(Int16, Utf8) | String enum values |
Nested BaseModel | Struct(...) | Recursive — up to 32 levels |
Bifrost and Monitoring
Section titled “Bifrost and Monitoring”Bifrost is a standalone system — it handles data storage and retrieval independent of Scouter’s drift detection and alerting features.
That said, the two are designed to converge. In a future release, you will be able to write custom queries against data in your Bifrost tables and use the results to drive monitoring alerts. This means you can define domain-specific health checks (e.g., “alert if the 95th percentile prediction confidence drops below 0.7 over the last hour”) directly on the data you’re already storing — without maintaining a separate analytics pipeline.
Next Steps
Section titled “Next Steps”- Quickstart — end-to-end write and read example
- Writing Data —
DatasetProducerconfiguration and patterns - Reading Data —
DatasetClient, SQL queries, format conversions - Schema Reference —
TableConfig, type mapping, fingerprinting