Storage Engine
Overview
The storage engine turns incoming events into durable, immutable data you can query quickly. It’s built around append-only writes, in-memory buffering, and on-disk segments that are efficient to scan and easy to skip.
Core Components
- WAL (write-ahead log): Per-shard durability log. Every accepted event is appended here first.
- MemTable: In-memory buffer for recent events. Fast inserts; swapped out when full.
- Flush worker: Converts a full MemTable into an immutable on-disk segment in the background.
- Segments: On-disk building blocks (columns, zone metadata, filters, lightweight indexes, index catalogs).
- Snapshots: Optional utility files (.snp events, .smt metadata) for export/replay and range bookkeeping.
- Materializations: Per-alias snapshots written by REMEMBER QUERY (catalog + frame files for remembered queries).
- Compactor (covered later): Merges small segments into larger ones to keep reads predictable.
Write Path (At a Glance)
- Validate payload against the event type schema.
- Append to the WAL (durability point).
- Apply to the MemTable (fast in-memory structure).
- When the MemTable hits a threshold, swap it out and enqueue a background flush.
- Flush worker writes a new segment and publishes it atomically.
Write Path (In Depth)
0) Validate the event
- What: Check the incoming payload against the registered schema for its event_type.
- Why: Ensures only well-formed data enters the system so downstream files and indexes remain consistent.
Example:
{
"timestamp": 1700000000,
"event_type": "signup",
"context_id": "user-42",
"payload": { "plan": "pro", "country": "US" }
}
Equivalent command:
STORE signup FOR user-42 PAYLOAD {"plan":"pro","country":"US"}
Validation ensures required fields exist and types are correct (for example, the event_type is known and a “plan” is provided in the payload).
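The exact checks are schema-driven and engine-specific. As a rough illustration of the idea, a minimal validation step might look like the sketch below; the IncomingEvent and EventSchema types and the specific checks are assumptions for this example, not the engine's real API.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical shapes for illustration only; the engine's real types differ.
struct IncomingEvent {
    timestamp: u64,
    event_type: String,
    context_id: String,
    payload: HashMap<String, String>,
}

struct EventSchema {
    required_fields: HashSet<String>,
}

fn validate(event: &IncomingEvent, schemas: &HashMap<String, EventSchema>) -> Result<(), String> {
    // The event type must be registered.
    let schema = schemas
        .get(&event.event_type)
        .ok_or_else(|| format!("unknown event_type: {}", event.event_type))?;
    // Core fields must be present and sane.
    if event.context_id.is_empty() {
        return Err("missing context_id".into());
    }
    if event.timestamp == 0 {
        return Err("missing or zero timestamp".into());
    }
    // Every schema-required field must appear in the payload.
    for field in &schema.required_fields {
        if !event.payload.contains_key(field) {
            return Err(format!("missing required payload field: {field}"));
        }
    }
    Ok(())
}

fn main() {
    let mut schemas = HashMap::new();
    schemas.insert(
        "signup".to_string(),
        EventSchema { required_fields: HashSet::from(["plan".to_string()]) },
    );
    let event = IncomingEvent {
        timestamp: 1_700_000_000,
        event_type: "signup".to_string(),
        context_id: "user-42".to_string(),
        payload: HashMap::from([("plan".to_string(), "pro".to_string())]),
    };
    assert!(validate(&event, &schemas).is_ok());
}
```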
1) Append to the WAL (durability point)
- What: Append the validated event to the per-shard Write-Ahead Log (WAL).
- Why: Once the append returns, the event will survive a crash. On restart, the system replays WAL entries to rebuild in-memory state and complete any interrupted flushes.
- Notes:
- WAL records are lightweight, line-oriented appends (JSON-serialized per line).
- WAL files rotate in sync with the MemTable flush threshold (engine.flush_threshold), so replay windows are bounded by flush points. After a successful flush, older WAL files up to that cutoff can be pruned.
- Behavior is tunable via config: [wal] enabled, dir, buffered, buffer_size, flush_each_write, fsync, fsync_every_n and [engine] flush_threshold.
Crash safety example:
- If the process crashes after the WAL append but before the event hits memory, recovery will re-insert it into the MemTable on startup.
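A rough sketch of the durability point, assuming a line-oriented, JSON-per-line WAL as described above; the file naming, the WalAppender type, and the single fsync knob here are illustrative simplifications, not the engine's actual implementation.

```rust
use std::fs::OpenOptions;
use std::io::{BufWriter, Result, Write};
use std::path::Path;

// Illustrative WAL appender: one JSON line per accepted event.
struct WalAppender {
    writer: BufWriter<std::fs::File>,
    fsync_each_write: bool, // mirrors the spirit of the fsync config knobs
}

impl WalAppender {
    fn open(path: &Path, fsync_each_write: bool) -> Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { writer: BufWriter::new(file), fsync_each_write })
    }

    // Returns only after the record is handed to the OS; with fsync enabled
    // it is also flushed to stable storage before the append is acknowledged.
    fn append(&mut self, json_line: &str) -> Result<()> {
        self.writer.write_all(json_line.as_bytes())?;
        self.writer.write_all(b"\n")?;
        self.writer.flush()?;
        if self.fsync_each_write {
            self.writer.get_ref().sync_data()?;
        }
        Ok(())
    }
}

fn main() -> Result<()> {
    let mut wal = WalAppender::open(Path::new("shard-3.wal"), true)?;
    wal.append(r#"{"timestamp":1700000000,"event_type":"signup","context_id":"user-42","payload":{"plan":"pro","country":"US"}}"#)?;
    Ok(())
}
```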
2) Insert into the MemTable (fast in-memory apply)
- What: Place the event into the in-memory, append-friendly, queryable buffer (MemTable).
- Why: Absorb writes in memory to batch them into large, sequential segment writes (avoids random I/O), maintain backpressure with bounded memory, and maximize ingest throughput. As a secondary benefit, new events are immediately visible to queries.
- Behavior:
- The MemTable is sized by flush_threshold (config). When it reaches capacity, it triggers a swap and a background flush.
- Inserts are grouped by context so the flusher can scan them quickly.
Small example:
- flush_threshold = 4
- Incoming events (in order): A, B, C, D, E
- A, B, C, D go into the active MemTable. After D, the MemTable is full.
- A background flush is enqueued for these four; a fresh MemTable becomes active.
- E enters the new MemTable immediately (no blocking on the background flush).
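A minimal sketch of the threshold behavior; the MemTable and Event shapes are illustrative (grouping by context is represented with a plain map), not the engine's real structures.

```rust
use std::collections::HashMap;

// Illustrative event and MemTable shapes; not the engine's real types.
struct Event {
    context_id: String,
}

struct MemTable {
    by_context: HashMap<String, Vec<Event>>, // grouped by context for fast flushing
    len: usize,
    flush_threshold: usize,
}

impl MemTable {
    fn new(flush_threshold: usize) -> Self {
        Self { by_context: HashMap::new(), len: 0, flush_threshold }
    }

    // Returns true when the table has reached capacity and should be swapped.
    fn insert(&mut self, event: Event) -> bool {
        self.by_context.entry(event.context_id.clone()).or_default().push(event);
        self.len += 1;
        self.len >= self.flush_threshold
    }
}

fn main() {
    let mut active = MemTable::new(4);
    for i in 0..4 {
        let full = active.insert(Event { context_id: format!("ctx-{i}") });
        // Only the fourth insert reports "full", which triggers the swap + flush.
        assert_eq!(full, i == 3);
    }
}
```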
3) Swap and enqueue a background flush
- What: When the active MemTable is full, it’s atomically swapped for a fresh, empty one, and the full snapshot is queued for flushing.
- Why: Writers remain responsive (no long I/O in the foreground) and the system maintains bounded memory.
- Details:
- The passive MemTable (now immutable) is handed off to the flush worker.
- Writes proceed into the newly created active MemTable.
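A sketch of the swap-and-handoff, assuming a mutex-guarded active table and a bounded channel to the flush worker; the concurrency primitives and names here are assumptions for illustration, and the real engine may organize this differently.

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::sync::{Arc, Mutex};
use std::thread;

// Stand-in for a full MemTable; in practice this holds the buffered events.
struct MemTable {
    events: Vec<String>,
}

fn swap_and_enqueue(active: &Arc<Mutex<MemTable>>, flush_tx: &SyncSender<MemTable>) {
    // Atomically replace the full table with a fresh, empty one.
    let passive = {
        let mut guard = active.lock().unwrap();
        std::mem::replace(&mut *guard, MemTable { events: Vec::new() })
    };
    // The now-immutable passive table is handed to the background flusher.
    // A bounded channel provides backpressure if flushing falls behind.
    flush_tx.send(passive).unwrap();
}

fn main() {
    let active = Arc::new(Mutex::new(MemTable { events: vec!["A".into(), "B".into()] }));
    let (flush_tx, flush_rx) = sync_channel::<MemTable>(1);

    // Background flush worker (writes a segment in the real engine).
    let worker = thread::spawn(move || {
        for passive in flush_rx {
            println!("flushing {} events into a new segment", passive.events.len());
        }
    });

    swap_and_enqueue(&active, &flush_tx);
    drop(flush_tx); // close the channel so the worker exits
    worker.join().unwrap();
}
```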
4) Flush worker writes a new immutable segment
- What: The background worker turns the passive MemTable into an on-disk segment directory (for example, 00042/).
- Inside the segment:
  - Column files: One file per field, optimized for sequential appends and later memory-mapped (mmap) access. Naming: <uid>_<field>.col. Example: u01_event_id.col, u01_timestamp.col, u01_event_type.col, u01_context_id.col, u01_plan.col, u01_country.col, where <uid> is defined per event type.
  - Zone metadata: Per-zone min/max timestamps, row ranges, and presence stats for pruning.
  - Filters/Indexes (policy-driven):
    - XOR: <uid>_<field>.xf (approximate membership)
    - Enum Bitmap (EBM): <uid>_<field>.ebm (eq/neq for enums)
    - Zone SuRF: <uid>_<field>.zsrf (succinct range filter for >, >=, <, <=)
    - Zone XOR Index: <uid>_<field>.zxf (per-zone XOR index for equality pruning)
    - Temporal Calendar: <uid>_<field>.cal (per-field day/hour → zone ids)
    - Temporal Index (slab): <uid>_<field>.tfi (per-field slab of per-zone temporal indexes)
  - Index Catalog: {uid}.icx (binary header + bincode SegmentIndexCatalog) recording which IndexKinds exist per field and globally for the segment.
  - Offsets/Index: Per-zone compressed offsets (.zfc files) describing compressed block ranges and in-block offsets.
- Snapshots (optional):
  - Event Snapshots (.snp): portable arrays of events with a binary header + length-prefixed JSON entries.
  - Snapshot Metadata (.smt): arrays of {uid, context_id, from_ts, to_ts} entries with a binary header + length-prefixed JSON.
- Publication: Segment creation is atomic at the directory level; once complete, readers can discover and scan it.
Sizing example:
- flush_threshold = 32_768
- events_per_zone = 2_048
- A full flush of 32,768 events creates exactly 16 zones. Each zone has its own metadata and contributes field values to the filter files. Larger events_per_zone values reduce metadata overhead but offer coarser pruning; smaller values increase pruning precision at the cost of more metadata.
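A small sketch of how a flushed batch is carved into zones under these settings; the ZoneMeta fields shown (row range plus min/max timestamps) follow the zone metadata described above, while the function itself is an illustrative stand-in.

```rust
// Illustrative zone bookkeeping: each zone covers a contiguous row range
// and records min/max timestamps for pruning.
#[derive(Debug)]
struct ZoneMeta {
    zone_id: usize,
    start_row: usize,
    end_row: usize, // exclusive
    min_ts: u64,
    max_ts: u64,
}

fn carve_zones(timestamps: &[u64], events_per_zone: usize) -> Vec<ZoneMeta> {
    timestamps
        .chunks(events_per_zone)
        .enumerate()
        .map(|(zone_id, chunk)| {
            let start_row = zone_id * events_per_zone;
            ZoneMeta {
                zone_id,
                start_row,
                end_row: start_row + chunk.len(),
                min_ts: *chunk.iter().min().unwrap(),
                max_ts: *chunk.iter().max().unwrap(),
            }
        })
        .collect()
}

fn main() {
    // 32,768 events with events_per_zone = 2,048 -> exactly 16 zones.
    let timestamps: Vec<u64> = (0..32_768u64).map(|i| 1_700_000_000 + i).collect();
    let zones = carve_zones(&timestamps, 2_048);
    assert_eq!(zones.len(), 16);
    println!("first zone: {:?}", zones[0]);
}
```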
5) Cleanup and WAL compaction
- What: After a successful flush, the system can prune or rotate old WAL files up to a cutoff corresponding to flushed data.
- Why: Keeps recovery time short and disk usage bounded.
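A sketch of the cleanup step, assuming WAL files carry a numeric sequence in their names and that everything at or below the flushed cutoff may be deleted; the naming scheme and directory layout here are assumptions for illustration, not the engine's actual ones.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Delete WAL files whose sequence number is <= the flushed cutoff.
// Assumes names like "000041.wal"; the real naming scheme may differ.
fn prune_wal(dir: &Path, flushed_cutoff: u64) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let seq = path
            .file_stem()
            .and_then(|s| s.to_str())
            .and_then(|s| s.parse::<u64>().ok());
        if let Some(seq) = seq {
            if path.extension().and_then(|e| e.to_str()) == Some("wal") && seq <= flushed_cutoff {
                fs::remove_file(&path)?; // older than the flush point: safe to drop
            }
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    prune_wal(Path::new("./wal/shard-3"), 41)
}
```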
End-to-end write example
- Client sends STORE signup ... with a valid payload.
- The engine validates the event against the signup schema.
- The event is appended to the WAL for shard 3 (durability).
- The event is inserted into shard 3’s active MemTable.
- When the MemTable reaches flush_threshold, it is swapped and the old one is queued for the background flush.
- The flush worker writes 00137/ with column files, 16 zones (if 32,768/2,048), zone metadata, policy-selected filters/indexes (XF/EBM/ZSRF/ZXF), an Index Catalog {uid}.icx, and offsets/index.
- Once published, queries immediately see the segment alongside any newer in-memory events.
- The WAL up to (and including) the flushed range is now safe to compact or rotate.
Failure model (write path)
- Crash before WAL append: The event is lost (not acknowledged).
- Crash after WAL append but before MemTable insert: The event is recovered from the WAL and re-applied on startup.
- Crash after MemTable insert but before flush: The event is not yet in a segment, but it is durable in the WAL. On restart, WAL replay restores it to the MemTable; if a swap occurred and a passive MemTable existed, its contents are reconstructed from WAL as well. No data loss; no duplicate segments.
- Crash during flush: The WAL still contains the flushed events; on restart, the system replays or completes the flush. Partially written segments are ignored until a valid, fully published segment is present.
Tuning the write path
- shards: More shards increase parallelism of WAL, MemTable, and flush pipelines (at the cost of higher CPU and RAM usage and more files and directories).
- flush_threshold: Controls MemTable size. Higher values reduce flush frequency (bigger segments) but increase peak memory and WAL replay cost.
- events_per_zone: Smaller values improve pruning for reads but increase metadata and filter counts. Pick based on query selectivity and typical field cardinalities.
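For orientation, the three knobs could be grouped roughly as below; this is a sketch with placeholder defaults, not the engine's actual config struct or default values.

```rust
// Illustrative grouping of the tuning knobs discussed above.
// Field names mirror the config keys; the default values are placeholders.
struct EngineConfig {
    shards: usize,          // parallel WAL/MemTable/flush pipelines
    flush_threshold: usize, // events per MemTable before swap + flush
    events_per_zone: usize, // zone granularity inside each segment
}

impl Default for EngineConfig {
    fn default() -> Self {
        Self { shards: 4, flush_threshold: 32_768, events_per_zone: 2_048 }
    }
}

fn main() {
    let cfg = EngineConfig::default();
    // With these placeholder values, each full flush produces 16 zones.
    println!("zones per segment: {}", cfg.flush_threshold / cfg.events_per_zone);
    println!("shards: {}", cfg.shards);
}
```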
Durability & Recovery
- Covered in the write path: WAL append is the durability point; replay restores MemTables; WAL rotation keeps recovery bounded. See Failure model above.
Backpressure & Safety
- Bounded channels between components provide backpressure under load (writers slow down instead of exhausting memory).
- Async workers (flush and compaction) are throttled so foreground writes and reads stay responsive.
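A minimal illustration of the bounded-channel idea: when the flush queue is full, the writer blocks (or can shed load) instead of buffering without limit. The channel capacity, job type, and worker here are assumptions for the sketch.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;
use std::time::Duration;

fn main() {
    // Capacity 2: at most two pending flush jobs before writers feel backpressure.
    let (tx, rx) = sync_channel::<String>(2);

    // Slow background worker, standing in for the flush/compaction path.
    let worker = thread::spawn(move || {
        for job in rx {
            thread::sleep(Duration::from_millis(50)); // simulate I/O
            println!("flushed {job}");
        }
    });

    for i in 0..5 {
        match tx.try_send(format!("memtable-{i}")) {
            Ok(()) => println!("enqueued memtable-{i}"),
            // Queue full: the caller can block with send(), shed load, or slow down.
            Err(TrySendError::Full(job)) => {
                println!("queue full, blocking until the worker catches up");
                tx.send(job).unwrap();
            }
            Err(TrySendError::Disconnected(_)) => break,
        }
    }

    drop(tx);
    worker.join().unwrap();
}
```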
This is the spine of the engine: durable append, fast memory, immutable segments with rich metadata, and just enough background work to keep reads snappy as data grows.
Policy-driven index build (write-time)
- Index builds are determined by an IndexBuildPolicy and an IndexBuildPlanner that produce a per-field BuildPlan of IndexKind bitflags and global kinds.
- ZoneWriter consumes the plan to build only the requested artifacts; legacy catch-all builders were removed in favor of filtered builders (e.g., build_all_filtered).
- RLTE (if enabled) is included via the policy and emitted best-effort.
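A sketch of the bitflag idea behind the per-field plan; the flag values, the toy policy, and the plan_for helper are illustrative assumptions, and the real IndexBuildPolicy/IndexBuildPlanner are richer.

```rust
use std::collections::HashMap;

// Illustrative bit encoding of index kinds; the engine's IndexKind set is richer.
const XOR: u32 = 1 << 0;
const ENUM_BITMAP: u32 = 1 << 1;
const ZONE_SURF: u32 = 1 << 2;
const ZONE_XOR: u32 = 1 << 3;
const TEMPORAL: u32 = 1 << 4;

// Per-field plan: which artifacts the writer should build for each column.
type BuildPlan = HashMap<String, u32>;

fn plan_for(field: &str, is_enum: bool, is_datetime: bool) -> u32 {
    // Toy policy: everything gets equality pruning; enums get bitmaps;
    // datetime-like fields get temporal structures.
    let mut kinds = XOR | ZONE_XOR;
    if is_enum { kinds |= ENUM_BITMAP; }
    if is_datetime { kinds |= TEMPORAL; }
    if !is_enum && !is_datetime { kinds |= ZONE_SURF; }
    let _ = field; // a real planner would also consult per-field policy overrides
    kinds
}

fn main() {
    let mut plan: BuildPlan = HashMap::new();
    plan.insert("plan".into(), plan_for("plan", true, false));
    plan.insert("created_at".into(), plan_for("created_at", false, true));
    // The writer builds only what the plan requests, e.g. skip SuRF for enums.
    assert_ne!(plan["plan"] & ENUM_BITMAP, 0);
    assert_eq!(plan["plan"] & ZONE_SURF, 0);
}
```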
Read-time catalogs and planning
- Each segment’s {uid}.icx is loaded (and cached) into a SegmentIndexCatalog.
- IndexRegistry aggregates catalogs across segments; IndexPlanner chooses an explicit IndexStrategy per filter based on available kinds and the schema.
- Strategy selection uses a representative segment that actually has a catalog; if no catalog/kinds exist for a field/segment, the planner chooses FullScan to avoid filesystem probing.
- Temporal strategies are field-aware: TemporalEq { field } and TemporalRange { field } use the per-field calendar and slabbed temporal index for both the fixed timestamp and payload datetime fields (e.g., created_at).
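A simplified sketch of the fallback behavior described above: pick a per-filter strategy from the kinds recorded in the catalog, and fall back to a full scan when the catalog has nothing for the field. The enum variants and the selection order are assumptions; only the FullScan fallback is taken directly from the description.

```rust
use std::collections::{HashMap, HashSet};

// Illustrative subsets of the real IndexKind / IndexStrategy enums.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum IndexKind { ZoneXor, EnumBitmap, ZoneSurf, Temporal }

#[derive(Debug, PartialEq)]
enum IndexStrategy { ZoneXorEq, EnumBitmapEq, SurfRange, TemporalRange { field: String }, FullScan }

// Per-segment catalog: which kinds exist for each field.
type SegmentIndexCatalog = HashMap<String, HashSet<IndexKind>>;

fn choose(catalog: &SegmentIndexCatalog, field: &str, is_range: bool) -> IndexStrategy {
    let Some(kinds) = catalog.get(field) else {
        // No catalog entry for this field: avoid filesystem probing, just scan.
        return IndexStrategy::FullScan;
    };
    if is_range {
        if kinds.contains(&IndexKind::Temporal) {
            return IndexStrategy::TemporalRange { field: field.to_string() };
        }
        if kinds.contains(&IndexKind::ZoneSurf) {
            return IndexStrategy::SurfRange;
        }
    } else {
        if kinds.contains(&IndexKind::EnumBitmap) {
            return IndexStrategy::EnumBitmapEq;
        }
        if kinds.contains(&IndexKind::ZoneXor) {
            return IndexStrategy::ZoneXorEq;
        }
    }
    IndexStrategy::FullScan
}

fn main() {
    let mut catalog: SegmentIndexCatalog = HashMap::new();
    catalog.insert("plan".into(), HashSet::from([IndexKind::EnumBitmap, IndexKind::ZoneXor]));
    assert_eq!(choose(&catalog, "plan", false), IndexStrategy::EnumBitmapEq);
    assert_eq!(choose(&catalog, "country", false), IndexStrategy::FullScan);
}
```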
Read-time Projection & Column Pruning
- The query planner derives a minimal column set to load based on:
  - Core fields: context_id, event_type, timestamp (always loaded)
  - Filter fields used in WHERE
  - Requested payload fields from RETURN [ ... ] (if provided)
- If RETURN is omitted or empty (RETURN []), all payload fields are considered eligible.
- Unknown fields in RETURN are ignored (schema-driven).
- Only the selected columns are mmap’d and read; others are skipped entirely, reducing I/O and memory.
- Projection decisions are logged under the query::projection target for debugging.
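A sketch of deriving the minimal column set (core fields, WHERE fields, and requested RETURN fields, intersected with the schema); plain string sets stand in for the planner's real structures.

```rust
use std::collections::HashSet;

// Derive the minimal set of columns to mmap for a query.
// `return_fields = None` or an empty list means "all payload fields are eligible".
fn project_columns(
    schema_payload_fields: &HashSet<String>,
    where_fields: &HashSet<String>,
    return_fields: Option<&[String]>,
) -> HashSet<String> {
    let mut cols: HashSet<String> =
        ["context_id", "event_type", "timestamp"].iter().map(|s| s.to_string()).collect();

    // Filter fields referenced in WHERE (only those the schema knows about).
    cols.extend(where_fields.intersection(schema_payload_fields).cloned());

    match return_fields {
        Some(fields) if !fields.is_empty() => {
            // Unknown RETURN fields are simply ignored (schema-driven).
            cols.extend(fields.iter().filter(|f| schema_payload_fields.contains(*f)).cloned());
        }
        _ => cols.extend(schema_payload_fields.iter().cloned()), // RETURN omitted or empty
    }
    cols
}

fn main() {
    let schema: HashSet<String> = ["plan", "country"].iter().map(|s| s.to_string()).collect();
    let filters: HashSet<String> = ["country"].iter().map(|s| s.to_string()).collect();
    let cols = project_columns(&schema, &filters, Some(&["plan".to_string()]));
    // Loads core fields + country (WHERE) + plan (RETURN); nothing else.
    assert_eq!(cols.len(), 5);
}
```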
Materialized Query Store
- Directory structure: materializations/catalog.bin + materializations/<alias>/frames/NNNNNN.mat + manifest.bin.
- Catalog entries store the serialized query spec (JSON in bincode), schema snapshot, byte/row counters, high-water mark, retention hints, and telemetry deltas.
- Each frame is typed row storage with a shared header + LZ4-compressed payload (null bitmap + typed values) so it can be streamed back verbatim.
- SHOW streams existing frames, runs a delta pipeline using the stored high-water mark, appends new frames, and updates catalog metadata.
- Retention policies (max rows / max age) are persisted but currently optional; when present, frames are pruned after each append.
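As a small illustration of the layout and retention bookkeeping: the frame path follows the directory structure above, while the Materialization and Frame structs and the max-rows pruning rule are simplified stand-ins for the real catalog types.

```rust
use std::path::PathBuf;

// Simplified stand-ins for the materialization catalog bookkeeping.
struct Frame { rows: u64 }

struct Materialization {
    alias: String,
    frames: Vec<Frame>,
    high_water_mark: u64,  // last point already materialized; SHOW resumes from here
    max_rows: Option<u64>, // optional retention hint
}

impl Materialization {
    // materializations/<alias>/frames/NNNNNN.mat
    fn frame_path(&self, frame_no: u64) -> PathBuf {
        PathBuf::from("materializations")
            .join(&self.alias)
            .join("frames")
            .join(format!("{frame_no:06}.mat"))
    }

    // Drop oldest frames until the retained row count fits the policy
    // (always keeping the most recent frame).
    fn apply_retention(&mut self) {
        if let Some(max_rows) = self.max_rows {
            while self.frames.iter().map(|f| f.rows).sum::<u64>() > max_rows && self.frames.len() > 1 {
                self.frames.remove(0);
            }
        }
    }
}

fn main() {
    let mut m = Materialization {
        alias: "daily_signups".to_string(),
        frames: vec![Frame { rows: 800 }, Frame { rows: 600 }],
        high_water_mark: 1_700_000_000,
        max_rows: Some(1_000),
    };
    println!("next frame file: {}", m.frame_path(3).display());
    m.apply_retention();
    assert_eq!(m.frames.len(), 1); // oldest frame pruned to satisfy max_rows
    println!("high-water mark: {}", m.high_water_mark);
}
```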