Storage Engine
This document provides a detailed overview of Oxibase’s storage engine, including its design principles, components, and how data is stored and retrieved.
Storage Engine Design
Oxibase’s storage engine is designed with the following principles:
- Memory-optimized - Prioritizes in-memory performance with optional persistence
- MVCC-based - Uses multi-version concurrency control for transaction isolation
- Version-organized - Tracks different versions of rows for transaction isolation
- Type-specialized - Uses different strategies for different data types
- Index-accelerated - Multiple index types to optimize different query patterns
Architecture Overview
graph TB
subgraph "Engine Layer - Orchestration"
MVCCEngine["MVCCEngine<br/>src/storage/mvcc/engine.rs<br/>• Table lifecycle<br/>• Transaction registry<br/>• Persistence manager<br/>• Schema catalog"]
TransactionRegistry["TransactionRegistry<br/>• Transaction ID allocation<br/>• Commit sequence numbers<br/>• Visibility checking"]
PersistenceManager["PersistenceManager<br/>• WAL management<br/>• Snapshot creation<br/>• Recovery coordination"]
end
subgraph "Table Layer - Transaction-aware Access"
MVCCTable["MVCCTable<br/>src/storage/mvcc/table.rs<br/>• Transaction isolation<br/>• Row operations<br/>• Index selection<br/>• Commit/rollback"]
TransactionVersionStore["TransactionVersionStore<br/>• Uncommitted changes<br/>• Write-set tracking<br/>• Conflict detection"]
end
subgraph "Storage Layer - Version Management"
VersionStore["VersionStore<br/>src/storage/mvcc/version_store.rs<br/>• Version chains<br/>• Visibility filtering<br/>• Arena-based scanning<br/>• Zone maps"]
RowArena["RowArena<br/>• Contiguous memory storage<br/>• 50x+ scan speedup<br/>• Zero-copy reads"]
VersionChainEntry["VersionChainEntry<br/>• Current version<br/>• Arc prev pointer<br/>• Arena index"]
end
subgraph "Index Layer"
BTreeIndex["BTreeIndex<br/>src/storage/mvcc/btree_index.rs<br/>• Range queries O(log n + k)<br/>• Cached min/max"]
HashIndex["HashIndex<br/>src/storage/mvcc/hash_index.rs<br/>• Equality O(1)<br/>• ahash for TEXT/JSON"]
BitmapIndex["BitmapIndex<br/>• Boolean columns<br/>• Bitwise AND/OR/NOT"]
MultiColumnIndex["MultiColumnIndex<br/>• Composite key lookups<br/>• Prefix matching"]
end
MVCCEngine -->|creates| MVCCTable
MVCCEngine -->|manages| TransactionRegistry
MVCCEngine -->|uses| PersistenceManager
MVCCTable -->|reads/writes| TransactionVersionStore
MVCCTable -->|delegates to| VersionStore
TransactionVersionStore -->|commits to| VersionStore
VersionStore -->|stores versions in| RowArena
VersionStore -->|maintains| VersionChainEntry
VersionStore -->|updates| BTreeIndex
VersionStore -->|updates| HashIndex
VersionStore -->|updates| BitmapIndex
VersionStore -->|updates| MultiColumnIndex
TransactionRegistry -.checks visibility.-> VersionStore
Core Components and Responsibilities
The storage engine consists of four main classes that handle different aspects of data storage and retrieval:
| Component | File | Primary Responsibilities |
|---|---|---|
MVCCEngine |
engine.rs:254-279 | Database lifecycle, table creation/deletion, transaction coordination, schema management, persistence orchestration |
MVCCTable |
table.rs:36-45 | Per-transaction table view, row insert/update/delete, index lookup optimization, commit/rollback execution |
VersionStore |
version_store.rs:157-193 | Version chain management, visibility checking, arena-based scanning, row counting, index population |
TransactionVersionStore |
version_store.rs:1354-1414 | Uncommitted changes buffer, write-set conflict tracking, local version cache |
Storage Engine Data Flow
This diagram shows how a typical transaction operation flows through the storage layers, from the public API down to the physical storage structures.
sequenceDiagram
participant Client
participant Transaction as "Transaction<br/>(MvccTransaction)"
participant Engine as "MVCCEngine"
participant Table as "MVCCTable"
participant TxnStore as "TransactionVersionStore"
participant VerStore as "VersionStore"
participant Index as "BTreeIndex/HashIndex"
participant WAL as "WALManager"
Client->>Engine: begin_transaction()
Engine->>TransactionRegistry: allocate txn_id
Engine->>Transaction: MvccTransaction(txn_id)
Client->>Transaction: execute("INSERT INTO users VALUES (...)")
Transaction->>Engine: get_table_for_transaction("users")
Engine->>Table: MVCCTable::new_with_shared_store()
Transaction->>Table: insert(row)
Table->>TxnStore: add_local_version(row_version)
Note over TxnStore: Buffered in memory<br/>Not yet visible
Client->>Transaction: commit()
Transaction->>Table: commit()
Table->>Index: update indexes (old_value, new_value)
Table->>TxnStore: commit()
TxnStore->>VerStore: add_versions_batch()
VerStore->>RowArena: insert_row()
Note over VerStore: Versions now visible<br/>to new transactions
VerStore->>WAL: record_operation()
WAL-->>VerStore: LSN
Table-->>Transaction: Ok
Transaction->>TransactionRegistry: mark_committed(commit_seq)
Transaction-->>Client: Ok
Storage Components
Table Structure
Tables in Oxibase are composed of:
- Metadata - Schema information, column definitions, and indexes
- Row Data - The primary data storage, organized by row
- Version Store - Tracks row versions for MVCC
- Indexes - B-tree, Hash, Bitmap, and multi-column indexes
- Transaction Manager - Manages transaction state and visibility
Data Types
Oxibase supports a variety of data types, each with optimized storage:
- INTEGER - 64-bit signed integers
- FLOAT - 64-bit floating-point numbers
- TEXT - Variable-length UTF-8 strings
- BOOLEAN - Boolean values (true/false)
- TIMESTAMP - Date and time values
- JSON - JSON documents
- NULL - Null values supported for all types
Version Management
Oxibase tracks different versions of data for transaction isolation:
- Each change creates a new version rather than overwriting
- Versions are associated with transaction IDs
- Visibility rules determine which versions each transaction can see
- Old versions are garbage collected when no longer needed
Data Storage Format
In-Memory Format
In memory, data is stored with these characteristics:
- Row-based primary storage - Records are stored as coherent rows
- Version chains - Linked versions for MVCC
- Type-specific indexes - B-tree, Hash, Bitmap based on column type
- Efficient structures - Optimized for different data types
On-Disk Format
When persistence is enabled, data is stored on disk with:
- Binary serialization - Compact binary format for storage
- WAL files - Write-ahead log for durability
- Snapshot files - Point-in-time table snapshots
- Metadata files - Schema and index information
MVCC Implementation
The storage engine uses MVCC to provide transaction isolation:
- Full Version Chains - Version history per row linked via pointers
- Transaction IDs - Each version is associated with a transaction ID
- Visibility Rules - Traverse version chains to find visible versions
- Lock-Free Reads - Readers never block writers
- Automatic Cleanup - Old versions garbage collected when no longer needed
For more details, see the MVCC Implementation and Transaction Isolation documentation.
Version Chain Management
Oxibase implements MVCC by maintaining linked version chains for each row. Each VersionChainEntry contains a RowVersion and an Arc pointer to the previous version, enabling O(1) chain cloning for snapshot isolation.
graph LR
subgraph "versions: ConcurrentInt64Map<VersionChainEntry>"
row1["row_id: 1"]
row2["row_id: 2"]
row3["row_id: 3"]
end
subgraph "Version Chain for row_id=1"
v3["VersionChainEntry<br/>txn_id: 30<br/>deleted_at: 30<br/>arena_idx: None"]
v2["VersionChainEntry<br/>txn_id: 20<br/>deleted_at: 0<br/>arena_idx: Some(15)"]
v1["VersionChainEntry<br/>txn_id: 10<br/>deleted_at: 0<br/>arena_idx: Some(5)"]
v3 -->|"prev: Arc"| v2
v2 -->|"prev: Arc"| v1
v1 -->|"prev: None"| end1["Initial version"]
end
subgraph "RowArena - Contiguous Memory"
arena_idx_5["idx 5: row_id=1, txn=10, data=[...]"]
arena_idx_15["idx 15: row_id=1, txn=20, data=[...]"]
arena_other["Other rows..."]
end
row1 --> v3
v1 -.arena_idx: Some(5).-> arena_idx_5
v2 -.arena_idx: Some(15).-> arena_idx_15
Key data structures:
versions: ConcurrentInt64Map<VersionChainEntry>- Mapsrow_idto the head of its version chainVersionChainEntry- Linked list node withversion: RowVersion,prev: Option<Arc<...>>, andarena_idx: Option<usize>RowArena- Contiguous memory storage providing 50x+ faster full table scans via zero-copy reads
The use of Arc for the prev pointer enables cheap snapshot cloning—creating a snapshot of a table is O(1) because the chain is reference-counted rather than deep-copied.
Transaction Version Store
The TransactionVersionStore buffers uncommitted changes within a transaction, maintaining write-set information for conflict detection during commit.
| Field | Type | Purpose |
|---|---|---|
local_versions |
Int64Map<RowVersion> |
Uncommitted inserts/updates/deletes |
write_set |
Int64Map<WriteSetEntry> |
Conflict detection: tracks read version and sequence number |
version_store |
Arc<VersionStore> |
Reference to global committed storage |
txn_id |
i64 |
Transaction identifier |
Commit process:
- Check uniqueness constraints on all indexes
- Update indexes with (old_value, new_value) pairs
- Flush
local_versionstoVersionStorein batch - Mark zone maps as stale (for optimizer statistics)
Rollback process:
- Discard all entries in
local_versions - Clear
write_settracking data
Index Architecture
Oxibase provides three index types, each optimized for different query patterns. The MVCCTable automatically selects the optimal index type based on column data types.
graph TB
subgraph "Index Selection Logic"
auto_select["MVCCTable::auto_select_index_type()<br/>table.rs:105-130"]
end
subgraph "Index Type Decision Tree"
check_bool["Is column BOOLEAN?"]
check_text["Is column TEXT/JSON?"]
check_numeric["Is column INTEGER/FLOAT/TIMESTAMP?"]
bitmap_idx["IndexType::Bitmap<br/>Fast AND/OR/NOT<br/>Low cardinality"]
hash_idx["IndexType::Hash<br/>O(1) equality<br/>Avoids O(strlen)"]
btree_idx["IndexType::BTree<br/>O(log n) range queries<br/>Cached min/max"]
end
subgraph "Index Implementations"
BTreeIndexImpl["BTreeIndex<br/>btree_index.rs:74-113<br/>• BTreeMap: sorted values<br/>• AHashMap: O(1) equality<br/>• FxHashMap: row->value"]
HashIndexImpl["HashIndex<br/>hash_index.rs:105-127<br/>• ahash: fast hashing<br/>• SmallVec: inline storage<br/>• Collision handling"]
BitmapIndexImpl["BitmapIndex<br/>• Roaring bitmaps<br/>• Bitwise operations<br/>• Compressed storage"]
end
auto_select --> check_bool
check_bool -->|Yes, single column| bitmap_idx
check_bool -->|No| check_text
check_text -->|Yes| hash_idx
check_text -->|No| check_numeric
check_numeric -->|Yes| btree_idx
bitmap_idx --> BitmapIndexImpl
hash_idx --> HashIndexImpl
btree_idx --> BTreeIndexImpl
Index selection rules:
- BOOLEAN →
BitmapIndexfor fast bitwise AND/OR/NOT operations - TEXT/JSON →
HashIndexto avoid O(strlen) comparisons in B-tree nodes - INTEGER/FLOAT/TIMESTAMP →
BTreeIndexfor range query support - Mixed types →
BTreeIndexas safe default
BTreeIndex Internal Structure
The BTreeIndex maintains dual data structures for optimal performance across different query types:
| Data Structure | Type | Purpose | Complexity |
|---|---|---|---|
sorted_values |
RwLock<BTreeMap<Value, RowIdSet>> |
Range queries, MIN/MAX | O(log n + k) |
value_to_rows |
RwLock<AHashMap<Value, RowIdSet>> |
Equality lookups | O(1) |
row_to_value |
RwLock<FxHashMap<i64, Value>> |
Removal by row_id | O(1) |
cached_min |
RwLock<Option<Value>> |
MIN aggregate | O(1) |
cached_max |
RwLock<Option<Value>> |
MAX aggregate | O(1) |
The dual-index strategy (BTreeMap + HashMap) trades memory (~2x for unique values) for optimal query performance: O(1) equality via hash lookup and O(log n + k) range queries via sorted iteration.
HashIndex Characteristics
The HashIndex is designed for high-cardinality TEXT columns where equality queries dominate:
Advantages:
- O(1) exact match via
ahash(faster than SipHash) - Avoids O(strlen) string comparisons in B-tree traversal
SmallVec<[i64; 4]>reduces allocations for unique indexes
Limitations:
- Does NOT support range queries
- Does NOT support ORDER BY optimization
- Requires exact match on all indexed columns (no partial key lookups)
Triple-lock write pattern:
add() acquires:
1. hash_to_rows: Write
2. row_to_hash: Write
3. hash_to_values: Write
This ensures atomicity but serializes concurrent writes. Read operations (find) only acquire a single read lock for minimal contention.
Row Arena: Zero-Copy Scanning
The RowArena stores row data in contiguous memory to eliminate per-row allocation overhead during full table scans:
Performance characteristics:
- 50x+ speedup for full table scans vs. per-row cloning
- Pre-acquire locks once per scan (O(1) instead of O(N))
- Direct slice access via
unsafebounds-checked reads - Cache locality from contiguous layout
Arena structure:
arena_rows: Vec<RowMetadata> // [start, end, row_id, txn_id, create_time, deleted_at]
arena_data: Vec<Value> // Flattened row data
row_arena_index: Int64Map<usize> // row_id -> arena_idx
Usage pattern:
- Acquire
arena_rowsandarena_dataread guards once - Iterate version chains checking visibility
- For visible versions, read directly from arena via
arena_idx - Release locks after iteration completes
- Sort results by
row_id(if needed)
Data Access Paths
Point Lookups
For point queries (e.g., WHERE id = 5):
- Use primary key or index to locate the row
- Apply visibility rules based on transaction
- Return the visible version
Range Scans
For range queries (e.g., WHERE price > 100):
- Use B-tree index if available for the column
- Scan matching index entries
- Apply visibility rules to each row
- Return visible results
Full Table Scans
For queries without applicable indexes:
- Scan all rows in the table
- Apply WHERE clause filters
- Apply visibility rules
- For large tables, parallelize the scan
Data Modification
Insert Operations
When data is inserted:
- Values are validated against column types
- A new row version is created with the current transaction ID
- The row is added to the primary row storage
- Indexes are updated
- The operation is recorded in the WAL (if enabled)
Update Operations
When data is updated:
- The existing row is located via indexes or scan
- A new version is created with updated values
- The new version links to the previous version
- Indexes are updated to reflect the changes
- The operation is recorded in the WAL (if enabled)
Delete Operations
When data is deleted:
- The existing row is located
- A deletion marker version is created
- Indexes are updated to reflect the deletion
- The operation is recorded in the WAL (if enabled)
Persistence and Recovery
When persistence is enabled:
Write-Ahead Logging (WAL)
The storage engine provides crash consistency through Write-Ahead Logging (WAL) and periodic snapshots:
WAL Manager responsibilities:
- Sequential writes with CRC32 checksums
- Two-phase recovery: Phase 1 identifies committed transactions, Phase 2 applies their changes
- Compression (optional) for reduced I/O
Snapshot system:
- Binary format with magic bytes (
0x50414E53= “SNAP”) - Atomic 3-phase writes: temp file → sync → rename
- Tracks source LSN for incremental recovery
Recovery flow:
- Read snapshot metadata to get checkpoint LSN
- Load table snapshots (fastest recovery path)
- Replay WAL entries from checkpoint LSN forward
- Apply only entries from committed transactions (two-phase)
- Populate indexes in single pass after replay completes
Write-Ahead Logging (WAL)
- All modifications are recorded in the WAL before being applied
- WAL entries include transaction ID, operation type, and data
- WAL is flushed to disk for durability
- This ensures recovery in case of crashes
Snapshots
- Periodically, consistent snapshots of tables are created
- Snapshots contain the latest version of each row
- Snapshots accelerate recovery compared to replaying the entire WAL
Recovery Process
After a crash, recovery proceeds as follows:
- The latest valid snapshot is loaded for each table
- WAL entries after the snapshot are replayed
- Index definitions are restored and indexes rebuilt
- Incomplete transactions are rolled back
Schema Management
The MVCCEngine maintains the schema catalog and handles DDL operations:
Schema storage:
schemas: Arc<RwLock<FxHashMap<String, Schema>>>- Lowercase table names for case-insensitive lookups
- Schema changes recorded to WAL
ALTER TABLE operations:
AddColumn: Normalizes existing rows, adds default valuesDropColumn: Truncates row dataRenameColumn: Updates schema, preserves dataModifyColumn: Type coercion with validationRenameTable: Updates all references atomically
Schema normalization:
When reading rows, MVCCTable normalizes row data to match the current schema, handling columns added/dropped via ALTER TABLE by filling defaults or truncating.
Implementation Details
Core storage engine components in the Rust codebase:
src/storage/
├── mod.rs # Storage module entry point
├── traits/ # Storage interfaces
│ ├── engine.rs # Engine trait
│ ├── table.rs # Table trait
│ └── transaction.rs # Transaction trait
└── mvcc/ # MVCC implementation
├── engine.rs # MVCC storage engine
├── table.rs # Table with row storage
├── transaction.rs # Transaction management
├── version_store.rs # Version tracking
├── btree_index.rs # B-tree index
├── hash_index.rs # Hash index
├── bitmap_index.rs # Bitmap index
├── multi_column_index.rs # Multi-column index
└── persistence.rs # WAL and snapshots
Performance Characteristics
Read Performance
- Point Lookups - O(1) with hash index, O(log n) with B-tree
- Range Scans - O(log n + k) with B-tree index
- Full Scans - Parallelized for large tables
Write Performance
- Inserts - O(log n) per index
- Updates - O(log n) per index plus version creation
- Deletes - O(log n) per index for marker creation
Concurrency
- High Read Concurrency - MVCC enables many concurrent readers
- Write Concurrency - Multiple writers with conflict detection
- No Read Locks - Readers never block on writes