Storage Engine

This document provides a detailed overview of Oxibase’s storage engine, including its design principles, components, and how data is stored and retrieved.

Storage Engine Design

Oxibase’s storage engine is designed with the following principles:

Memory-optimized - Prioritizes in-memory performance with optional persistence
MVCC-based - Uses multi-version concurrency control for transaction isolation
Version-organized - Tracks different versions of rows for transaction isolation
Type-specialized - Uses different strategies for different data types
Index-accelerated - Multiple index types to optimize different query patterns

Architecture Overview

graph TB
    subgraph "Engine Layer - Orchestration"
        MVCCEngine["MVCCEngine<br/>src/storage/mvcc/engine.rs<br/>• Table lifecycle<br/>• Transaction registry<br/>• Persistence manager<br/>• Schema catalog"]

        TransactionRegistry["TransactionRegistry<br/>• Transaction ID allocation<br/>• Commit sequence numbers<br/>• Visibility checking"]

        PersistenceManager["PersistenceManager<br/>• WAL management<br/>• Snapshot creation<br/>• Recovery coordination"]
    end

    subgraph "Table Layer - Transaction-aware Access"
        MVCCTable["MVCCTable<br/>src/storage/mvcc/table.rs<br/>• Transaction isolation<br/>• Row operations<br/>• Index selection<br/>• Commit/rollback"]

        TransactionVersionStore["TransactionVersionStore<br/>• Uncommitted changes<br/>• Write-set tracking<br/>• Conflict detection"]
    end

    subgraph "Storage Layer - Version Management"
        VersionStore["VersionStore<br/>src/storage/mvcc/version_store.rs<br/>• Version chains<br/>• Visibility filtering<br/>• Arena-based scanning<br/>• Zone maps"]

        RowArena["RowArena<br/>• Contiguous memory storage<br/>• 50x+ scan speedup<br/>• Zero-copy reads"]

        VersionChainEntry["VersionChainEntry<br/>• Current version<br/>• Arc prev pointer<br/>• Arena index"]
    end

    subgraph "Index Layer"
        BTreeIndex["BTreeIndex<br/>src/storage/mvcc/btree_index.rs<br/>• Range queries O(log n + k)<br/>• Cached min/max"]

        HashIndex["HashIndex<br/>src/storage/mvcc/hash_index.rs<br/>• Equality O(1)<br/>• ahash for TEXT/JSON"]

        BitmapIndex["BitmapIndex<br/>• Boolean columns<br/>• Bitwise AND/OR/NOT"]

        MultiColumnIndex["MultiColumnIndex<br/>• Composite key lookups<br/>• Prefix matching"]
    end

    MVCCEngine -->|creates| MVCCTable
    MVCCEngine -->|manages| TransactionRegistry
    MVCCEngine -->|uses| PersistenceManager

    MVCCTable -->|reads/writes| TransactionVersionStore
    MVCCTable -->|delegates to| VersionStore

    TransactionVersionStore -->|commits to| VersionStore

    VersionStore -->|stores versions in| RowArena
    VersionStore -->|maintains| VersionChainEntry
    VersionStore -->|updates| BTreeIndex
    VersionStore -->|updates| HashIndex
    VersionStore -->|updates| BitmapIndex
    VersionStore -->|updates| MultiColumnIndex

    TransactionRegistry -.checks visibility.-> VersionStore

Core Components and Responsibilities

The storage engine consists of four main classes that handle different aspects of data storage and retrieval:

Component	File	Primary Responsibilities
`MVCCEngine`	engine.rs:254-279	Database lifecycle, table creation/deletion, transaction coordination, schema management, persistence orchestration
`MVCCTable`	table.rs:36-45	Per-transaction table view, row insert/update/delete, index lookup optimization, commit/rollback execution
`VersionStore`	version_store.rs:157-193	Version chain management, visibility checking, arena-based scanning, row counting, index population
`TransactionVersionStore`	version_store.rs:1354-1414	Uncommitted changes buffer, write-set conflict tracking, local version cache

Storage Engine Data Flow

This diagram shows how a typical transaction operation flows through the storage layers, from the public API down to the physical storage structures.

sequenceDiagram
    participant Client
    participant Transaction as "Transaction<br/>(MvccTransaction)"
    participant Engine as "MVCCEngine"
    participant Table as "MVCCTable"
    participant TxnStore as "TransactionVersionStore"
    participant VerStore as "VersionStore"
    participant Index as "BTreeIndex/HashIndex"
    participant WAL as "WALManager"

    Client->>Engine: begin_transaction()
    Engine->>TransactionRegistry: allocate txn_id
    Engine->>Transaction: MvccTransaction(txn_id)

    Client->>Transaction: execute("INSERT INTO users VALUES (...)")
    Transaction->>Engine: get_table_for_transaction("users")
    Engine->>Table: MVCCTable::new_with_shared_store()

    Transaction->>Table: insert(row)
    Table->>TxnStore: add_local_version(row_version)
    Note over TxnStore: Buffered in memory<br/>Not yet visible

    Client->>Transaction: commit()
    Transaction->>Table: commit()

    Table->>Index: update indexes (old_value, new_value)
    Table->>TxnStore: commit()
    TxnStore->>VerStore: add_versions_batch()

    VerStore->>RowArena: insert_row()
    Note over VerStore: Versions now visible<br/>to new transactions

    VerStore->>WAL: record_operation()
    WAL-->>VerStore: LSN

    Table-->>Transaction: Ok
    Transaction->>TransactionRegistry: mark_committed(commit_seq)
    Transaction-->>Client: Ok

Storage Components

Table Structure

Tables in Oxibase are composed of:

Metadata - Schema information, column definitions, and indexes
Row Data - The primary data storage, organized by row
Version Store - Tracks row versions for MVCC
Indexes - B-tree, Hash, Bitmap, and multi-column indexes
Transaction Manager - Manages transaction state and visibility

Data Types

Oxibase supports a variety of data types, each with optimized storage:

INTEGER - 64-bit signed integers
FLOAT - 64-bit floating-point numbers
TEXT - Variable-length UTF-8 strings
BOOLEAN - Boolean values (true/false)
TIMESTAMP - Date and time values
JSON - JSON documents
NULL - Null values supported for all types

Version Management

Oxibase tracks different versions of data for transaction isolation:

Each change creates a new version rather than overwriting
Versions are associated with transaction IDs
Visibility rules determine which versions each transaction can see
Old versions are garbage collected when no longer needed

Data Storage Format

In-Memory Format

In memory, data is stored with these characteristics:

Row-based primary storage - Records are stored as coherent rows
Version chains - Linked versions for MVCC
Type-specific indexes - B-tree, Hash, Bitmap based on column type
Efficient structures - Optimized for different data types

On-Disk Format

When persistence is enabled, data is stored on disk with:

Binary serialization - Compact binary format for storage
WAL files - Write-ahead log for durability
Snapshot files - Point-in-time table snapshots
Metadata files - Schema and index information

MVCC Implementation

The storage engine uses MVCC to provide transaction isolation:

Full Version Chains - Version history per row linked via pointers
Transaction IDs - Each version is associated with a transaction ID
Visibility Rules - Traverse version chains to find visible versions
Lock-Free Reads - Readers never block writers
Automatic Cleanup - Old versions garbage collected when no longer needed

For more details, see the MVCC Implementation and Transaction Isolation documentation.

Version Chain Management

Oxibase implements MVCC by maintaining linked version chains for each row. Each VersionChainEntry contains a RowVersion and an Arc pointer to the previous version, enabling O(1) chain cloning for snapshot isolation.

graph LR
    subgraph "versions: ConcurrentInt64Map<VersionChainEntry>"
        row1["row_id: 1"]
        row2["row_id: 2"]
        row3["row_id: 3"]
    end

    subgraph "Version Chain for row_id=1"
        v3["VersionChainEntry<br/>txn_id: 30<br/>deleted_at: 30<br/>arena_idx: None"]
        v2["VersionChainEntry<br/>txn_id: 20<br/>deleted_at: 0<br/>arena_idx: Some(15)"]
        v1["VersionChainEntry<br/>txn_id: 10<br/>deleted_at: 0<br/>arena_idx: Some(5)"]

        v3 -->|"prev: Arc"| v2
        v2 -->|"prev: Arc"| v1
        v1 -->|"prev: None"| end1["Initial version"]
    end

    subgraph "RowArena - Contiguous Memory"
        arena_idx_5["idx 5: row_id=1, txn=10, data=[...]"]
        arena_idx_15["idx 15: row_id=1, txn=20, data=[...]"]
        arena_other["Other rows..."]
    end

    row1 --> v3
    v1 -.arena_idx: Some(5).-> arena_idx_5
    v2 -.arena_idx: Some(15).-> arena_idx_15

Key data structures:

versions: ConcurrentInt64Map<VersionChainEntry> - Maps row_id to the head of its version chain
VersionChainEntry - Linked list node with version: RowVersion, prev: Option<Arc<...>>, and arena_idx: Option<usize>
RowArena - Contiguous memory storage providing 50x+ faster full table scans via zero-copy reads

The use of Arc for the prev pointer enables cheap snapshot cloning—creating a snapshot of a table is O(1) because the chain is reference-counted rather than deep-copied.

Transaction Version Store

The TransactionVersionStore buffers uncommitted changes within a transaction, maintaining write-set information for conflict detection during commit.

Field	Type	Purpose
`local_versions`	`Int64Map<RowVersion>`	Uncommitted inserts/updates/deletes
`write_set`	`Int64Map<WriteSetEntry>`	Conflict detection: tracks read version and sequence number
`version_store`	`Arc<VersionStore>`	Reference to global committed storage
`txn_id`	`i64`	Transaction identifier

Commit process:

Check uniqueness constraints on all indexes
Update indexes with (old_value, new_value) pairs
Flush local_versions to VersionStore in batch
Mark zone maps as stale (for optimizer statistics)

Rollback process:

Discard all entries in local_versions
Clear write_set tracking data

Index Architecture

Oxibase provides three index types, each optimized for different query patterns. The MVCCTable automatically selects the optimal index type based on column data types.

graph TB
    subgraph "Index Selection Logic"
        auto_select["MVCCTable::auto_select_index_type()<br/>table.rs:105-130"]
    end

    subgraph "Index Type Decision Tree"
        check_bool["Is column BOOLEAN?"]
        check_text["Is column TEXT/JSON?"]
        check_numeric["Is column INTEGER/FLOAT/TIMESTAMP?"]

        bitmap_idx["IndexType::Bitmap<br/>Fast AND/OR/NOT<br/>Low cardinality"]
        hash_idx["IndexType::Hash<br/>O(1) equality<br/>Avoids O(strlen)"]
        btree_idx["IndexType::BTree<br/>O(log n) range queries<br/>Cached min/max"]
    end

    subgraph "Index Implementations"
        BTreeIndexImpl["BTreeIndex<br/>btree_index.rs:74-113<br/>• BTreeMap: sorted values<br/>• AHashMap: O(1) equality<br/>• FxHashMap: row->value"]
        HashIndexImpl["HashIndex<br/>hash_index.rs:105-127<br/>• ahash: fast hashing<br/>• SmallVec: inline storage<br/>• Collision handling"]
        BitmapIndexImpl["BitmapIndex<br/>• Roaring bitmaps<br/>• Bitwise operations<br/>• Compressed storage"]
    end

    auto_select --> check_bool
    check_bool -->|Yes, single column| bitmap_idx
    check_bool -->|No| check_text
    check_text -->|Yes| hash_idx
    check_text -->|No| check_numeric
    check_numeric -->|Yes| btree_idx

    bitmap_idx --> BitmapIndexImpl
    hash_idx --> HashIndexImpl
    btree_idx --> BTreeIndexImpl

Index selection rules:

BOOLEAN → BitmapIndex for fast bitwise AND/OR/NOT operations
TEXT/JSON → HashIndex to avoid O(strlen) comparisons in B-tree nodes
INTEGER/FLOAT/TIMESTAMP → BTreeIndex for range query support
Mixed types → BTreeIndex as safe default

BTreeIndex Internal Structure

The BTreeIndex maintains dual data structures for optimal performance across different query types:

Data Structure	Type	Purpose	Complexity
`sorted_values`	`RwLock<BTreeMap<Value, RowIdSet>>`	Range queries, MIN/MAX	O(log n + k)
`value_to_rows`	`RwLock<AHashMap<Value, RowIdSet>>`	Equality lookups	O(1)
`row_to_value`	`RwLock<FxHashMap<i64, Value>>`	Removal by row_id	O(1)
`cached_min`	`RwLock<Option<Value>>`	MIN aggregate	O(1)
`cached_max`	`RwLock<Option<Value>>`	MAX aggregate	O(1)

The dual-index strategy (BTreeMap + HashMap) trades memory (~2x for unique values) for optimal query performance: O(1) equality via hash lookup and O(log n + k) range queries via sorted iteration.

HashIndex Characteristics

The HashIndex is designed for high-cardinality TEXT columns where equality queries dominate:

Advantages:

O(1) exact match via ahash (faster than SipHash)
Avoids O(strlen) string comparisons in B-tree traversal
SmallVec<[i64; 4]> reduces allocations for unique indexes

Limitations:

Does NOT support range queries
Does NOT support ORDER BY optimization
Requires exact match on all indexed columns (no partial key lookups)

Triple-lock write pattern:

add() acquires:
hash_to_rows: Write
row_to_hash: Write
hash_to_values: Write

This ensures atomicity but serializes concurrent writes. Read operations (find) only acquire a single read lock for minimal contention.

Row Arena: Zero-Copy Scanning

The RowArena stores row data in contiguous memory to eliminate per-row allocation overhead during full table scans:

Performance characteristics:

50x+ speedup for full table scans vs. per-row cloning
Pre-acquire locks once per scan (O(1) instead of O(N))
Direct slice access via unsafe bounds-checked reads
Cache locality from contiguous layout

Arena structure:

arena_rows: Vec<RowMetadata>  // [start, end, row_id, txn_id, create_time, deleted_at]
arena_data: Vec<Value>         // Flattened row data
row_arena_index: Int64Map<usize>  // row_id -> arena_idx

Usage pattern:

Acquire arena_rows and arena_data read guards once
Iterate version chains checking visibility
For visible versions, read directly from arena via arena_idx
Release locks after iteration completes
Sort results by row_id (if needed)

Data Access Paths

Point Lookups

For point queries (e.g., WHERE id = 5):

Use primary key or index to locate the row
Apply visibility rules based on transaction
Return the visible version

Range Scans

For range queries (e.g., WHERE price > 100):

Use B-tree index if available for the column
Scan matching index entries
Apply visibility rules to each row
Return visible results

Full Table Scans

For queries without applicable indexes:

Scan all rows in the table
Apply WHERE clause filters
Apply visibility rules
For large tables, parallelize the scan

Data Modification

Insert Operations

When data is inserted:

Values are validated against column types
A new row version is created with the current transaction ID
The row is added to the primary row storage
Indexes are updated
The operation is recorded in the WAL (if enabled)

Update Operations

When data is updated:

The existing row is located via indexes or scan
A new version is created with updated values
The new version links to the previous version
Indexes are updated to reflect the changes
The operation is recorded in the WAL (if enabled)

Delete Operations

When data is deleted:

The existing row is located
A deletion marker version is created
Indexes are updated to reflect the deletion
The operation is recorded in the WAL (if enabled)

Persistence and Recovery

When persistence is enabled:

Write-Ahead Logging (WAL)

The storage engine provides crash consistency through Write-Ahead Logging (WAL) and periodic snapshots:

WAL Manager responsibilities:

Sequential writes with CRC32 checksums
Two-phase recovery: Phase 1 identifies committed transactions, Phase 2 applies their changes
Compression (optional) for reduced I/O

Snapshot system:

Binary format with magic bytes (0x50414E53 = “SNAP”)
Atomic 3-phase writes: temp file → sync → rename
Tracks source LSN for incremental recovery

Recovery flow:

Read snapshot metadata to get checkpoint LSN
Load table snapshots (fastest recovery path)
Replay WAL entries from checkpoint LSN forward
Apply only entries from committed transactions (two-phase)
Populate indexes in single pass after replay completes

Write-Ahead Logging (WAL)

All modifications are recorded in the WAL before being applied
WAL entries include transaction ID, operation type, and data
WAL is flushed to disk for durability
This ensures recovery in case of crashes

Snapshots

Periodically, consistent snapshots of tables are created
Snapshots contain the latest version of each row
Snapshots accelerate recovery compared to replaying the entire WAL

Recovery Process

After a crash, recovery proceeds as follows:

The latest valid snapshot is loaded for each table
WAL entries after the snapshot are replayed
Index definitions are restored and indexes rebuilt
Incomplete transactions are rolled back

Schema Management

The MVCCEngine maintains the schema catalog and handles DDL operations:

Schema storage:

schemas: Arc<RwLock<FxHashMap<String, Schema>>>
Lowercase table names for case-insensitive lookups
Schema changes recorded to WAL

ALTER TABLE operations:

AddColumn: Normalizes existing rows, adds default values
DropColumn: Truncates row data
RenameColumn: Updates schema, preserves data
ModifyColumn: Type coercion with validation
RenameTable: Updates all references atomically

Schema normalization: When reading rows, MVCCTable normalizes row data to match the current schema, handling columns added/dropped via ALTER TABLE by filling defaults or truncating.

Implementation Details

Core storage engine components in the Rust codebase:

src/storage/
├── mod.rs              # Storage module entry point
├── traits/             # Storage interfaces
│   ├── engine.rs       # Engine trait
│   ├── table.rs        # Table trait
│   └── transaction.rs  # Transaction trait
└── mvcc/               # MVCC implementation
    ├── engine.rs       # MVCC storage engine
    ├── table.rs        # Table with row storage
    ├── transaction.rs  # Transaction management
    ├── version_store.rs # Version tracking
    ├── btree_index.rs  # B-tree index
    ├── hash_index.rs   # Hash index
    ├── bitmap_index.rs # Bitmap index
    ├── multi_column_index.rs # Multi-column index
    └── persistence.rs  # WAL and snapshots

Performance Characteristics

Read Performance

Point Lookups - O(1) with hash index, O(log n) with B-tree
Range Scans - O(log n + k) with B-tree index
Full Scans - Parallelized for large tables

Write Performance

Inserts - O(log n) per index
Updates - O(log n) per index plus version creation
Deletes - O(log n) per index for marker creation

Concurrency

High Read Concurrency - MVCC enables many concurrent readers
Write Concurrency - Multiple writers with conflict detection
No Read Locks - Readers never block on writes