MVCC Implementation
This document provides a detailed explanation of Oxibase’s Multi-Version Concurrency Control (MVCC) implementation, which enables transaction isolation without locking.
MVCC Overview
Multi-Version Concurrency Control (MVCC) is a concurrency control method used by Oxibase to provide transaction isolation. The key principles are:
- Maintain full version chains for each row with unlimited history
- Track deletion status with transaction IDs for proper visibility
- Each transaction has a consistent view based on visibility rules
- Reads never block writes, and writes never block reads
- Implement optimistic concurrency control for conflict detection
Core Components Overview
graph TB
subgraph "Transaction Context"
MVCCTable["MVCCTable<br/>txn_id: i64<br/>cached_schema: Schema"]
TxnVerStore["Arc<RwLock<TransactionVersionStore>><br/>local_versions: Int64Map<br/>write_set: FxHashMap"]
end
subgraph "Global State"
VersionStore["VersionStore<br/>versions: ConcurrentInt64Map<br/>schema: RwLock<Schema><br/>indexes: RwLock<FxHashMap>"]
Arena["RowArena<br/>arena_rows: Vec<RowMetadata><br/>arena_data: Vec<Value>"]
end
subgraph "Version Chain"
V1["VersionChainEntry<br/>version: RowVersion<br/>prev: Option<Arc<...>><br/>arena_idx: Option<usize>"]
V2["VersionChainEntry<br/>older version"]
V3["VersionChainEntry<br/>oldest version"]
V1 -->|"Arc::clone()"| V2
V2 -->|"Arc::clone()"| V3
end
MVCCTable -->|"reads from"| VersionStore
MVCCTable -->|"writes to"| TxnVerStore
MVCCTable -->|"shares"| TxnVerStore
VersionStore -->|"stores"| V1
VersionStore -->|"uses"| Arena
V1 -.->|"arena_idx"| Arena
Design Philosophy
Oxibase implements a true multi-version MVCC design:
- Full Version Chains: Unlimited version history per row linked via
prevpointers - In-Memory Chains: Version chains built from WAL replay during recovery
- Immutable Versions: New versions always created, never modified in place
- Efficient Persistence: Only latest version persisted to disk snapshots
- Automatic Cleanup: Old versions garbage collected when no longer needed
Core Components
Transaction Registry
- Manages transaction lifecycle and state tracking
- Assigns unique transaction IDs using atomic counters
- Tracks active and committed transactions with monotonic sequences
- Supports per-transaction isolation levels without race conditions
- Implements visibility rules for both READ COMMITTED and SNAPSHOT isolation
Version Store
- Maintains full version chains for each row
- Tracks both creation (
txn_id) and deletion (deleted_at_txn_id) transaction IDs - Implements efficient concurrent access using concurrent data structures
- Manages B-tree, Hash, and Bitmap indexes
- Provides visibility-aware traversal of version chains
Row Version Structure
struct RowVersion {
txn_id: i64, // Transaction that created this version
deleted_at_txn_id: i64, // Transaction that deleted this version (0 if not deleted)
data: Row, // Complete row data
row_id: i64, // Row identifier
create_time: i64, // Timestamp when created
prev: Option<Box<RowVersion>>, // Previous version in the chain
}
The prev pointer creates a backward-linked chain from newest to oldest version.
Version Chains with Arc Pointers
graph LR
subgraph "versions: ConcurrentInt64Map"
Head["row_id: 100<br/>↓<br/>VersionChainEntry"]
end
Head -->|"prev: Arc"| V2["VersionChainEntry<br/>txn_id: 20<br/>deleted_at: 0"]
V2 -->|"prev: Arc"| V3["VersionChainEntry<br/>txn_id: 10<br/>deleted_at: 0"]
V3 -.->|"prev: None"| End["None"]
Head2["Different transaction<br/>clones Arc"] -.->|"Arc::clone()"| V2
VersionStore: Global Committed State
graph TB
subgraph "VersionStore"
Versions["versions<br/>ConcurrentInt64Map<i64, VersionChainEntry><br/>(row_id → chain head)"]
Schema["schema<br/>RwLock<Schema>"]
Indexes["indexes<br/>RwLock<FxHashMap<String, Arc<dyn Index>>>"]
Arena["arena<br/>RowArena"]
RowArenaIndex["row_arena_index<br/>RwLock<Int64Map<usize>>"]
UncommittedWrites["uncommitted_writes<br/>ConcurrentInt64Map<i64, i64><br/>(row_id → txn_id)"]
end
Versions --> Chain["Version chains"]
Arena --> ArenaRows["arena_rows: Vec<RowMetadata>"]
Arena --> ArenaData["arena_data: Vec<Value>"]
RowArenaIndex -.->|"maps row_id"| Arena
Arena-Based Storage for Zero-Copy Scans
graph TB
subgraph "RowArena Structure"
direction LR
subgraph "arena_rows: Vec<RowMetadata>"
M0["[0] RowMetadata<br/>row_id: 1<br/>start: 0, end: 3<br/>txn_id: 10"]
M1["[1] RowMetadata<br/>row_id: 2<br/>start: 3, end: 6<br/>deleted_at: 15"]
M2["[2] RowMetadata<br/>row_id: 1<br/>start: 6, end: 9<br/>txn_id: 20"]
end
subgraph "arena_data: Vec<Value>"
D0["[0] Integer(100)"]
D1["[1] Text('Alice')"]
D2["[2] Boolean(true)"]
D3["[3] Integer(200)"]
D4["[4] Text('Bob')"]
D5["[5] Boolean(false)"]
D6["[6] Integer(150)"]
D7["[7] Text('Alice')"]
D8["[8] Boolean(true)"]
end
end
M0 -.->|"[start..end]"| D0
M0 -.-> D1
M0 -.-> D2
M1 -.-> D3
M1 -.-> D4
M1 -.-> D5
M2 -.-> D6
M2 -.-> D7
M2 -.-> D8
TransactionVersionStore: Uncommitted Changes
graph TB
subgraph "TransactionVersionStore"
LocalVers["local_versions<br/>Int64Map<i64, RowVersion><br/>(row_id → pending version)"]
WriteSet["write_set<br/>FxHashMap<i64, WriteSetEntry><br/>(row_id → read version)"]
OldRows["old_rows<br/>FxHashMap<i64, Row><br/>(row_id → pre-update row)"]
end
subgraph "WriteSetEntry"
ReadVer["read_version: Option<RowVersion>"]
ReadSeq["read_version_seq: i64"]
end
WriteSet --> ReadVer
WriteSet --> ReadSeq
Conflict Detection Flow
sequenceDiagram
participant Txn as Transaction
participant TVS as TransactionVersionStore
participant VS as VersionStore
Txn->>VS: get_visible_version(row_id=1)
VS-->>Txn: RowVersion(txn_id=10, seq=50)
Txn->>TVS: put(row_id=1, new_data)
TVS->>TVS: write_set[1] = {read_version: v10, read_seq: 50}
TVS->>TVS: local_versions[1] = new_version
Note over Txn: Time passes...
Txn->>TVS: commit()
TVS->>VS: get_current_sequence()
VS-->>TVS: current_seq=75
alt read_seq (50) < current_seq (75)
TVS->>VS: get_visible_version(row_id=1, current_txn)
VS-->>TVS: RowVersion(txn_id=20, seq=60)
TVS->>TVS: Conflict! (read v10 but v20 was committed)
TVS-->>Txn: Error: write-write conflict
else read_seq == current_seq
TVS->>VS: add_versions_batch(local_versions)
TVS-->>Txn: Commit successful
end
Transaction IDs and Timestamps
Oxibase uses monotonic sequences instead of wall-clock timestamps to avoid platform-specific timing issues:
- Transaction ID: Unique identifier assigned atomically
- Begin Sequence: Monotonic sequence when transaction starts
- Commit Sequence: Monotonic sequence when transaction commits
- Write Sequences: Track when rows were last modified for conflict detection
This approach solves Windows’ 15.6ms timer resolution issue and ensures consistent ordering.
Isolation Levels
READ COMMITTED (Default)
- Transactions see committed changes immediately
- No global locks for commits - high concurrency
- Each statement sees the latest committed data
- Suitable for most OLTP workloads
Implementation:
// In READ COMMITTED, only check if transaction is committed
fn is_directly_visible(&self, version_txn_id: i64) -> bool {
self.committed_transactions.contains(version_txn_id)
}
SNAPSHOT Isolation
- Transactions see a consistent snapshot from when they started
- Write-write conflict detection prevents lost updates
- Lock-free commits with optimistic concurrency control
- High throughput with strong consistency guarantees
Implementation:
// In SNAPSHOT, check if version was committed before viewer began
fn is_visible(&self, version_txn_id: i64, viewer_txn_id: i64) -> bool {
if let Some(commit_ts) = self.committed_transactions.get(version_txn_id) {
let viewer_begin_ts = self.get_transaction_begin_seq(viewer_txn_id);
commit_ts <= viewer_begin_ts
} else {
false
}
}
Visibility Rules
Version Chain Traversal
Visibility is determined by traversing the version chain:
// Traverse the version chain from newest to oldest
let mut current = version_ptr;
while let Some(version) = current {
if registry.is_visible(version.txn_id, txn_id) {
// Check deletion visibility
if version.deleted_at_txn_id != 0
&& registry.is_visible(version.deleted_at_txn_id, txn_id) {
return None; // Deleted
}
return Some(version); // Found visible version
}
current = version.prev.as_ref();
}
Row Visibility
A row is visible to a transaction if:
- A version exists in the chain that was created by a visible transaction, AND
- That version was NOT deleted, OR the deletion is not visible
Transaction-Specific Isolation
Each transaction maintains its own isolation level:
// Set isolation level for specific transaction
registry.set_transaction_isolation_level(txn_id, level);
// Get isolation level for visibility checks
let isolation_level = registry.get_isolation_level(txn_id);
Concurrency Control
SNAPSHOT Isolation Conflicts
Write-write conflict detection during commit:
// Check for conflicts before commit
if version_store.check_write_conflict(&written_rows, begin_seq) {
return Err(OxibaseError::Transaction(
"transaction aborted due to write-write conflict".to_string()
));
}
// Set write sequences after successful commit
version_store.set_write_sequences(&written_rows, commit_seq);
Lock-Free Commit Process
SNAPSHOT commits use optimistic concurrency control:
- Check for write-write conflicts
- Generate commit sequence
- Apply changes to version stores (creating new versions)
- Set write sequences atomically
- Mark transaction as committed
No global mutex needed - conflicts detected through version checks.
Read Path: Query Execution
flowchart TD
Query["SELECT * FROM users<br/>WHERE email = 'alice@example.com'"]
Query --> TryPK("try_pk_lookup<br/>Is WHERE on PK?")
TryPK -->|Yes| PKLookup["get_visible_version(row_id)<br/>O(1) hash lookup"]
TryPK -->|No| TryIndex("try_index_lookup<br/>Is WHERE on indexed column?")
TryIndex -->|Yes| IndexPath["Index lookup"]
TryIndex -->|No| FullScan["Full table scan"]
IndexPath --> HashIndex("Index type?")
HashIndex -->|Hash| HashFind["hash_to_rows.get(hash)<br/>O(1)"]
HashIndex -->|BTree| BTreeFind["sorted_values.range()<br/>O(log n + k)"]
HashIndex -->|Bitmap| BitmapFind["bitmap.and_count()<br/>O(n/64)"]
HashFind --> RowIDs["Vec<i64> row_ids"]
BTreeFind --> RowIDs
BitmapFind --> RowIDs
RowIDs --> BatchGet["get_visible_versions_batch(row_ids)"]
PKLookup --> CheckVis["Check visibility"]
BatchGet --> CheckVis
FullScan --> ArenaIter["get_all_visible_rows_arena()"]
ArenaIter --> CheckVis
CheckVis --> Normalize["normalize_row_to_schema()<br/>(handle ALTER TABLE)"]
Normalize --> Results["Vec<(i64, Row)>"]
Write Path: Insert, Update, Delete
flowchart TD
Insert["MVCCTable::insert(row)"]
Update["MVCCTable::update(filter, updates)"]
Delete["MVCCTable::delete(filter)"]
Insert --> ExtractPK["extract_row_pk(row)<br/>Auto-increment if needed"]
ExtractPK --> CheckUnique["check_unique_constraints(row)"]
CheckUnique --> AddLocal["txn_versions.put(row_id, version)"]
Update --> ScanRows["scan(filter)<br/>Get matching row_ids"]
ScanRows --> BatchGetUpdate["get_visible_versions_for_update()<br/>Read + track in write_set"]
BatchGetUpdate --> ApplyUpdates["Apply SET clauses"]
ApplyUpdates --> BatchPut["Batch: txn_versions.put(...)"]
Delete --> ScanDel["scan(filter)"]
ScanDel --> MarkDeleted["txn_versions.delete(row_id)"]
AddLocal --> LocalStore["TransactionVersionStore<br/>local_versions"]
BatchPut --> LocalStore
MarkDeleted --> LocalStore
LocalStore -.->|"Later: commit()"| FlushGlobal["Flush to VersionStore"]
FlushGlobal --> UpdateIndexes["Update all indexes"]
UpdateIndexes --> ZoneMap["Mark zone_maps_stale"]
Version Chain Management
Updates
When a row is updated:
- A new version is created with the updating transaction’s ID
- The new version’s
prevpointer links to the current version - The version chain grows backward in time
- All historical versions remain accessible
Deletions
When a row is deleted:
- A new version is created with
DeletedAtTxnIDset - The deletion version links to the previous version
- Data is preserved in the deletion version
- The row appears deleted to transactions that see this version
Version Chain Example
[Newest] -> Version 3 (TxnID=300, DeletedAt=400)
|
v
Version 2 (TxnID=200)
|
v
Version 1 (TxnID=100) -> [Oldest]
Row Normalization for Schema Evolution
graph LR
OldRow["Old Row (2 columns)<br/>[100, 'Alice']"]
NewSchema["Current Schema (3 columns)<br/>id, name, age"]
OldRow --> Normalize["normalize_row_to_schema()"]
NewSchema --> Normalize
Normalize --> Check["row.len() vs schema.len()"]
Check -->|"row.len() < schema"| AddDefaults["Append default values<br/>or NULLs"]
Check -->|"row.len() > schema"| Truncate["Truncate extra columns"]
Check -->|"Equal"| NoOp["Return as-is"]
AddDefaults --> NewRow["[100, 'Alice', NULL]"]
Truncate --> NewRow2["Truncated row"]
NoOp --> NewRow3["Original row"]
Performance Optimizations
Lock-Free Data Structures
SegmentInt64Map: High-performance concurrent maps- Atomic operations for counters and flags
- Minimal mutex usage in hot paths
Object Pooling
- Transaction objects
- Table objects
- Version maps
- Reduces GC pressure in high-throughput scenarios
Optimized Visibility Checks
- Fast path for own-transaction visibility
- Direct visibility check for READ COMMITTED
- Batch processing for bulk operations
Memory Management
- Version chains built on-demand from WAL
- Automatic cleanup of old versions no longer needed
- Periodic cleanup of deleted rows
- Cold data eviction to disk
Garbage Collection
Version Chain Cleanup
Old versions in chains are cleaned up when:
- No active transaction can see them
- A newer version is visible to all active transactions
// Find oldest version still needed
let mut current = version;
let mut last_needed = None;
while let Some(v) = current {
for txn_id in &active_transactions {
if registry.is_visible(v.txn_id, *txn_id) {
last_needed = Some(v);
break;
}
}
current = v.prev.as_ref();
}
// Disconnect older versions (Rust's ownership handles cleanup)
if let Some(last) = last_needed {
last.prev = None; // Drop older versions
}
Deleted Row Cleanup
Deleted rows are removed based on:
- Retention period (age-based)
- Transaction visibility (no active transaction can see them)
- Safety checks to prevent removing visible data
Key Implementation Files
src/storage/mvcc/engine.rs- MVCC engine coordinating all componentssrc/storage/mvcc/table.rs- Table operations with MVCC supportsrc/storage/mvcc/transaction.rs- Transaction implementation and conflict detectionsrc/storage/mvcc/version_store.rs- Row version storage and management
Best Practices
- Choose Appropriate Isolation: Use READ COMMITTED unless you need snapshot consistency
- Keep Transactions Short: Long transactions delay garbage collection
- Handle Conflicts: Implement retry logic for SNAPSHOT conflicts
- Monitor Deleted Rows: Ensure garbage collection keeps up with deletions
- Batch Operations: Group related changes in single transactions
Future Improvements
Potential enhancements to the current design:
- Time Travel Queries: Query data as of specific timestamps
- Additional Isolation Levels: REPEATABLE READ, SERIALIZABLE
- Read-Set Tracking: Detect read-write conflicts for SERIALIZABLE
- Savepoint Support: Transaction savepoints for partial rollbacks
- Version Compression: Delta encoding for version chains