HNSW Vector Index Performance Issue with Batched Transactions
Problem
Populating an HNSW vector index using batched transactions causes severe performance degradation (5-10x slower) compared to a single large transaction.
Performance Comparison:
- Single transaction: ~278 vectors/sec (stable)
- Batched transactions: 781 → 149 → 97 → 64 vectors/sec (progressive slowdown)
Root Cause
Each transaction commit forces:
- Disk I/O for index metadata
- Page cache invalidation
- HNSW graph state persistence/reload
Single transactions keep all HNSW updates in memory until final commit, avoiding repeated disk flushes.
Reproducible Example
Dataset: 9,742 vectors (384 dimensions)
// FAST: Single transaction (35.0s)
db.begin();
for (Vertex v : vertices) { index.add(v); }
db.commit();
// SLOW: Batched transactions (5-10x slower)
for (List<Vertex> batch : batches) {
  db.begin();
  for (Vertex v : batch) { index.add(v); }
  db.commit();
}
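For completeness, a minimal timing harness along these lines reproduces the per-batch throughput numbers above. It assumes `vertices` is a `java.util.List<Vertex>` and that `db` and `index` are the same objects as in the snippets; the batch size of 1,000 is arbitrary.

// Sketch of a per-batch throughput measurement, assuming `db`, `index` and
// `vertices` are set up as in the snippets above; batchSize is arbitrary.
final int batchSize = 1_000;
for (int start = 0; start < vertices.size(); start += batchSize) {
  final List<Vertex> batch =
      vertices.subList(start, Math.min(start + batchSize, vertices.size()));

  final long t0 = System.nanoTime();
  db.begin();
  for (final Vertex v : batch)
    index.add(v);                        // same call as in the snippet above
  db.commit();                           // each commit persists HNSW state to disk
  final long elapsedMs = Math.max(1, (System.nanoTime() - t0) / 1_000_000);

  System.out.printf("batch of %d: %.1f vectors/sec%n",
      batch.size(), batch.size() * 1000.0 / elapsedMs);
}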
Impact
- Bulk indexing: Re-indexing takes 5-10x longer with batching
- Memory vs. speed trade-off: Users batch to avoid OOM, but performance becomes unusable
- Scalability: Throughput drops with each successive batch, so the slowdown grows worse as the dataset size increases
Suggested Solutions
Option 1: Buffer HNSW updates in memory
- Defer edge persistence until explicit flush or buffer threshold
- Add an index.flush() method for manual control (see the sketch below)
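A hypothetical sketch of how Option 1 could look from the caller's side; neither the buffering behavior nor index.flush() exists in the current API:

// Hypothetical API sketch for Option 1 -- index.flush() does not exist today.
db.begin();
for (Vertex v : vertices)
  index.add(v);        // updates buffered in memory, no per-call persistence
index.flush();         // single explicit flush of the buffered HNSW graph state
db.commit();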
Option 2: Transaction-aware optimization
- Optimize the commit path when multiple add() calls occur in the same transaction
- Skip intermediate persistence and flush once at transaction end (sketched below)
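A schematic sketch only, not ArcadeDB internals: the idea is to buffer HNSW mutations per transaction and persist them once from the commit path. Vertex is assumed to be the ArcadeDB vertex type, and persistPages() is a hypothetical helper.

// Schematic only, not ArcadeDB internals: buffer HNSW mutations per transaction
// and persist them once when the transaction commits.
import java.util.ArrayList;
import java.util.List;

class TxAwareHnswIndex {
  private final List<Vertex> pending = new ArrayList<>();

  void add(final Vertex v) {
    // Update the in-memory HNSW graph and remember the vertex; no disk I/O here.
    pending.add(v);
  }

  void onTxCommit() {
    // Called once from the commit path: persist all buffered updates together.
    persistPages(pending);
    pending.clear();
  }

  private void persistPages(final List<Vertex> batch) {
    // Hypothetical: write the affected index pages in a single pass.
  }
}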
Option 3: Document the requirement
- Warn users that bulk indexing requires single transaction
- Note OOM risk for very large datasets
Environment
- ArcadeDB 24.11.1
- HNSW: m=16, ef=128
- JVM: 8GB heap
@tae898 could you please try the same test with the new LSM Vector?
Yes, I will! Excited for the new vector implementation.