HNSW Vector Index Performance Issue with Batched Transactions

Open tae898 opened this issue 1 month ago • 2 comments

Problem

Populating an HNSW vector index using batched transactions causes severe performance degradation (5-10x slower) compared to a single large transaction.

Performance Comparison:

  • Single transaction: ~278 vectors/sec (stable)
  • Batched transactions: 781 → 149 → 97 → 64 vectors/sec (progressive slowdown)

Root Cause

Each transaction commit forces:

  • Disk I/O for index metadata
  • Page cache invalidation
  • HNSW graph state persistence/reload

A single transaction keeps all HNSW updates in memory until the final commit, avoiding repeated disk flushes.

Reproducible Example

Dataset: 9,742 vectors (384 dimensions)

// FAST: Single transaction (35.0s)
db.begin();
for (Vertex v : vertices) { index.add(v); }
db.commit();

// SLOW: Batched transactions (5-10x slower)
for (List<Vertex> batch : batches) {
  db.begin();
  for (Vertex v : batch) { index.add(v); }
  db.commit();
}
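
To make the progressive slowdown easy to report, the batched loop above can be wrapped in a small timing harness that records the rate of each batch separately. The following is a minimal sketch, not ArcadeDB code: the class BatchThroughput, its methods, and the insertBatch callback are hypothetical names, and the callback is assumed to wrap the begin / add / commit sequence shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not part of ArcadeDB.
final class BatchThroughput {

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return batches;
    }

    /** Runs insertBatch once per batch and returns each batch's items/sec rate. */
    static <T> List<Double> measure(List<List<T>> batches, Consumer<List<T>> insertBatch) {
        List<Double> rates = new ArrayList<>();
        for (List<T> batch : batches) {
            long start = System.nanoTime();
            // Assumed to do: db.begin(); add every item to the index; db.commit();
            insertBatch.accept(batch);
            // Guard against a zero interval on very fast callbacks.
            long elapsedNanos = Math.max(1L, System.nanoTime() - start);
            rates.add(batch.size() * 1e9 / elapsedNanos);
        }
        return rates;
    }
}
```

Printing the returned list after a run gives a per-batch series comparable to the 781 → 149 → 97 → 64 vectors/sec figures reported above.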

Impact

  • Bulk indexing: Re-indexing takes 5-10x longer with batching
  • Memory vs speed trade-off: Users batch to avoid out-of-memory errors, but throughput then degrades until it is unusable
  • Scalability: Throughput keeps falling as more vectors are indexed (781 → 64 vectors/sec in the run above)

Suggested Solutions

Option 1: Buffer HNSW updates in memory

  • Defer edge persistence until explicit flush or buffer threshold
  • Add index.flush() method for manual control
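
A rough sketch of what Option 1 could look like from the caller's side, assuming a threshold-based buffer with a manual flush. BufferedIndexWriter and its sink callback are hypothetical and do not exist in ArcadeDB; index.flush() is only the method proposed above. The sink stands in for whatever would persist a group of HNSW updates in one step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration of Option 1; not an ArcadeDB class.
final class BufferedIndexWriter<T> {
    private final int threshold;
    private final Consumer<List<T>> sink; // assumed to persist one group of updates
    private final List<T> buffer = new ArrayList<>();
    private int flushCount = 0;

    BufferedIndexWriter(int threshold, Consumer<List<T>> sink) {
        if (threshold <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        this.threshold = threshold;
        this.sink = sink;
    }

    /** Buffers one item and flushes automatically once the threshold is reached. */
    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= threshold) {
            flush();
        }
    }

    /** Hands all buffered items to the sink in a single call; no-op when empty. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // copy, so the sink may retain the list
        buffer.clear();
        flushCount++;
    }

    int flushCount() { return flushCount; }

    int pending() { return buffer.size(); }
}
```

The trade-off in this sketch is that buffered items are held in memory and are not durable until flush() runs, so the threshold bounds both memory use and the amount of work lost on a crash.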

Option 2: Transaction-aware optimization

  • Optimize commit path when multiple add() calls occur in same transaction
  • Skip intermediate persistence, flush once at transaction end

Option 3: Document the requirement

  • Warn users that bulk indexing requires single transaction
  • Note OOM risk for very large datasets

Environment

  • ArcadeDB 24.11.1
  • HNSW: m=16, ef=128
  • JVM: 8GB heap

tae898 • Nov 10 '25 11:11

@tae898 could you please try the same test with the new LSM Vector?

lvca • Dec 09 '25 18:12

> @tae898 could you please try the same test with the new LSM Vector?

Yes, I will! Excited for the new vector implementation.

tae898 • Dec 09 '25 21:12