float performance tester, for demo only
This PR has a standalone test program that demonstrates some interesting performance characteristics of different data access patterns. It's not intended to be merged; I'm just putting it up here for visibility and discussion.
I have been working on improving the performance of vector KNN search, in particular on closing the gap with the nmslib/hnswlib reference implementation. hnswlib is coded in C++, but I believe a pure Java implementation should be able to get pretty close. My goal has been to get within 2x, but we're still pretty far off from that, maybe an 8x difference. Looking at the profiler, we seem to spend all our time in dot product computation, which is expected. So I wrote a simple benchmark to look more closely at this aspect and stumbled on something that really surprised me, demonstrated by this program.
The speed of the bulk dot product computation we do (a thousand or so per query, for an index of 1M vectors) is heavily influenced by memory access patterns. I'm guessing it has something to do with the ability to use the CPU's memory caches, avoiding the need to go out to main memory.
In this micro-benchmark I compared a few different memory access patterns, computing 1M dot products across pairs of vectors taken from the same set. The fastest is to load all floats into a single contiguous on-heap array and access it via offsets, which is essentially what hnswlib does with pointers. I compared that with various other models, including one simulating what we do today in Lucene: memory-mapping a file, reading it as bytes, and converting them into a float array on each access. If we access the vector data sequentially, there is a 4x difference in speed, but even for random access there is nearly a 2x difference.
The MANY ARRAYS case pre-loads all vectors on heap, but stores them in separate arrays per vector, rather than in a single contiguous array. The SKIP ONE COPY case is like the BASELINE, but simulates what we might see if we implemented IndexInput.readFloats, so we could avoid one array copy that's needed today.
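To make the compared patterns concrete, here is a minimal sketch of the three main shapes (the names and dimension are illustrative, not the actual benchmark code; a plain `ByteBuffer` stands in for the memory-mapped file in the baseline case):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class DotProductPatterns {
  static final int DIM = 256;

  // ONE BIG ARRAY: all vectors packed contiguously, addressed by offset
  static float dotOneBigArray(float[] all, int aOrd, int bOrd) {
    int a = aOrd * DIM, b = bOrd * DIM;
    float sum = 0;
    for (int i = 0; i < DIM; i++) {
      sum += all[a + i] * all[b + i];
    }
    return sum;
  }

  // MANY ARRAYS: one float[] per vector, still fully on heap
  static float dotManyArrays(float[][] vectors, int aOrd, int bOrd) {
    float[] a = vectors[aOrd], b = vectors[bOrd];
    float sum = 0;
    for (int i = 0; i < DIM; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // BASELINE-like: decode each vector from bytes into a scratch array on every access
  static float dotFromBytes(ByteBuffer data, int aOrd, int bOrd, float[] scratchA, float[] scratchB) {
    readVector(data, aOrd, scratchA);
    readVector(data, bOrd, scratchB);
    float sum = 0;
    for (int i = 0; i < DIM; i++) {
      sum += scratchA[i] * scratchB[i];
    }
    return sum;
  }

  static void readVector(ByteBuffer data, int ord, float[] dest) {
    ByteBuffer slice = data.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    slice.position(ord * DIM * Float.BYTES);
    slice.asFloatBuffer().get(dest, 0, DIM);
  }
}
```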
random access
| pattern | time/iteration |
|---|---|
| BASELINE | 0.594572 us |
| SKIP ONE COPY | 0.401249 us |
| MANY ARRAYS | 0.393746 us |
| ONE BIG ARRAY | 0.330135 us |
sequential access
| pattern | time/iteration |
|---|---|
| BASELINE | 0.443061 us |
| MANY ARRAYS | 0.188859 us |
| SKIP ONE COPY | 0.154549 us |
| ONE BIG ARRAY | 0.109249 us |
At @rmuir's suggestion, I re-coded this as a JMH benchmark (the rough shape of the on-heap case is sketched after the table below), with similar results:
| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| TestIndexInputReadFloats.testIndexInputReadFloats | thrpt | 25 | 1371.856 | ± 11.681 | ops/s |
| TestLuceneBaseline.testLuceneBaseline | thrpt | 25 | 572.370 | ± 3.918 | ops/s |
| TestOnHeapArray.testOnHeapArray | thrpt | 25 | 1915.688 | ± 1.233 | ops/s |
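For reference, the on-heap JMH case has roughly this shape (class names, field names, and sizes here are illustrative, not the exact code in this PR):

```java
import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
public class OnHeapArrayBench {
  private static final int DIM = 256;
  private static final int NUM_VECTORS = 100_000;
  private float[] vectors;

  @Setup
  public void setup() {
    Random random = new Random(42);
    vectors = new float[NUM_VECTORS * DIM];
    for (int i = 0; i < vectors.length; i++) {
      vectors[i] = random.nextFloat();
    }
  }

  @Benchmark
  public float dotProducts() {
    // compute a batch of dot products between adjacent vectors;
    // returning the total keeps the JIT from eliminating the work
    float total = 0;
    for (int ord = 0; ord + 1 < NUM_VECTORS; ord += 2) {
      int a = ord * DIM, b = (ord + 1) * DIM;
      float sum = 0;
      for (int i = 0; i < DIM; i++) {
        sum += vectors[a + i] * vectors[b + i];
      }
      total += sum;
    }
    return total;
  }
}
```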
It seems as if a large benefit can be had by implementing IndexInput.readFloats, so I'll work that up, but there's still a fairly big gap compared with on-heap arrays. Maybe some kind of loop unrolling / vectorization? I'll look at the asm and see what I can tease out of that.
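As a starting point for the unrolling idea, here's a minimal sketch of a manually unrolled dot product with multiple accumulators (not benchmarked here, and note that it changes the floating-point summation order relative to the simple loop):

```java
// 4-way unrolled dot product; separate accumulators reduce the dependency
// chain on the adds, which may let the JIT schedule/vectorize more freely
static float dotUnrolled(float[] a, float[] b) {
  float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  int i = 0;
  for (; i + 3 < a.length; i += 4) {
    s0 += a[i] * b[i];
    s1 += a[i + 1] * b[i + 1];
    s2 += a[i + 2] * b[i + 2];
    s3 += a[i + 3] * b[i + 3];
  }
  float sum = s0 + s1 + s2 + s3;
  // tail for dimensions not divisible by 4
  for (; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}
```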