float performance tester, for demo only
This PR has a standalone test program that demonstrates some interesting performance characteristics of different data access patterns. It's not intended to be merged; I'm just putting it up here for visibility and discussion.
I have been working on improving the performance of vector KNN search, in particular on closing the gap with the nmslib/hnswlib reference implementation. hnswlib is coded in C++, but I believe a pure Java implementation should be able to get pretty close. My goal has been to get within 2x, but we're still pretty far off from that, maybe an 8x difference. Looking at the profiler, we seem to spend all our time in dot product computation, which is expected. So I wrote a simple benchmark to look more closely at this aspect and stumbled on something that really surprised me, demonstrated by this program.
The speed of the bulk dot product computation we do (a thousand or so per query, for an index of 1M vectors) is heavily influenced by memory access patterns. I'm guessing it has something to do with the ability to use the CPU's memory caches, avoiding the need to go out to main memory.
In this micro-benchmark I compared a few different memory access patterns, computing 1M dot products across pairs of vectors taken from the same set. The fastest is to load all floats into a single contiguous on-heap array and access it via offsets, which is essentially what hnswlib does with pointers. I compared that with various other models, including one simulating what we do today in Lucene: memory-mapping a file, reading it as bytes, and converting them into a float array on each access. If we access the vector data sequentially, there is a 4x difference in speed, but even for random access there is nearly a 2x difference.
The MANY ARRAYS case pre-loads all vectors on heap, but stores them in separate arrays per vector, rather than in a single contiguous array. The SKIP ONE COPY case is like the BASELINE, but simulates what we might see if we implemented IndexInput.readFloats, so we could avoid one array copy that's needed today.
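To make the compared patterns concrete, here is a minimal sketch of the three main shapes (the names and dimension are illustrative, not the actual benchmark code; a plain `ByteBuffer` stands in for the memory-mapped file in the baseline case):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class DotProductPatterns {
  static final int DIM = 256;

  // ONE BIG ARRAY: all vectors packed contiguously, addressed by offset
  static float dotOneBigArray(float[] all, int aOrd, int bOrd) {
    int a = aOrd * DIM, b = bOrd * DIM;
    float sum = 0;
    for (int i = 0; i < DIM; i++) {
      sum += all[a + i] * all[b + i];
    }
    return sum;
  }

  // MANY ARRAYS: one float[] per vector, still fully on heap
  static float dotManyArrays(float[][] vectors, int aOrd, int bOrd) {
    float[] a = vectors[aOrd], b = vectors[bOrd];
    float sum = 0;
    for (int i = 0; i < DIM; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // BASELINE-like: decode each vector from bytes into a scratch array on every access
  static float dotFromBytes(ByteBuffer data, int aOrd, int bOrd, float[] scratchA, float[] scratchB) {
    readVector(data, aOrd, scratchA);
    readVector(data, bOrd, scratchB);
    float sum = 0;
    for (int i = 0; i < DIM; i++) {
      sum += scratchA[i] * scratchB[i];
    }
    return sum;
  }

  static void readVector(ByteBuffer data, int ord, float[] dest) {
    ByteBuffer slice = data.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    slice.position(ord * DIM * Float.BYTES);
    slice.asFloatBuffer().get(dest, 0, DIM);
  }
}
```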
random access
| pattern | time/iteration |
|---|---|
| BASELINE | 0.594572 us |
| SKIP ONE COPY | 0.401249 us |
| MANY ARRAYS | 0.393746 us |
| ONE BIG ARRAY | 0.330135 us |
sequential access
| pattern | time/iteration |
|---|---|
| BASELINE | 0.443061 us |
| MANY ARRAYS | 0.188859 us |
| SKIP ONE COPY | 0.154549 us |
| ONE BIG ARRAY | 0.109249 us |
At @rmuir's suggestion, I re-coded this as a JMH benchmark (the rough shape of the on-heap case is sketched after the table below), with similar results:
| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| TestIndexInputReadFloats.testIndexInputReadFloats | thrpt | 25 | 1371.856 | ± 11.681 | ops/s |
| TestLuceneBaseline.testLuceneBaseline | thrpt | 25 | 572.370 | ± 3.918 | ops/s |
| TestOnHeapArray.testOnHeapArray | thrpt | 25 | 1915.688 | ± 1.233 | ops/s |
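For reference, the on-heap JMH case has roughly this shape (class names, field names, and sizes here are illustrative, not the exact code in this PR):

```java
import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
public class OnHeapArrayBench {
  private static final int DIM = 256;
  private static final int NUM_VECTORS = 100_000;
  private float[] vectors;

  @Setup
  public void setup() {
    Random random = new Random(42);
    vectors = new float[NUM_VECTORS * DIM];
    for (int i = 0; i < vectors.length; i++) {
      vectors[i] = random.nextFloat();
    }
  }

  @Benchmark
  public float dotProducts() {
    // compute a batch of dot products between adjacent vectors;
    // returning the total keeps the JIT from eliminating the work
    float total = 0;
    for (int ord = 0; ord + 1 < NUM_VECTORS; ord += 2) {
      int a = ord * DIM, b = (ord + 1) * DIM;
      float sum = 0;
      for (int i = 0; i < DIM; i++) {
        sum += vectors[a + i] * vectors[b + i];
      }
      total += sum;
    }
    return total;
  }
}
```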
It seems as if a large benefit can be had by implementing IndexInput.readFloats, so I'll work that up, but there's still a fairly big gap compared with on-heap arrays. Maybe some kind of loop unrolling / vectorization? I'll look at the asm and see what I can tease out of that.
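As a starting point for the unrolling idea, here's a minimal sketch of a manually unrolled dot product with multiple accumulators (not benchmarked here, and note that it changes the floating-point summation order relative to the simple loop):

```java
// 4-way unrolled dot product; separate accumulators reduce the dependency
// chain on the adds, which may let the JIT schedule/vectorize more freely
static float dotUnrolled(float[] a, float[] b) {
  float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  int i = 0;
  for (; i + 3 < a.length; i += 4) {
    s0 += a[i] * b[i];
    s1 += a[i + 1] * b[i + 1];
    s2 += a[i + 2] * b[i + 2];
    s3 += a[i + 3] * b[i + 3];
  }
  float sum = s0 + s1 + s2 + s3;
  // tail for dimensions not divisible by 4
  for (; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}
```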