What aKNN dimensionality should we use in nightly benchmark?
(Spinoff from discussions with @rmuir and in trying to test https://github.com/apache/lucene/pull/12311 in luceneutil).
Currently the nightly benchmark tests 100 dimensions, but this seems neither common nor realistic since 1) it is not a power of 2, and 2) it is maybe smallish? Should we pick a more "realistic" number (and if so, what)?
In the Vector optimizations, a power-of-two dimension can execute faster than a non-power-of-two one, since the "unrolling" won't need to do any scalar math and can use pure SIMD.
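To illustrate (a simplified sketch using the incubating `jdk.incubator.vector` API, not the actual Lucene implementation): the main loop consumes a full SIMD lane's worth of floats per iteration, and whatever is left over after `loopBound()` falls through to a scalar tail. Since the lane count is itself a power of two, a power-of-two dimension leaves no tail:

```java
// Requires --add-modules jdk.incubator.vector
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class DotProductSketch {
  private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  static float dotProduct(float[] a, float[] b) {
    FloatVector acc = FloatVector.zero(SPECIES);
    int i = 0;
    // main SIMD loop: one full lane's worth of floats per iteration
    int bound = SPECIES.loopBound(a.length);
    for (; i < bound; i += SPECIES.length()) {
      FloatVector va = FloatVector.fromArray(SPECIES, a, i);
      FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
      acc = va.fma(vb, acc);
    }
    float sum = acc.reduceLanes(VectorOperators.ADD);
    // scalar tail: only needed when the dimension is not a multiple of the lane count
    for (; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```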
In fact, since power-of-two can be faster, maybe for some methods we should zero-pad? E.g. `dotProduct()` will give the same (ish?) answer if you pad with 0s to the nearest/fastest power of two?
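For example (a hypothetical helper just to illustrate the idea, not proposed API): padding both vectors with zeros up to the next power of two leaves the dot product unchanged, because every extra term is 0 * 0:

```java
import java.util.Arrays;

public class ZeroPadSketch {
  // smallest power of two >= n
  static int nextPowerOfTwo(int n) {
    int p = 1;
    while (p < n) {
      p <<= 1;
    }
    return p;
  }

  static float[] zeroPad(float[] v) {
    // Arrays.copyOf fills the extra slots with 0.0f
    return Arrays.copyOf(v, nextPowerOfTwo(v.length));
  }

  // plain scalar dot product, just for the demonstration
  static float dotProduct(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    float[] a = new float[100];
    float[] b = new float[100];
    Arrays.fill(a, 0.5f);
    Arrays.fill(b, 2.0f);
    // same result (100.0) whether we use 100 dims or pad both vectors to 128
    System.out.println(dotProduct(a, b));
    System.out.println(dotProduct(zeroPad(a), zeroPad(b)));
  }
}
```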
Zero-padding should be done in the implementation method; I'm not sure we should really add the overhead (and the extra heap space) to all algorithms in Lucene. I am also not sure how much the Panama Vector API offers here?
I think it is enough to just use a bigger vector size that better represents the performance issues. Maybe the current graph looks fine to users who only use 100 or 256 dimensions, and they don't complain about the performance.
But try using 768 or 1024 and it is a different story. Users are screaming bloody murder wanting thousands and thousands of dimensions for ChatGPT, but we are over here benchmarking 100.
I was able to measure a larger speedup with 768-dim vectors than with lower-dimensional ones. For these tests I increased tasksPerCat=20 (from the default of 1) to give more chance to warm up. This is pretty slow though, and I think if we did this, we wouldn't need to run 20 JVMs (results seem to converge pretty fast).
768-dimensions, trained using mpnet model
| Task | QPS baseline | StdDev | QPS candidate | StdDev | Pct diff | p-value |
|---|---|---|---|---|---|---|
| PKLookup | 206.34 | (2.0%) | 208.62 | (1.5%) | 1.1% (-2% - 4%) | 0.046 |
| HighTermVector | 73.82 | (3.2%) | 84.85 | (1.6%) | 14.9% (9% - 20%) | 0.000 |
| AndHighMedVector | 64.59 | (3.4%) | 75.64 | (1.7%) | 17.1% (11% - 23%) | 0.000 |
| AndHighHighVector | 71.74 | (3.3%) | 88.23 | (1.9%) | 23.0% (17% - 29%) | 0.000 |
| AndHighLowVector | 74.30 | (3.4%) | 94.13 | (2.1%) | 26.7% (20% - 33%) | 0.000 |
| MedTermVector | 64.84 | (3.4%) | 83.01 | (1.9%) | 28.0% (22% - 34%) | 0.000 |
| LowTermVector | 70.95 | (3.2%) | 94.43 | (2.0%) | 33.1% (26% - 39%) | 0.000 |
Here is a JFR profile from that run:
| PERCENT | CPU SAMPLES | STACK |
|---|---|---|
| 49.24% | 209604 | `org.apache.lucene.util.VectorUtil#dotProduct()` |
| 2.21% | 9405 | `jdk.internal.foreign.AbstractMemorySegmentImpl#checkBounds()` |
| 2.02% | 8590 | `org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsReader$OffHeapHnswGraph#seek()` |
| 1.94% | 8264 | `org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()` |
| 1.70% | 7249 | `perf.VectorDictionary#vectorDiv()` |
| 1.69% | 7198 | `jdk.internal.misc.Unsafe#copyMemoryChecks()` |
| 1.55% | 6612 | `org.apache.lucene.util.packed.DirectMonotonicReader#get()` |
| 1.31% | 5588 | `org.apache.lucene.codecs.lucene95.OffHeapFloatVectorValues#vectorValue()` |
| 1.24% | 5265 | `org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()` |
| 1.10% | 4698 | `java.lang.foreign.MemorySegment#copy()` |
| 1.00% | 4261 | `jdk.internal.foreign.MemorySessionImpl#toSessionImpl()` |
| 1.00% | 4259 | `jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw()` |
| 0.98% | 4181 | `java.lang.invoke.VarHandleGuards#guard_LJ_I()` |
| 0.98% | 4166 | `org.apache.lucene.store.DataInput#readVInt()` |
| 0.96% | 4104 | `org.apache.lucene.util.SparseFixedBitSet#insertLong()` |
| 0.84% | 3581 | `org.apache.lucene.util.LongHeap#upHeap()` |
| 0.83% | 3536 | `org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl#readShort()` |
| 0.82% | 3477 | `java.util.Arrays#binarySearch0()` |
| 0.82% | 3473 | `org.apache.lucene.util.packed.DirectReader$DirectPackedReader16#get()` |
| 0.80% | 3417 | `java.lang.Object#equals()` |
| 0.80% | 3405 | `jdk.internal.misc.Unsafe#checkPrimitivePointer()` |
| 0.77% | 3262 | `java.util.Objects#requireNonNull()` |
| 0.76% | 3255 | `org.apache.lucene.util.LongHeap#downHeap()` |
| 0.69% | 2944 | `jdk.jfr.internal.JVM#emitEvent()` |
| 0.67% | 2847 | `org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()` |
| 0.62% | 2650 | `org.apache.lucene.util.LongHeap#push()` |
| 0.62% | 2632 | `java.lang.invoke.VarHandleSegmentAsBytes#get()` |
| 0.58% | 2452 | `org.apache.lucene.store.MemorySegmentIndexInput#readByte()` |
| 0.57% | 2443 | `org.apache.lucene.util.SparseFixedBitSet#getAndSet()` |
| 0.57% | 2411 | `java.lang.invoke.VarHandleSegmentAsShorts#checkAddress()` |
384-dimensions, trained using mlm model
| Task | QPS baseline | StdDev | QPS candidate | StdDev | Pct diff | p-value |
|---|---|---|---|---|---|---|
| PKLookup | 179.25 | (7.1%) | 176.70 | (5.5%) | -1.4% (-13% - 12%) | 0.479 |
| MedTermVector | 327.54 | (23.4%) | 363.44 | (16.3%) | 11.0% (-23% - 66%) | 0.086 |
| HighTermVector | 318.49 | (25.6%) | 353.68 | (16.8%) | 11.0% (-24% - 71%) | 0.106 |
| AndHighMedVector | 321.10 | (23.9%) | 359.66 | (12.3%) | 12.0% (-19% - 63%) | 0.045 |
| AndHighHighVector | 313.74 | (25.1%) | 352.26 | (19.5%) | 12.3% (-25% - 75%) | 0.084 |
| LowTermVector | 313.92 | (26.3%) | 354.19 | (15.2%) | 12.8% (-22% - 73%) | 0.059 |
| AndHighLowVector | 241.66 | (20.4%) | 290.45 | (13.5%) | 20.2% (-11% - 67%) | 0.000 |
I saw bigger speedups on indexing tasks; they are more dedicated to pure vector operations, I guess. Here is a comparison using a different script, knnPerfTest.py, which is based on KnnGraphTester (using 768-dim mpnet vectors):
baseline

| recall | query ms | nDoc | index ms |
|---|---|---|---|
| 0.838 | 0.53 | 10000 | 7785 |
| 0.755 | 1.92 | 100000 | 179471 |
| 0.734 | 3.99 | 200000 | 485050 |
with change

| recall | query ms | nDoc | index ms |
|---|---|---|---|
| 0.838 | 0.37 | 10000 | 4492 |
| 0.755 | 1.66 | 100000 | 95889 |
| 0.734 | 3.03 | 200000 | 260163 |
I'll follow up with some scripts for generating this data, copy some data to a download dir, and then maybe we can update the nightly setup. I'm a little worried about jumping to 768-dim vectors just in terms of indexing time. It might take 10 hrs single-threaded, or maybe more? Perhaps if we reduced to 8-bit it would go a bit faster, or we could compromise and run the benchmarks using the 384-dim vectors. Anyway, for now I'll just post the scripts and data.
Ooooh, these look like great results @msokolov! Thanks for testing. Sort of spooky that we still see some "overhead" methods in the top N hotspots (`jdk.internal.foreign.AbstractMemorySegmentImpl#checkBounds()`, `jdk.internal.misc.Unsafe#copyMemoryChecks()`, `jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw()`, etc.), but net/net it looks a lot better than before, I guess because of the increased iterations.
> I'll follow up with some scripts for generating this data, copy some data to a download dir, and then maybe we can update the nightly setup.
+1
We may need to get the `IndexRearranger` solution working (@zhaih!) if indeed the increased dimensionality is slow enough with the fully single-threaded indexing. `IndexRearranger` would allow us to build the same deterministic index much more quickly by using many threads to do the indexing, then rearranging the documents at the end.
> For these tests I increased tasksPerCat=20 (from the default of 1) to give more chance to warm up. This is pretty slow though, and I think if we did this, we wouldn't need to run 20 JVMs (results seem to converge pretty fast).
This is a long-standing mis-design of the benchmark. In the beginning, the benchmarker passed special command-line flags like `-Xbatch` to the JVM to "pre-optimize". This is no longer possible with recent Java versions (`-Xbatch` is fine for pre-optimizing, but it stops at C1; the real C2 optimizations are not applied during batch compilation). Newer JVMs rely more on tiered optimization based on performance analysis.
More complex features take more time to optimize, so we should really have some more "real-world" setups. People use servers running 24/7, executing queries and indexing stuff, so our benchmark should emulate this. Like with JMH, we should add enough warmup "per JVM" before we actually measure, and then let the JVM run for a much longer time.
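For reference, this is the kind of per-fork warmup and measurement control JMH gives you (a generic JMH example, not luceneutil code; the task and all numbers are made up):

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 5, time = 10, timeUnit = TimeUnit.SECONDS)       // warm up inside each JVM before measuring
@Measurement(iterations = 10, time = 30, timeUnit = TimeUnit.SECONDS) // then measure for a much longer time
@Fork(4)                                                              // e.g. 4 forked JVMs instead of 20
public class VectorSearchBenchmark {

  float[] query = new float[768];
  float[] doc = new float[768];

  @Benchmark
  public float dotProduct() {
    float sum = 0;
    for (int i = 0; i < query.length; i++) {
      sum += query[i] * doc[i];
    }
    return sum;
  }
}
```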
The current default settings with the small Wikimedia dump are not representative at all and should no longer be used for people to test new Lucene features. We should rethink and modify the benchmarks:
- add enough warmup before starting to measure. I am not sure of the best way to dynamically determine warmup time based on index size and query pool.
- use default JVM settings (tiered compilation) -> that's already done
- reduce the number of JVMs started. I think 4 rounds are plenty (the default is 20 at the moment).
Thanks a lot for posting these indexer benchmarks, @msokolov!
> Like with JMH, we should add enough warmup "per JVM" before we actually measure, and then let the JVM run for a much longer time.
You can control the number of warmup iterations (how many times each query will be executed, and then discarded), but it's not automagic, just a fixed count. And you can specify how many real iterations to run for each task.
> reduce the number of JVMs started. I think 4 rounds are plenty (the default is 20 at the moment).
+1, we should try reducing this.
We used 20 long ago because HotSpot noise was ... very noisy. Maybe HotSpot variance is tighter these days and we can get a strong enough signal with fewer iterations?
These gains are really quite incredible -- I tweeted about it: https://twitter.com/mikemccand/status/1664285218285694979?s=20
I'll try to cutover nightlies to the larger vectors...
I wonder if we also have some slowness reading these `float[]` (and `byte[]`) vectors through `IndexInput`? Maybe the Panama native memory APIs could help with this limitation in our `MMap` impl?
> I wonder if we also have some slowness reading these `float[]` (and `byte[]`) vectors through `IndexInput`? Maybe the Panama native memory APIs could help with this limitation in our `MMap` impl?
We can only fix this at a later stage, when the Vector APIs are public. At the moment we indeed do a duplicate copy (mmap copies to a heap array; the heap array is wrapped by the Vector API). In the future we can change the `IndexInput` API to return `FloatVector` instances (a new method like `readFloatVector`) that directly wrap a segment slice.
At the moment we can't do this, as it would require our public API to expose incubating APIs. As a hack we could add some new method `Object readFloatVector()`, which returns null by default, and which returns a `FloatVector` instance hidden behind `Object` for the new MMap impl (if the incubating module is enabled). If it is non-null, the caller could cast it.
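Very roughly, that hack might look like this (a hypothetical sketch with made-up type and method names, not the actual Lucene `IndexInput` API; `FloatVector` is the incubating `jdk.incubator.vector` type):

```java
// Requires --add-modules jdk.incubator.vector
import jdk.incubator.vector.FloatVector;

// Hypothetical sketch only -- not the actual Lucene API.
interface VectorInput {
  // Default returns null; a MemorySegment-based MMap impl could instead return
  // a FloatVector wrapping a segment slice, hidden behind Object so the public
  // signature does not mention the incubating type.
  default Object readFloatVector() {
    return null;
  }

  // Today's path: read the vector into a heap float[] (a duplicate copy).
  float[] readFloats();
}

class VectorInputCaller {
  static void consume(VectorInput in) {
    Object v = in.readFloatVector();
    if (v != null) {
      // Fast path: cast and feed the FloatVector straight into the SIMD kernel.
      FloatVector fv = (FloatVector) v;
      System.out.println("lanes available without copying: " + fv.length());
    } else {
      // Fallback: heap copy, then the existing kernel over float[].
      float[] vec = in.readFloats();
      System.out.println("copied " + vec.length + " floats onto the heap");
    }
  }
}
```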
But that's all too crazy for our current code, especially with incubation, so we could use this at the earliest when the API goes to the preview phase. Full typed support can only be done later. That's the major limitation: we can't expose those APIs unless our main code is on the minimal Java LTS version that has full non-incubating, non-preview support for vectors.
Once this is there we can rewrite much more!
@mikemccand what is the process for annotating benchmarks? https://home.apache.org/~mikemccand/lucenebench/indexing.html
I saw throughput and QPS drop and, after some time, discovered this change :) It would be good to add an annotation so folks don't dig down the same rabbit hole.