Separate Panama and Vector classes
Addresses #15284
VectorUtilBenchmark results:
main
Benchmark (size) Mode Cnt Score Error Units
VectorUtilBenchmark.binaryCosineScalar 1024 thrpt 15 0.841 ± 0.001 ops/us
VectorUtilBenchmark.binaryCosineVector 1024 thrpt 15 4.778 ± 0.012 ops/us
VectorUtilBenchmark.binaryDotProductScalar 1024 thrpt 15 2.289 ± 0.012 ops/us
VectorUtilBenchmark.binaryDotProductUint8Scalar 1024 thrpt 15 2.307 ± 0.010 ops/us
VectorUtilBenchmark.binaryDotProductUint8Vector 1024 thrpt 15 8.040 ± 0.001 ops/us
VectorUtilBenchmark.binaryDotProductVector 1024 thrpt 15 8.040 ± 0.001 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedScalar 1024 thrpt 15 2.368 ± 0.001 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector 1024 thrpt 15 11.652 ± 0.104 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductScalar 1024 thrpt 15 2.378 ± 0.002 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar 1024 thrpt 15 2.446 ± 0.009 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector 1024 thrpt 15 2.627 ± 0.013 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector 1024 thrpt 15 20.677 ± 0.160 ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedScalar 1024 thrpt 15 1.642 ± 0.001 ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector 1024 thrpt 15 12.614 ± 0.010 ops/us
VectorUtilBenchmark.binaryHalfByteSquareScalar 1024 thrpt 15 2.465 ± 0.006 ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar 1024 thrpt 15 2.022 ± 0.001 ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector 1024 thrpt 15 2.590 ± 0.012 ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector 1024 thrpt 15 18.526 ± 0.012 ops/us
VectorUtilBenchmark.binarySquareScalar 1024 thrpt 15 2.431 ± 0.007 ops/us
VectorUtilBenchmark.binarySquareUint8Scalar 1024 thrpt 15 2.422 ± 0.025 ops/us
VectorUtilBenchmark.binarySquareUint8Vector 1024 thrpt 15 6.709 ± 0.002 ops/us
VectorUtilBenchmark.binarySquareVector 1024 thrpt 15 6.710 ± 0.001 ops/us
VectorUtilBenchmark.floatCosineScalar 1024 thrpt 15 1.419 ± 0.001 ops/us
VectorUtilBenchmark.floatCosineVector 1024 thrpt 75 8.913 ± 0.013 ops/us
VectorUtilBenchmark.floatDotProductScalar 1024 thrpt 15 3.734 ± 0.004 ops/us
VectorUtilBenchmark.floatDotProductVector 1024 thrpt 75 12.561 ± 0.346 ops/us
VectorUtilBenchmark.floatSquareScalar 1024 thrpt 15 3.181 ± 0.013 ops/us
VectorUtilBenchmark.floatSquareVector 1024 thrpt 75 12.370 ± 0.398 ops/us
VectorUtilBenchmark.l2Normalize 1024 thrpt 15 3.016 ± 0.002 ops/us
VectorUtilBenchmark.l2NormalizeVector 1024 thrpt 75 12.349 ± 0.719 ops/us
This PR
Benchmark (size) Mode Cnt Score Error Units
VectorUtilBenchmark.binaryCosineScalar 1024 thrpt 15 0.841 ± 0.001 ops/us
VectorUtilBenchmark.binaryCosineVector 1024 thrpt 15 4.860 ± 0.007 ops/us
VectorUtilBenchmark.binaryDotProductScalar 1024 thrpt 15 2.298 ± 0.014 ops/us
VectorUtilBenchmark.binaryDotProductUint8Scalar 1024 thrpt 15 2.288 ± 0.024 ops/us
VectorUtilBenchmark.binaryDotProductUint8Vector 1024 thrpt 15 8.040 ± 0.001 ops/us
VectorUtilBenchmark.binaryDotProductVector 1024 thrpt 15 8.039 ± 0.001 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedScalar 1024 thrpt 15 2.376 ± 0.003 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector 1024 thrpt 15 11.498 ± 0.286 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductScalar 1024 thrpt 15 2.376 ± 0.002 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar 1024 thrpt 15 2.449 ± 0.007 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector 1024 thrpt 15 2.627 ± 0.009 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector 1024 thrpt 15 20.785 ± 0.009 ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedScalar 1024 thrpt 15 1.696 ± 0.001 ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector 1024 thrpt 15 12.562 ± 0.023 ops/us
VectorUtilBenchmark.binaryHalfByteSquareScalar 1024 thrpt 15 2.474 ± 0.010 ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar 1024 thrpt 15 2.021 ± 0.006 ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector 1024 thrpt 15 2.609 ± 0.015 ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector 1024 thrpt 15 18.487 ± 0.075 ops/us
VectorUtilBenchmark.binarySquareScalar 1024 thrpt 15 2.413 ± 0.021 ops/us
VectorUtilBenchmark.binarySquareUint8Scalar 1024 thrpt 15 2.420 ± 0.017 ops/us
VectorUtilBenchmark.binarySquareUint8Vector 1024 thrpt 15 6.709 ± 0.002 ops/us
VectorUtilBenchmark.binarySquareVector 1024 thrpt 15 6.709 ± 0.002 ops/us
VectorUtilBenchmark.floatCosineScalar 1024 thrpt 15 1.415 ± 0.002 ops/us
VectorUtilBenchmark.floatCosineVector 1024 thrpt 75 8.646 ± 0.080 ops/us
VectorUtilBenchmark.floatDotProductScalar 1024 thrpt 15 3.733 ± 0.003 ops/us
VectorUtilBenchmark.floatDotProductVector 1024 thrpt 75 12.249 ± 0.046 ops/us
VectorUtilBenchmark.floatSquareScalar 1024 thrpt 15 3.171 ± 0.008 ops/us
VectorUtilBenchmark.floatSquareVector 1024 thrpt 75 12.483 ± 0.104 ops/us
VectorUtilBenchmark.l2Normalize 1024 thrpt 15 3.017 ± 0.002 ops/us
VectorUtilBenchmark.l2NormalizeVector 1024 thrpt 75 12.207 ± 0.764 ops/us
Ran some luceneutil benchmarks on Cohere vectors (768d), for various vector similarities x quantization bits:
dot_product
main
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.641 0.675 0.666 0.987 200000 100 50 32 250 1 bits 5101 10.74 18627.18 20.85 1 624.45 606.918 20.981 HNSW
0.878 1.170 1.161 0.992 200000 100 50 32 250 4 bits 4662 12.20 16398.82 23.07 1 678.09 662.231 76.294 HNSW
0.915 1.517 1.505 0.992 200000 100 50 32 250 7 bits 4605 12.58 15896.99 31.01 1 751.27 735.474 149.536 HNSW
0.915 1.523 1.515 0.995 200000 100 50 32 250 8 bits 4570 11.64 17180.65 18.18 1 751.17 735.474 149.536 HNSW
This PR
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.641 0.678 0.668 0.985 200000 100 50 32 250 1 bits 5064 10.83 18467.22 21.32 1 624.43 606.918 20.981 HNSW
0.876 1.140 1.131 0.992 200000 100 50 32 250 4 bits 4660 11.67 17132.09 23.35 1 678.10 662.231 76.294 HNSW
0.914 1.514 1.504 0.993 200000 100 50 32 250 7 bits 4575 12.34 16208.77 18.19 1 751.21 735.474 149.536 HNSW
0.916 1.576 1.566 0.994 200000 100 50 32 250 8 bits 4580 12.32 16229.81 18.29 1 751.23 735.474 149.536 HNSW
mip
main
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.640 0.754 0.745 0.988 200000 100 50 32 250 1 bits 5076 11.12 17987.23 20.55 1 624.43 606.918 20.981 HNSW
0.877 1.174 1.165 0.992 200000 100 50 32 250 4 bits 4645 11.95 16737.80 24.10 1 678.11 662.231 76.294 HNSW
0.912 1.566 1.557 0.994 200000 100 50 32 250 7 bits 4573 11.96 16723.81 18.21 1 751.21 735.474 149.536 HNSW
0.916 1.509 1.500 0.994 200000 100 50 32 250 8 bits 4578 12.18 16416.32 18.29 1 751.19 735.474 149.536 HNSW
This PR
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.641 0.709 0.700 0.987 200000 100 50 32 250 1 bits 5080 11.68 17120.36 20.85 1 624.44 606.918 20.981 HNSW
0.877 1.191 1.182 0.992 200000 100 50 32 250 4 bits 4654 11.61 17232.47 22.12 1 678.11 662.231 76.294 HNSW
0.914 1.527 1.518 0.994 200000 100 50 32 250 7 bits 4585 12.27 16306.56 18.17 1 751.22 735.474 149.536 HNSW
0.915 1.541 1.532 0.994 200000 100 50 32 250 8 bits 4582 11.70 17091.10 18.30 1 751.22 735.474 149.536 HNSW
euclidean
main
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.691 0.625 0.615 0.984 200000 100 50 32 250 1 bits 4723 9.64 20751.19 17.36 1 615.12 606.918 20.981 HNSW
0.906 0.993 0.979 0.986 200000 100 50 32 250 4 bits 4413 10.70 18698.58 21.10 1 669.73 662.231 76.294 HNSW
0.948 1.361 1.353 0.994 200000 100 50 32 250 7 bits 4389 12.22 16369.29 25.86 1 743.24 735.474 149.536 HNSW
0.950 1.335 1.326 0.993 200000 100 50 32 250 8 bits 4387 11.31 17691.29 25.83 1 743.26 735.474 149.536 HNSW
This PR
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.692 0.628 0.618 0.984 200000 100 50 32 250 1 bits 4741 10.19 19627.09 17.71 1 615.11 606.918 20.981 HNSW
0.905 0.987 0.977 0.990 200000 100 50 32 250 4 bits 4416 10.46 19118.63 20.92 1 669.72 662.231 76.294 HNSW
0.949 1.396 1.387 0.994 200000 100 50 32 250 7 bits 4395 12.06 16579.62 25.65 1 743.22 735.474 149.536 HNSW
0.951 1.332 1.316 0.988 200000 100 50 32 250 8 bits 4382 12.03 16629.25 25.74 1 743.24 735.474 149.536 HNSW
cosine
main
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.656 0.641 0.632 0.986 200000 100 50 32 250 1 bits 4996 10.17 19663.75 17.60 1 616.88 606.918 20.981 HNSW
0.889 1.078 1.069 0.992 200000 100 50 32 250 4 bits 4603 10.64 18793.46 23.01 1 671.76 662.231 76.294 HNSW
0.944 1.438 1.429 0.994 200000 100 50 32 250 7 bits 4537 12.14 16477.18 27.64 1 745.81 735.474 149.536 HNSW
0.948 1.459 1.450 0.994 200000 100 50 32 250 8 bits 4524 11.83 16913.32 27.53 1 745.93 735.474 149.536 HNSW
This PR
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.657 0.644 0.635 0.986 200000 100 50 32 250 1 bits 5006 10.30 19411.82 17.96 1 616.85 606.918 20.981 HNSW
0.888 0.994 0.985 0.991 200000 100 50 32 250 4 bits 4565 11.39 17556.18 22.29 1 671.74 662.231 76.294 HNSW
0.945 1.422 1.413 0.994 200000 100 50 32 250 7 bits 4522 11.72 17064.85 27.42 1 745.81 735.474 149.536 HNSW
0.948 1.442 1.433 0.994 200000 100 50 32 250 8 bits 4514 11.94 16746.21 26.94 1 745.94 735.474 149.536 HNSW
Except for one outlier (dot_product, main, force_merge(s)), all values appear to be within ~5% of each other.
I am not able to do any close review here, so please don't merge this now.
Maybe we could enhance Lucene's jmh infra so it can compare baseline/candidate runs somehow? It's hard for human eyes + brain to scan all those numbers and confirm there's no real difference... maybe open a spinoff issue?
Edit: heh, and the same comment applies to luceneutil's knnPerfTest.py? That tool has really flowered over time (and is now run in nightly benchmarks too) for testing all the many KNN options Lucene offers...
> It's hard for human eyes + brain to scan all those numbers and confirm there's no real difference
Haha true :) I fed the raw data to an LLM and asked it to report percentage differences:
| Benchmark | Baseline Score (ops/μs) | Candidate Score (ops/μs) | % Difference |
|---|---|---|---|
| floatCosineVector | 8.913 | 8.646 | -3.00% |
| floatDotProductVector | 12.561 | 12.249 | -2.48% |
| binaryHalfByteDotProductBothPackedVector | 11.652 | 11.498 | -1.32% |
| l2NormalizeVector | 12.349 | 12.207 | -1.15% |
| binaryDotProductUint8Scalar | 2.307 | 2.288 | -0.82% |
| binarySquareScalar | 2.431 | 2.413 | -0.74% |
| binaryHalfByteSquareBothPackedVector | 12.614 | 12.562 | -0.41% |
| floatSquareScalar | 3.181 | 3.171 | -0.31% |
| floatCosineScalar | 1.419 | 1.415 | -0.28% |
| binaryHalfByteSquareVector | 18.526 | 18.487 | -0.21% |
| binaryHalfByteDotProductScalar | 2.378 | 2.376 | -0.08% |
| binarySquareUint8Scalar | 2.422 | 2.420 | -0.08% |
| binaryHalfByteSquareSinglePackedScalar | 2.022 | 2.021 | -0.05% |
| floatDotProductScalar | 3.734 | 3.733 | -0.03% |
| binaryDotProductVector | 8.040 | 8.039 | -0.01% |
| binarySquareVector | 6.710 | 6.709 | -0.01% |
| binaryCosineScalar | 0.841 | 0.841 | 0.00% |
| binaryDotProductUint8Vector | 8.040 | 8.040 | 0.00% |
| binaryHalfByteDotProductSinglePackedVector | 2.627 | 2.627 | 0.00% |
| binarySquareUint8Vector | 6.709 | 6.709 | 0.00% |
| l2Normalize | 3.016 | 3.017 | 0.03% |
| binaryHalfByteDotProductSinglePackedScalar | 2.446 | 2.449 | 0.12% |
| binaryHalfByteDotProductBothPackedScalar | 2.368 | 2.376 | 0.34% |
| binaryHalfByteSquareScalar | 2.465 | 2.474 | 0.36% |
| binaryDotProductScalar | 2.289 | 2.298 | 0.39% |
| binaryHalfByteDotProductVector | 20.677 | 20.785 | 0.52% |
| binaryHalfByteSquareSinglePackedVector | 2.590 | 2.609 | 0.73% |
| floatSquareVector | 12.370 | 12.483 | 0.91% |
| binaryCosineVector | 4.778 | 4.860 | 1.72% |
| binaryHalfByteSquareBothPackedScalar | 1.642 | 1.696 | 3.29% |
Side note: I found this cool visualizer (https://jmh.morethan.io), which takes the JSON output of JMH (add `-rf json` to the command line), and can compare multiple runs too!
For example, I re-ran a subset of functions and recorded their output in https://gist.github.com/kaivalnp/0424bd84326aebdecd10f8144fb46c73 Now we can visualize the results at: https://jmh.morethan.io/?gist=0424bd84326aebdecd10f8144fb46c73
Also found this GH action that automatically runs and compares JMH output: https://github.com/benchmark-action/github-action-benchmark, might be interesting to add to Lucene!
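For reference, the same JSON output can also be produced from JMH's programmatic runner; here's a minimal sketch (the include pattern is just illustrative), equivalent to passing `-rf json -rff jmh-result.json` on the command line:

```java
import org.openjdk.jmh.results.format.ResultFormatType;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunBenchmarksWithJsonOutput {
  public static void main(String[] args) throws RunnerException {
    Options opts = new OptionsBuilder()
        .include("VectorUtilBenchmark")       // illustrative include pattern
        .resultFormat(ResultFormatType.JSON)  // same as -rf json
        .result("jmh-result.json")            // same as -rff jmh-result.json
        .build();
    new Runner(opts).run();
  }
}
```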
@uschindler just wanted to ask, did you get a chance to look at these changes? Thanks!
This change makes a lot of sense -- the FFM part of Panama is done incubating as of Java 22, so we should promote it out of Lucene's mrjar sources?
@kaivalnp I think you've addressed all of @uschindler's comments? Can we take this out of draft now? Are there any other parts you still need to do?
Thanks @mikemccand, IIUC Uwe's main concern was to not expose incubating / experimental APIs (i.e. the Vector API) publicly, which this PR doesn't -- I'm not sure what else to look out for...
One other potential issue I see is the renaming of public APIs -- like VectorUtil::dotProduct -> VectorUtil::dotProductBytes / VectorUtil::dotProductFloats. Earlier, we didn't expose signatures for off-heap computations, so the overloads were VectorUtil::dotProduct(byte[], byte[]) and VectorUtil::dotProduct(float[], float[]); now that we do, they'd both be VectorUtil::dotProduct(MemorySegment, MemorySegment) and need to be separated.
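To make the collision concrete, here's a rough sketch (names and bodies are just illustrative, not the actual Lucene code): the on-heap overloads can share a name because the array types differ, but both off-heap variants take (MemorySegment, MemorySegment), so they need distinct names:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

class DotProductOverloadSketch {
  // On-heap overloads can share a name: the parameter types keep them distinct.
  static int dotProduct(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  // Off-heap variants would both take (MemorySegment, MemorySegment), so the
  // overloads collide and need separate names, e.g.:
  static int dotProductBytes(MemorySegment a, MemorySegment b) {
    int sum = 0;
    for (long i = 0; i < a.byteSize(); i++) {
      sum += a.get(ValueLayout.JAVA_BYTE, i) * b.get(ValueLayout.JAVA_BYTE, i);
    }
    return sum;
  }

  static float dotProductFloats(MemorySegment a, MemorySegment b) {
    float sum = 0f;
    for (long i = 0; i < a.byteSize(); i += Float.BYTES) {
      sum += a.get(ValueLayout.JAVA_FLOAT_UNALIGNED, i)
           * b.get(ValueLayout.JAVA_FLOAT_UNALIGNED, i);
    }
    return sum;
  }
}
```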
I'll pull out of draft!
Hi, sorry for the delay. The FFM part was already moved out of the separate sourceSet on the main branch (we no longer have an MR-JAR in reality, it's copied together; we just have separate compilation units to allow us to compile against a stable API and optionally add MR-JAR entries when Panama changes in later Java versions). The separation was already removed... except for the glue classes for vectors. So MmapDir and madvise are already part of the main sources. So the issue description is wrong: this is only about Panama vectors.
What's done here is just moving some glue classes to the main sources, to allow the Panama vector code to access memory segments provided by MmapDir and possibly a new MemorySegment-backed replacement of ByteBuffersDirectory. This is needed and OK to do, but my main issue with this PR is the additional complexity just to achieve this! Why do we need all those additional abstractions with functional interfaces everywhere? I tried to understand this, and every time I looked at this PR I gave up after 20 minutes staring at those horrible abstractions with generics. Let's have exactly one interface, preferably without generics, as the glue between the main sources and the Panama vector code in its separate sourceSet.
Please make it simpler or give a full explanation of why we need all those extra generics and functional interfaces. The PR adds 400 extra lines of code instead of making things simpler!
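To sketch the kind of glue I have in mind (names purely hypothetical, not a concrete proposal): one small, generics-free interface in the main sources, implemented once by a scalar fallback and once by the Panama vector sourceSet:

```java
import java.lang.foreign.MemorySegment;

// Purely hypothetical sketch of a single glue interface. The main sources
// would program against this; the Panama vector sourceSet would supply the
// vectorized implementation, and a scalar fallback would live in main.
interface OffHeapVectorOps {
  int dotProduct(MemorySegment a, MemorySegment b, int elementCount);
  int squareDistance(MemorySegment a, MemorySegment b, int elementCount);
}
```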
Another problem that makes reviewing harder is the additional method renames. Can we separate those out, to make it easier to get a glimpse of what's going on?
Sorry for the delay, but it's busy here and that's too much complexity for a quick review. Maybe @rmuir can also have a look.
And finally: we have some benchmarks here, but those are too simple to show how the additional abstractions affect the HotSpot compiler when executed in real code. Micro benchmarks like this have led to problems that suddenly only appeared in Mike's benchmarks: if you benchmark a little bit of code with abstractions, HotSpot has an easy job removing the abstractions, but in complex environments during query execution the additional abstractions can kill your performance!
So because of this, I am really afraid of this PR going in in its current form.
Thanks @uschindler -- sorry for my ignorance, indeed I see we already promoted MMapIndexInput (using MemorySegment etc.)...
Hi, I think the current PR covers most of the work to do; I am just not sure if we really need the added complexity.
This comment in the issue:

> IMO it would provide a cleaner separation of functionality and simplify code a bit too (for e.g. we can move classes like Lucene99MemorySegmentByteVectorScorer out of java25/) + users that do not enable vectorization can score vectors off-heap
This is an argument, but I still don't see a reason to write code on top of MemorySegment for non-vectorized code. It does not get simpler, because you still need two different implementations: one for MemorySegment and one for RandomAccessInput / IndexInput.
The current code duplicates a lot of methods and has variants for (byte[], byte[]), (MemorySegment, byte[]), and (MemorySegment, MemorySegment).
Maybe for Lucene 11 the better idea would be to implement VectorUtil only with MemorySegment and throw away the byte[] impls. The code should be the same speed (if HotSpot works correctly). If you have byte[] code, you can wrap it as a MemorySegment before calling VectorUtil.
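For example (a sketch: MemorySegment.ofArray gives a zero-copy on-heap view over the array, and the loop below is only a stand-in for whatever MemorySegment-based VectorUtil method would exist):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

class WrapByteArraySketch {
  static int dotProductOfArrays(byte[] a, byte[] b) {
    MemorySegment segA = MemorySegment.ofArray(a); // on-heap, no copy
    MemorySegment segB = MemorySegment.ofArray(b);
    int sum = 0;
    for (long i = 0; i < segA.byteSize(); i++) {
      sum += segA.get(ValueLayout.JAVA_BYTE, i) * segB.get(ValueLayout.JAVA_BYTE, i);
    }
    return sum;
  }
}
```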
Provocative remark: in general, I tend to think that at some point we should throw away ByteBuffer and byte[] everywhere in our code and replace them with MemorySegment. This would also allow us to get rid of certain 31-bit limitations. Of course I am planning to submit a PR to replace ByteBuffersDirectory with a MemorySegment-backed variant! Keep in mind that MemorySegment also works on-heap! This would allow, for example, the current NRTCachingDirectory to vectorize like MMapDir.
P.S.: Sorry if my comment yesterday was a bit harsh regarding "horrible generics".
Uwe
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!