Separate Panama and Vector classes
Addresses #15284
VectorUtilBenchmark results:
main
Benchmark (size) Mode Cnt Score Error Units
VectorUtilBenchmark.binaryCosineScalar 1024 thrpt 15 0.841 ± 0.001 ops/us
VectorUtilBenchmark.binaryCosineVector 1024 thrpt 15 4.778 ± 0.012 ops/us
VectorUtilBenchmark.binaryDotProductScalar 1024 thrpt 15 2.289 ± 0.012 ops/us
VectorUtilBenchmark.binaryDotProductUint8Scalar 1024 thrpt 15 2.307 ± 0.010 ops/us
VectorUtilBenchmark.binaryDotProductUint8Vector 1024 thrpt 15 8.040 ± 0.001 ops/us
VectorUtilBenchmark.binaryDotProductVector 1024 thrpt 15 8.040 ± 0.001 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedScalar 1024 thrpt 15 2.368 ± 0.001 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector 1024 thrpt 15 11.652 ± 0.104 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductScalar 1024 thrpt 15 2.378 ± 0.002 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar 1024 thrpt 15 2.446 ± 0.009 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector 1024 thrpt 15 2.627 ± 0.013 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector 1024 thrpt 15 20.677 ± 0.160 ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedScalar 1024 thrpt 15 1.642 ± 0.001 ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector 1024 thrpt 15 12.614 ± 0.010 ops/us
VectorUtilBenchmark.binaryHalfByteSquareScalar 1024 thrpt 15 2.465 ± 0.006 ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar 1024 thrpt 15 2.022 ± 0.001 ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector 1024 thrpt 15 2.590 ± 0.012 ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector 1024 thrpt 15 18.526 ± 0.012 ops/us
VectorUtilBenchmark.binarySquareScalar 1024 thrpt 15 2.431 ± 0.007 ops/us
VectorUtilBenchmark.binarySquareUint8Scalar 1024 thrpt 15 2.422 ± 0.025 ops/us
VectorUtilBenchmark.binarySquareUint8Vector 1024 thrpt 15 6.709 ± 0.002 ops/us
VectorUtilBenchmark.binarySquareVector 1024 thrpt 15 6.710 ± 0.001 ops/us
VectorUtilBenchmark.floatCosineScalar 1024 thrpt 15 1.419 ± 0.001 ops/us
VectorUtilBenchmark.floatCosineVector 1024 thrpt 75 8.913 ± 0.013 ops/us
VectorUtilBenchmark.floatDotProductScalar 1024 thrpt 15 3.734 ± 0.004 ops/us
VectorUtilBenchmark.floatDotProductVector 1024 thrpt 75 12.561 ± 0.346 ops/us
VectorUtilBenchmark.floatSquareScalar 1024 thrpt 15 3.181 ± 0.013 ops/us
VectorUtilBenchmark.floatSquareVector 1024 thrpt 75 12.370 ± 0.398 ops/us
VectorUtilBenchmark.l2Normalize 1024 thrpt 15 3.016 ± 0.002 ops/us
VectorUtilBenchmark.l2NormalizeVector 1024 thrpt 75 12.349 ± 0.719 ops/us
This PR
Benchmark (size) Mode Cnt Score Error Units
VectorUtilBenchmark.binaryCosineScalar 1024 thrpt 15 0.841 ± 0.001 ops/us
VectorUtilBenchmark.binaryCosineVector 1024 thrpt 15 4.860 ± 0.007 ops/us
VectorUtilBenchmark.binaryDotProductScalar 1024 thrpt 15 2.298 ± 0.014 ops/us
VectorUtilBenchmark.binaryDotProductUint8Scalar 1024 thrpt 15 2.288 ± 0.024 ops/us
VectorUtilBenchmark.binaryDotProductUint8Vector 1024 thrpt 15 8.040 ± 0.001 ops/us
VectorUtilBenchmark.binaryDotProductVector 1024 thrpt 15 8.039 ± 0.001 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedScalar 1024 thrpt 15 2.376 ± 0.003 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector 1024 thrpt 15 11.498 ± 0.286 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductScalar 1024 thrpt 15 2.376 ± 0.002 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar 1024 thrpt 15 2.449 ± 0.007 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector 1024 thrpt 15 2.627 ± 0.009 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector 1024 thrpt 15 20.785 ± 0.009 ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedScalar 1024 thrpt 15 1.696 ± 0.001 ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector 1024 thrpt 15 12.562 ± 0.023 ops/us
VectorUtilBenchmark.binaryHalfByteSquareScalar 1024 thrpt 15 2.474 ± 0.010 ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar 1024 thrpt 15 2.021 ± 0.006 ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector 1024 thrpt 15 2.609 ± 0.015 ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector 1024 thrpt 15 18.487 ± 0.075 ops/us
VectorUtilBenchmark.binarySquareScalar 1024 thrpt 15 2.413 ± 0.021 ops/us
VectorUtilBenchmark.binarySquareUint8Scalar 1024 thrpt 15 2.420 ± 0.017 ops/us
VectorUtilBenchmark.binarySquareUint8Vector 1024 thrpt 15 6.709 ± 0.002 ops/us
VectorUtilBenchmark.binarySquareVector 1024 thrpt 15 6.709 ± 0.002 ops/us
VectorUtilBenchmark.floatCosineScalar 1024 thrpt 15 1.415 ± 0.002 ops/us
VectorUtilBenchmark.floatCosineVector 1024 thrpt 75 8.646 ± 0.080 ops/us
VectorUtilBenchmark.floatDotProductScalar 1024 thrpt 15 3.733 ± 0.003 ops/us
VectorUtilBenchmark.floatDotProductVector 1024 thrpt 75 12.249 ± 0.046 ops/us
VectorUtilBenchmark.floatSquareScalar 1024 thrpt 15 3.171 ± 0.008 ops/us
VectorUtilBenchmark.floatSquareVector 1024 thrpt 75 12.483 ± 0.104 ops/us
VectorUtilBenchmark.l2Normalize 1024 thrpt 15 3.017 ± 0.002 ops/us
VectorUtilBenchmark.l2NormalizeVector 1024 thrpt 75 12.207 ± 0.764 ops/us
Ran some luceneutil benchmarks on Cohere vectors (768d), for various vector similarities x quantization bits:
dot_product
main
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.641 0.675 0.666 0.987 200000 100 50 32 250 1 bits 5101 10.74 18627.18 20.85 1 624.45 606.918 20.981 HNSW
0.878 1.170 1.161 0.992 200000 100 50 32 250 4 bits 4662 12.20 16398.82 23.07 1 678.09 662.231 76.294 HNSW
0.915 1.517 1.505 0.992 200000 100 50 32 250 7 bits 4605 12.58 15896.99 31.01 1 751.27 735.474 149.536 HNSW
0.915 1.523 1.515 0.995 200000 100 50 32 250 8 bits 4570 11.64 17180.65 18.18 1 751.17 735.474 149.536 HNSW
This PR
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.641 0.678 0.668 0.985 200000 100 50 32 250 1 bits 5064 10.83 18467.22 21.32 1 624.43 606.918 20.981 HNSW
0.876 1.140 1.131 0.992 200000 100 50 32 250 4 bits 4660 11.67 17132.09 23.35 1 678.10 662.231 76.294 HNSW
0.914 1.514 1.504 0.993 200000 100 50 32 250 7 bits 4575 12.34 16208.77 18.19 1 751.21 735.474 149.536 HNSW
0.916 1.576 1.566 0.994 200000 100 50 32 250 8 bits 4580 12.32 16229.81 18.29 1 751.23 735.474 149.536 HNSW
mip
main
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.640 0.754 0.745 0.988 200000 100 50 32 250 1 bits 5076 11.12 17987.23 20.55 1 624.43 606.918 20.981 HNSW
0.877 1.174 1.165 0.992 200000 100 50 32 250 4 bits 4645 11.95 16737.80 24.10 1 678.11 662.231 76.294 HNSW
0.912 1.566 1.557 0.994 200000 100 50 32 250 7 bits 4573 11.96 16723.81 18.21 1 751.21 735.474 149.536 HNSW
0.916 1.509 1.500 0.994 200000 100 50 32 250 8 bits 4578 12.18 16416.32 18.29 1 751.19 735.474 149.536 HNSW
This PR
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.641 0.709 0.700 0.987 200000 100 50 32 250 1 bits 5080 11.68 17120.36 20.85 1 624.44 606.918 20.981 HNSW
0.877 1.191 1.182 0.992 200000 100 50 32 250 4 bits 4654 11.61 17232.47 22.12 1 678.11 662.231 76.294 HNSW
0.914 1.527 1.518 0.994 200000 100 50 32 250 7 bits 4585 12.27 16306.56 18.17 1 751.22 735.474 149.536 HNSW
0.915 1.541 1.532 0.994 200000 100 50 32 250 8 bits 4582 11.70 17091.10 18.30 1 751.22 735.474 149.536 HNSW
euclidean
main
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.691 0.625 0.615 0.984 200000 100 50 32 250 1 bits 4723 9.64 20751.19 17.36 1 615.12 606.918 20.981 HNSW
0.906 0.993 0.979 0.986 200000 100 50 32 250 4 bits 4413 10.70 18698.58 21.10 1 669.73 662.231 76.294 HNSW
0.948 1.361 1.353 0.994 200000 100 50 32 250 7 bits 4389 12.22 16369.29 25.86 1 743.24 735.474 149.536 HNSW
0.950 1.335 1.326 0.993 200000 100 50 32 250 8 bits 4387 11.31 17691.29 25.83 1 743.26 735.474 149.536 HNSW
This PR
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.692 0.628 0.618 0.984 200000 100 50 32 250 1 bits 4741 10.19 19627.09 17.71 1 615.11 606.918 20.981 HNSW
0.905 0.987 0.977 0.990 200000 100 50 32 250 4 bits 4416 10.46 19118.63 20.92 1 669.72 662.231 76.294 HNSW
0.949 1.396 1.387 0.994 200000 100 50 32 250 7 bits 4395 12.06 16579.62 25.65 1 743.22 735.474 149.536 HNSW
0.951 1.332 1.316 0.988 200000 100 50 32 250 8 bits 4382 12.03 16629.25 25.74 1 743.24 735.474 149.536 HNSW
cosine
main
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.656 0.641 0.632 0.986 200000 100 50 32 250 1 bits 4996 10.17 19663.75 17.60 1 616.88 606.918 20.981 HNSW
0.889 1.078 1.069 0.992 200000 100 50 32 250 4 bits 4603 10.64 18793.46 23.01 1 671.76 662.231 76.294 HNSW
0.944 1.438 1.429 0.994 200000 100 50 32 250 7 bits 4537 12.14 16477.18 27.64 1 745.81 735.474 149.536 HNSW
0.948 1.459 1.450 0.994 200000 100 50 32 250 8 bits 4524 11.83 16913.32 27.53 1 745.93 735.474 149.536 HNSW
This PR
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.657 0.644 0.635 0.986 200000 100 50 32 250 1 bits 5006 10.30 19411.82 17.96 1 616.85 606.918 20.981 HNSW
0.888 0.994 0.985 0.991 200000 100 50 32 250 4 bits 4565 11.39 17556.18 22.29 1 671.74 662.231 76.294 HNSW
0.945 1.422 1.413 0.994 200000 100 50 32 250 7 bits 4522 11.72 17064.85 27.42 1 745.81 735.474 149.536 HNSW
0.948 1.442 1.433 0.994 200000 100 50 32 250 8 bits 4514 11.94 16746.21 26.94 1 745.94 735.474 149.536 HNSW
Except for one outlier (dot_product, main, force_merge(s)), all values appear to be within ~5% of each other.
I am not able to do any close review here, so please don't merge this now.
Maybe we could enhance Lucene's jmh infra so it can compare baseline/candidate runs somehow? It's hard for human eyes + brain to scan all those numbers and confirm there's no real difference... maybe open a spinoff issue?
Edit: heh, and the same comment applies to luceneutil's knnPerfTest.py? That tool has really flowered over time (and is now run in nightly benchmarks too) for testing all the many KNN options Lucene offers...
> It's hard for human eyes + brain to scan all those numbers and confirm there's no real difference
Haha true :) I fed the raw data to an LLM and asked it to report percentage differences:
| Benchmark | Baseline Score (ops/μs) | Candidate Score (ops/μs) | % Difference |
|---|---|---|---|
| floatCosineVector | 8.913 | 8.646 | -3.00% |
| floatDotProductVector | 12.561 | 12.249 | -2.48% |
| binaryHalfByteDotProductBothPackedVector | 11.652 | 11.498 | -1.32% |
| l2NormalizeVector | 12.349 | 12.207 | -1.15% |
| binaryDotProductUint8Scalar | 2.307 | 2.288 | -0.82% |
| binarySquareScalar | 2.431 | 2.413 | -0.74% |
| binaryHalfByteSquareBothPackedVector | 12.614 | 12.562 | -0.41% |
| floatSquareScalar | 3.181 | 3.171 | -0.31% |
| floatCosineScalar | 1.419 | 1.415 | -0.28% |
| binaryHalfByteSquareVector | 18.526 | 18.487 | -0.21% |
| binaryHalfByteDotProductScalar | 2.378 | 2.376 | -0.08% |
| binarySquareUint8Scalar | 2.422 | 2.420 | -0.08% |
| binaryHalfByteSquareSinglePackedScalar | 2.022 | 2.021 | -0.05% |
| floatDotProductScalar | 3.734 | 3.733 | -0.03% |
| binaryDotProductVector | 8.040 | 8.039 | -0.01% |
| binarySquareVector | 6.710 | 6.709 | -0.01% |
| binaryCosineScalar | 0.841 | 0.841 | 0.00% |
| binaryDotProductUint8Vector | 8.040 | 8.040 | 0.00% |
| binaryHalfByteDotProductSinglePackedVector | 2.627 | 2.627 | 0.00% |
| binarySquareUint8Vector | 6.709 | 6.709 | 0.00% |
| l2Normalize | 3.016 | 3.017 | 0.03% |
| binaryHalfByteDotProductSinglePackedScalar | 2.446 | 2.449 | 0.12% |
| binaryHalfByteDotProductBothPackedScalar | 2.368 | 2.376 | 0.34% |
| binaryHalfByteSquareScalar | 2.465 | 2.474 | 0.36% |
| binaryDotProductScalar | 2.289 | 2.298 | 0.39% |
| binaryHalfByteDotProductVector | 20.677 | 20.785 | 0.52% |
| binaryHalfByteSquareSinglePackedVector | 2.590 | 2.609 | 0.73% |
| floatSquareVector | 12.370 | 12.483 | 0.91% |
| binaryCosineVector | 4.778 | 4.860 | 1.72% |
| binaryHalfByteSquareBothPackedScalar | 1.642 | 1.696 | 3.29% |
Side note: I found this cool visualizer (https://jmh.morethan.io), which takes the JSON output of JMH (add `-rf json` to the command line), and can compare multiple runs too!
For example, I re-ran a subset of functions and recorded their output in https://gist.github.com/kaivalnp/0424bd84326aebdecd10f8144fb46c73 Now we can visualize the results at: https://jmh.morethan.io/?gist=0424bd84326aebdecd10f8144fb46c73
Also found this GH action that automatically runs and compares JMH output: https://github.com/benchmark-action/github-action-benchmark, might be interesting to add to Lucene!
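For reference, the same JSON output can also be produced from JMH's programmatic runner; here's a minimal sketch (the include pattern is just illustrative), equivalent to passing `-rf json -rff jmh-result.json` on the command line:

```java
import org.openjdk.jmh.results.format.ResultFormatType;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunBenchmarksWithJsonOutput {
  public static void main(String[] args) throws RunnerException {
    Options opts = new OptionsBuilder()
        .include("VectorUtilBenchmark")       // illustrative include pattern
        .resultFormat(ResultFormatType.JSON)  // same as -rf json
        .result("jmh-result.json")            // same as -rff jmh-result.json
        .build();
    new Runner(opts).run();
  }
}
```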
@uschindler just wanted to ask, did you get a chance to look at these changes? Thanks!
This change makes a lot of sense -- the FFM part of Panama is done incubating as of Java 22, so we should promote it out of Lucene's mrjar sources?
@kaivalnp I think you've addressed all of @uschindler's comments? Can we take this out of draft now? Are there any other parts you still need to do?
Thanks @mikemccand, IIUC Uwe's main concern was to not expose incubating / experimental APIs (i.e. the Vector API) publicly, which this PR doesn't -- I'm not sure what else to look out for...
One other potential issue I see is the renaming of public APIs -- like VectorUtil::dotProduct -> VectorUtil::dotProductBytes / VectorUtil::dotProductFloats. Earlier, we didn't expose signatures for off-heap computations, so the overloads were VectorUtil::dotProduct(byte[], byte[]) and VectorUtil::dotProduct(float[], float[]); now that we do, they'd both be VectorUtil::dotProduct(MemorySegment, MemorySegment) and need to be separated.
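To make the collision concrete, here's a rough sketch (names and bodies are just illustrative, not the actual Lucene code): the on-heap overloads can share a name because the array types differ, but both off-heap variants take (MemorySegment, MemorySegment), so they need distinct names:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

class DotProductOverloadSketch {
  // On-heap overloads can share a name: the parameter types keep them distinct.
  static int dotProduct(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  // Off-heap variants would both take (MemorySegment, MemorySegment), so the
  // overloads collide and need separate names, e.g.:
  static int dotProductBytes(MemorySegment a, MemorySegment b) {
    int sum = 0;
    for (long i = 0; i < a.byteSize(); i++) {
      sum += a.get(ValueLayout.JAVA_BYTE, i) * b.get(ValueLayout.JAVA_BYTE, i);
    }
    return sum;
  }

  static float dotProductFloats(MemorySegment a, MemorySegment b) {
    float sum = 0f;
    for (long i = 0; i < a.byteSize(); i += Float.BYTES) {
      sum += a.get(ValueLayout.JAVA_FLOAT_UNALIGNED, i)
           * b.get(ValueLayout.JAVA_FLOAT_UNALIGNED, i);
    }
    return sum;
  }
}
```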
I'll pull out of draft!
Hi, sorry for the delay. The FFM part was already moved out of the separate sourceSet on the main branch (we no longer have an MR-JAR in reality, it's copied together; we just have separate compilation units to allow us to compile against a stable API and optionally add MR-JAR entries when Panama changes in later Java versions). The separation was already removed... except for the glue classes for vectors. So MmapDir and madvise are already part of the main sources. So the issue description is wrong: this is only about Panama vectors.
What's done here is just moving some glue classes to the main sources, to allow the Panama vector code to access memory segments provided by MmapDir and possibly a new MemorySegment-backed replacement of ByteBuffersDirectory. This is needed and OK to do, but my main issue with this PR is the additional complexity just to achieve this! Why do we need all those additional abstractions with functional interfaces everywhere? I tried to understand this, and every time I looked at this PR I gave up after 20 minutes staring at those horrible abstractions with generics. Let's have exactly one interface, preferably without generics, as the glue between the main sources and the Panama vector code in its separate sourceSet.
Please make it simpler or give a full explanation of why we need all those extra generics and functional interfaces. The PR adds 400 extra lines of code instead of making things simpler!
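To sketch the kind of glue I have in mind (names purely hypothetical, not a concrete proposal): one small, generics-free interface in the main sources, implemented once by a scalar fallback and once by the Panama vector sourceSet:

```java
import java.lang.foreign.MemorySegment;

// Purely hypothetical sketch of a single glue interface. The main sources
// would program against this; the Panama vector sourceSet would supply the
// vectorized implementation, and a scalar fallback would live in main.
interface OffHeapVectorOps {
  int dotProduct(MemorySegment a, MemorySegment b, int elementCount);
  int squareDistance(MemorySegment a, MemorySegment b, int elementCount);
}
```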
Another problem that makes reviewing harder is the additional method renames. Can we separate those out, to make it easier to get a glimpse of what's going on?
Sorry for the delay, but it's busy here and that's too much complexity for a quick review. Maybe @rmuir can also have a look.
And finally: we have some benchmarks here, but those are too simple to show how the additional abstractions affect the HotSpot compiler when executed in real code. Micro benchmarks like this have led to problems that suddenly only appeared in Mike's benchmarks: if you benchmark a little bit of code with abstractions, HotSpot has an easy job removing the abstractions, but in complex environments during query execution the additional abstractions can kill your performance!
So because of this, I am really afraid of this PR going in in its current form.
Thanks @uschindler -- sorry for my ignorance, indeed I see we already promoted MMapIndexInput (using MemorySegment etc.)...
Hi, I think the current PR covers most of the work to do; I am just not sure if we really need the added complexity.
This comment in the issue:

> IMO it would provide a cleaner separation of functionality and simplify code a bit too (for e.g. we can move classes like Lucene99MemorySegmentByteVectorScorer out of java25/) + users that do not enable vectorization can score vectors off-heap
This is an argument, but I still don't see a reason to write code on top of MemorySegment for non-vectorized code. It does not get simpler, because you still need two different implementations: one for MemorySegment and one for RandomAccessInput / IndexInput.
The current code duplicates a lot of methods and has variants for (byte[], byte[]), (MemorySegment, byte[]), and (MemorySegment, MemorySegment).
Maybe for Lucene 11 the better idea would be to implement VectorUtil only with MemorySegment and throw away the byte[] impls. The code should be the same speed (if HotSpot works correctly). If you have byte[] code, you can wrap it as a MemorySegment before calling VectorUtil.
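For example (a sketch: MemorySegment.ofArray gives a zero-copy on-heap view over the array, and the loop below is only a stand-in for whatever MemorySegment-based VectorUtil method would exist):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

class WrapByteArraySketch {
  static int dotProductOfArrays(byte[] a, byte[] b) {
    MemorySegment segA = MemorySegment.ofArray(a); // on-heap, no copy
    MemorySegment segB = MemorySegment.ofArray(b);
    int sum = 0;
    for (long i = 0; i < segA.byteSize(); i++) {
      sum += segA.get(ValueLayout.JAVA_BYTE, i) * segB.get(ValueLayout.JAVA_BYTE, i);
    }
    return sum;
  }
}
```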
Provocative remark: in general, I tend to think that at some point we should throw away ByteBuffer and byte[] everywhere in our code and replace them with MemorySegment. This would also allow us to get rid of certain 31-bit limitations. Of course I am planning to submit a PR to replace ByteBuffersDirectory with a MemorySegment-backed variant! Keep in mind that MemorySegment also works on-heap! This would allow, for example, the current NRTCachingDirectory to vectorize like MMapDir.
P.S.: Sorry if my comment yesterday was a bit harsh regarding "horrible generics".
Uwe
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!