Enable Faiss-based vector format to index a larger number of vectors in a single segment
Description
I was trying to index a large number of vectors in a single segment, and ran into an error because of the way we copy vectors to native memory, before calling Faiss to create an index:
Caused by: java.lang.IllegalStateException: Segment is too large to wrap as ByteBuffer. Size: 3276800000
at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.checkArraySize(AbstractMemorySegmentImpl.java:374)
at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:314)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:158)
This limitation was hit because we use a ByteBuffer (backed by native memory) to copy vectors from the heap -- and since a ByteBuffer is indexed by a signed int, it cannot address more than 2 GB
As a fix, I've changed it to use MemorySegment-specific functions to copy vectors (also moving away from these byte buffers in other places, and using more appropriate IO methods)
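To illustrate the idea, here's a minimal sketch (assumed shape, not the exact code from this PR) of copying vectors one at a time with MemorySegment.copy, which takes a long destination offset and so isn't bound by the 2 GB ByteBuffer limit:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

class VectorCopySketch {
  // Sketch (not the exact PR code): copy vectors one at a time into a
  // native segment. MemorySegment.copy(...) addresses the destination with
  // a long offset, so the total size may exceed 2 GB -- unlike
  // segment.asByteBuffer(), which throws once the segment crosses 2^31 bytes.
  static MemorySegment copyVectors(Arena arena, float[][] vectors, int dimension) {
    long byteSize = (long) vectors.length * dimension * Float.BYTES;
    MemorySegment segment = arena.allocate(byteSize, ValueLayout.JAVA_FLOAT.byteAlignment());
    for (int i = 0; i < vectors.length; i++) {
      long dstOffset = (long) i * dimension * Float.BYTES;
      MemorySegment.copy(vectors[i], 0, segment, ValueLayout.JAVA_FLOAT, dstOffset, dimension);
    }
    return segment;
  }
}
```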
With these changes, we no longer see the above error and are able to build and search an index. Also ran benchmarks for a case where this limit was not hit to check for performance impact:
Baseline (on main):
| type | recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | force_merge(s) | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| faiss | 0.997 | 1.855 | 1.819 | 0.981 | 100000 | 100 | 50 | 32 | 200 | no | 31.07 | 3218.44 | 32.76 | 1 | 3152.11 | 1562.500 | 1562.500 | HNSW |
Candidate (on this PR):
| type | recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | force_merge(s) | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| faiss | 0.998 | 1.817 | 1.794 | 0.987 | 100000 | 100 | 50 | 32 | 200 | no | 29.57 | 3381.46 | 33.20 | 1 | 3152.11 | 1562.500 | 1562.500 | HNSW |
..and indexing / search performance is largely unchanged
Edit: Related to #14178
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
@msokolov I wasn't sure about attempting to index a large amount of vector data, given that it'll take up a few GB of RAM. I've added a test for now, please let me know if I should keep it (or how to test it better). Perhaps having the test is fine, because we run Faiss tests (and only those) in a separate GH action?
The test fails deterministically when added to main:
> java.lang.IllegalStateException: Segment is too large to wrap as ByteBuffer. Size: 2149576700
> at __randomizedtesting.SeedInfo.seed([1B557576B3F191C9:6F03E13EF7CEF63A]:0)
> at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.checkArraySize(AbstractMemorySegmentImpl.java:374)
> at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.asByteBuffer(AbstractMemorySegmentImpl.java:199)
> at org.apache.lucene.sandbox.codecs.faiss.LibFaissC.createIndex(LibFaissC.java:224)
Sorry, I was too vague - I didn't mean we should be testing the > 2GB case! I just wanted to make sure we had unit test coverage for these classes at all, because I'm not familiar with this part of the codebase
> we had unit test coverage for these classes at all
Yes, we have a test class that runs all tests in BaseKnnVectorsFormatTestCase
We had to modify / disable a few because the format only supports float vectors and a few similarity functions..
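For context, here's a minimal sketch of the usual way a format test class plugs into that shared suite (the codec override is the standard pattern; the no-arg FaissKnnVectorsFormat constructor is an assumption here):

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat;
import org.apache.lucene.tests.index.BaseKnnVectorsFormatTestCase;
import org.apache.lucene.tests.util.TestUtil;

// Sketch: BaseKnnVectorsFormatTestCase supplies the shared tests; the
// subclass only has to say which codec / vectors format to run them with.
public class TestFaissKnnVectorsFormat extends BaseKnnVectorsFormatTestCase {
  @Override
  protected Codec getCodec() {
    // assumed constructor -- the real format may take index parameters
    return TestUtil.alwaysKnnVectorsFormat(new FaissKnnVectorsFormat());
  }
}
```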
We run these tests on each PR / commit via GH actions, see sample run from this PR, which ran:
> Task :lucene:sandbox:test
:lucene:sandbox:test (SUCCESS): 53 test(s), 8 skipped
> I didn't mean we should be testing the > 2GB case
I kind of like that we have this test, can we just mark it as "monster" so that we don't run it locally / from GH actions? Also refactored a bit to make backporting easier..
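For reference, a sketch of what the annotation looks like (the reason string and test body here are illustrative):

```java
import org.apache.lucene.tests.util.LuceneTestCase;
import org.apache.lucene.tests.util.LuceneTestCase.Monster;

// Sketch: @Monster-annotated tests are skipped unless -Dtests.monster=true,
// which keeps multi-GB tests out of regular local and CI runs.
public class TestLargeVectorDataSketch extends LuceneTestCase {
  @Monster("allocates over 2 GB of native memory")
  public void testLargeVectorData() throws Exception {
    // index enough float vectors that the flat native copy exceeds 2 GB,
    // force-merge to a single segment, and verify search succeeds
  }
}
```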
I was able to run it using:
./gradlew -p lucene/sandbox -Dtests.faiss.run=true test --tests "org.apache.lucene.sandbox.codecs.faiss.*" -Dtests.monster=true -Dtests.heapsize=16g
..where it took a (relatively) long time to run:
:lucene:sandbox:test (SUCCESS): 53 test(s), 8 skipped
The slowest tests during this run:
14.64s TestFaissKnnVectorsFormat.testLargeVectorData (:lucene:sandbox)
The slowest suites during this run:
16.63s TestFaissKnnVectorsFormat (:lucene:sandbox)
Also, running it on main gives the same error as above
> I kind of like that we have this test, can we just mark it as "monster" so that we don't run it locally / from GH actions?
+1, this is exactly why we have the monster annotation!
Thanks @mikemccand!
Do we have any tests that check for memory leaks?
I don't think we have tests today, so I opened #14875 to track it -- plus the broader question of how to safely use the new format!
@mikemccand I stumbled upon a way to allocate a long[] in native memory using a specific byte order (LITTLE_ENDIAN) -- which we use in a filtered search (i.e. if an explicit filter is provided, or the segment has deletes)
With this, I think we've moved away from all ByteBuffer usages to copy bytes to native memory in LibFaissC
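A sketch of that allocation (assuming Java 22's Arena.allocateFrom; the long[] here stands in for the words of a filter / liveness bit set):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;

class FilterBitsSketch {
  // Sketch (assumed API usage, not the exact PR code): allocate-and-copy the
  // words of a filter bit set into native memory with an explicit
  // little-endian layout, so the same byte order is seen on any platform.
  static MemorySegment copyBits(Arena arena, long[] bits) {
    return arena.allocateFrom(
        ValueLayout.JAVA_LONG.withOrder(ByteOrder.LITTLE_ENDIAN), bits);
  }
}
```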
Edit: Also posting a benchmark run to check that we didn't change any behavior
main:
| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | force_merge(s) | num_segments | index_size(MB) | selectivity | filterType | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.702 | 1.893 | 1.765 | 0.932 | 100000 | 100 | 50 | 64 | 250 | no | 8.89 | 11251.13 | 10.70 | 1 | 637.45 | 0.10 | pre-filter | 292.969 | 292.969 | HNSW |
This PR:
| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | force_merge(s) | num_segments | index_size(MB) | selectivity | filterType | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.702 | 1.851 | 1.763 | 0.952 | 100000 | 100 | 50 | 64 | 250 | no | 7.99 | 12514.08 | 10.40 | 1 | 637.45 | 0.10 | pre-filter | 292.969 | 292.969 | HNSW |
There is no tangible difference in performance (seems to be within range of noise)..
Thanks @kaivalnp -- I'll merge this one soon. Let's remember to also backport this to 10.x?
Could you also add an entry in CHANGES.txt? I think it's important to show that this Faiss based KNN Lucene codec format can handle large KNN indices...
> entry in CHANGES.txt
Thanks @mikemccand, I thought it was a follow-up to the original PR adding the codec, and may not need a separate entry -- but I've added one under "Bug Fixes" now..
I'll update the backport PR once this is merged!