Segment count (merging) can impact recall on KNN ParentJoin queries
I've been running benchmarks on the KNN parent-join query to get comparison numbers for multi-vectors (https://github.com/apache/lucene/pull/14173). I see a pretty notable difference in recall when merging is disabled on the writer. I would've expected latency to be somewhat impacted (although even the latency impact here seems too high), but not recall. Creating an issue to dig more into this.
Setup
- Both the lucene and luceneutil jars are on the `main` branch.
- To disable merges, I configured the writer's merge policy to `NoMergePolicy.INSTANCE`. So while we still configure a `ConcurrentMergeScheduler`, the merge policy does not find any merges, effectively disabling merging. More specifically, I added the following line to `KnnIndexer.java`: `iwc.setMergePolicy(NoMergePolicy.INSTANCE);` (see the sketch after this list).
- There is no other change b/w the two setups compared here.
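For context, here is a minimal sketch of the writer setup being described, assuming a standalone example rather than the actual `KnnIndexer.java` code (the class name, analyzer, and directory handling are illustrative assumptions):

```java
import java.nio.file.Path;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical standalone example; luceneutil's KnnIndexer builds its writer config differently.
public class NoMergeWriterSketch {
  public static IndexWriter open(Path indexPath) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // A ConcurrentMergeScheduler is still configured (it is the default anyway),
    // but NoMergePolicy never selects merges, so every flush leaves behind its own segment.
    iwc.setMergeScheduler(new ConcurrentMergeScheduler());
    iwc.setMergePolicy(NoMergePolicy.INSTANCE);
    Directory dir = FSDirectory.open(indexPath);
    return new IndexWriter(dir, iwc);
  }
}
```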
Benchmark Results
```
# Parent Join Queries

# merging enabled
recall latency(ms) nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.228 4.697 500000 100 50 64 250 no 113.17 4418.09 7 1473.92 1464.844 1464.844 HNSW
0.179 3.043 1000000 100 50 64 250 no 244.78 4085.27 5 2948.15 2929.688 2929.688 HNSW
0.202 3.735 2000000 100 50 64 250 no 469.05 4263.91 9 5896.90 5859.375 5859.375 HNSW

# merges disabled: note num_segments value
recall latency(ms) nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.378 13.976 500000 100 50 64 250 no 107.52 4650.43 16 1473.82 1464.844 1464.844 HNSW
0.415 21.928 1000000 100 50 64 250 no 225.22 4440.12 32 2947.82 2929.688 2929.688 HNSW
0.466 33.751 2000000 100 50 64 250 no 478.83 4176.83 63 5896.20 5859.375 5859.375 HNSW
```
This doesn't look like a problem with regular KNN vector queries; it only appears in the parent-join query benchmarks.
```
# Regular KNNFloatVectorQuery Benchmarks

# merging enabled
recall latency(ms) nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.969 23.109 500000 100 50 64 250 no 394.34 1267.93 8 1501.47 1464.844 1464.844 HNSW
0.916 11.001 1000000 100 50 64 250 no 1869.89 534.79 3 3017.29 2929.688 2929.688 HNSW
0.951 30.394 2000000 100 50 64 250 no 2756.49 725.56 10 6027.67 5859.375 5859.375 HNSW

# merging disabled: note num_segments value
recall latency(ms) nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.705 51.087 500000 100 50 64 250 no 95.62 5229.14 89 1489.43 1464.844 1464.844 HNSW
0.960 90.863 1000000 100 50 64 250 no 192.26 5201.37 175 2980.16 2929.688 2929.688 HNSW
0.971 178.730 2000000 100 50 64 250 no 396.67 5042.00 346 5962.90 5859.375 5859.375 HNSW
```
Recall and latency with merges disabled are comparable to the merging-enabled runs if I increase `setRAMBufferSizeMB` on the writer so that fewer segments are created.
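For reference, the knob mentioned above is the writer's RAM buffer. A minimal sketch of that tweak, extending the `iwc` config from the earlier sketch (the 2048 MB value is an illustrative assumption, not what these runs used):

```java
// A larger RAM buffer means fewer flushes and therefore fewer segments,
// even with merging disabled. 2048 MB is just an illustrative value.
iwc.setRAMBufferSizeMB(2048);
```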
Sadly, this is expected. It's not only parent-join, but any kind of approximate NN search. Think of the limit where we have as many segments as there are documents: recall will always be 100% because we will effectively perform a "brute force" index scan.
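To make that concrete with the parent-join numbers above (rough, illustrative arithmetic that ignores fanout and how candidates are actually budgeted per segment): in the 2M-doc run with merges disabled there are 63 segments, and each segment's graph search surfaces its own candidates for the final top 100, so on the order of 63 × 100 ≈ 6,300 vectors compete for the global top 100, versus roughly 9 × 100 ≈ 900 in the 9-segment merged index. That extra per-segment work is also why latency grows from ~3.7 ms to ~33.8 ms.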
It would be an interesting problem to figure out how to maintain the same recall as the index merges. The pro-rata collection method we've switched to now will tend to reduce the work done per segment as segments shrink, but it has enough of a buffer that I think we'd still see this effect.
Why are the recall values so bad with parent-join queries (whether merging is enabled or not)? Is there a bug?
> This doesn't look like a problem with regular KNN vector queries; it only appears in the parent-join query benchmarks.
Hmm it's odd for the 500K docs case that recall is so much better with FEWER segments: .969 with 8 segments, .705 with 89 segments.
For the other two rows (1M, 2M docs), recall is a bit better with more segments.
Do we expect better KNN single-valued vector recall with more segments? Hmm, maybe not with the new optimistic knn query (https://github.com/apache/lucene/pull/14226)? @vigyasharma were your runs with #14226?
> Hmm it's odd for the 500K docs case that recall is so much better with FEWER segments: .969 with 8 segments, .705 with 89 segments.
I missed that - thanks for pointing it out. It's definitely not expected. I haven't played with parent-join vector queries, but I agree the recall looks terrible - like there must be some kind of bug. Recall less than .6-.7 is generally kind of unusable. I wonder if we introduced some kind of regression recently. Do we have any long-term metrics tracking recall stats? I don't think luceneutil does that now, although we might have discussed the need for it before. Maybe we could try this on the 10.x branch to compare?
Not sure if this is relevant: https://github.com/mikemccand/luceneutil/issues/385, but if it's the same problem it will be fixed with latest luceneutil.
Neat find! I had also seen a drop in recall when all benchmark run configs were passed in the same command. For the above runs though, I ran each of them separately. I can run the numbers for 500k docs with merging disabled again just to be sure.