
Segment count (merging) can impact recall on KNN ParentJoin queries

Open vigyasharma opened this issue 7 months ago • 6 comments

I've been running benchmarks on the KNN parent-join query to get comparison numbers for multi-vectors (https://github.com/apache/lucene/pull/14173). I see a pretty notable difference in recall when merging is disabled on the writer. I would've expected latency to be somewhat impacted (although the impact here seems too high), but not recall. Creating an issue to dig more into this.

Setup

  1. Both lucene and luceneutil jar are on main branch
  2. To disable merges, I configured the writer's merge policy to NoMergePolicy.INSTANCE. So while we still configure a ConcurrentMergeScheduler, the merge policy does not find any merges, effectively disabling merging. More specifically, I added the following line to KnnIndexer.java:
    iwc.setMergePolicy(NoMergePolicy.INSTANCE);
    
  3. There is no other change between the two setups compared here.

Benchmark Results

# Parent Join Queries
# merging enabled
 recall  latency(ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.228        4.697   500000   100      50       64        250         no    113.17       4418.09             7         1473.92      1464.844     1464.844       HNSW
 0.179        3.043  1000000   100      50       64        250         no    244.78       4085.27             5         2948.15      2929.688     2929.688       HNSW
 0.202        3.735  2000000   100      50       64        250         no    469.05       4263.91             9         5896.90      5859.375     5859.375       HNSW

# merges disabled: note num_segments value 
recall  latency(ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.378       13.976   500000   100      50       64        250         no    107.52       4650.43            16         1473.82      1464.844     1464.844       HNSW
 0.415       21.928  1000000   100      50       64        250         no    225.22       4440.12            32         2947.82      2929.688     2929.688       HNSW
 0.466       33.751  2000000   100      50       64        250         no    478.83       4176.83            63         5896.20      5859.375     5859.375       HNSW


This doesn't look like a problem with regular KNN vector queries; it only appears with parent-join query benchmarks.

# Regular KNNFloatVectorQuery Benchmarks
# merging enabled
 recall  latency(ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.969       23.109   500000   100      50       64        250         no    394.34       1267.93             8         1501.47      1464.844     1464.844       HNSW
 0.916       11.001  1000000   100      50       64        250         no   1869.89        534.79             3         3017.29      2929.688     2929.688       HNSW
 0.951       30.394  2000000   100      50       64        250         no   2756.49        725.56            10         6027.67      5859.375     5859.375       HNSW

# merging disabled: note num_segments value
 recall  latency(ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.705       51.087   500000   100      50       64        250         no     95.62       5229.14            89         1489.43      1464.844     1464.844       HNSW
 0.960       90.863  1000000   100      50       64        250         no    192.26       5201.37           175         2980.16      2929.688     2929.688       HNSW
 0.971      178.730  2000000   100      50       64        250         no    396.67       5042.00           346         5962.90      5859.375     5859.375       HNSW

Recall and latency with merges disabled are comparable if I increase setRAMBufferSizeMB on the writer and create fewer segments.

vigyasharma avatar May 11 '25 00:05 vigyasharma

Sadly, this is expected. It's not only parent-join, but any kind of approximate NN search. Think of the limit where we have as many segments as there are documents: recall will always be 100% because we will perform a "brute force" index scan.
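
To illustrate the limit argument: if every segment is small enough to be searched exhaustively, merging the per-segment top-k always recovers the true global top-k, so recall is 1.0 by construction; approximation loss only enters when a large segment's graph search skips candidates. A minimal sketch (illustrative only, not Lucene's actual search path):

```python
import heapq

def search_segment(query, segment, k):
    # Exact (brute-force) top-k within one segment: score every doc
    # by squared Euclidean distance to the query vector.
    scored = [(sum((q - v) ** 2 for q, v in zip(query, vec)), doc_id)
              for doc_id, vec in segment]
    return heapq.nsmallest(k, scored)

def search_index(query, segments, k):
    # Collect per-segment top-k candidates, then keep the global top-k.
    candidates = []
    for segment in segments:
        candidates.extend(search_segment(query, segment, k))
    return sorted(doc_id for _, doc_id in heapq.nsmallest(k, candidates))
```

With exact per-segment search, the result is identical no matter how the docs are partitioned into segments; many small segments simply force more of the index toward this brute-force behavior.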

If we want to figure out how to maintain the same recall as the index merges, that would be an interesting problem. The pro-rata collection method we've switched to will tend to reduce the work done per segment as segments shrink, but it has enough of a buffer that I think we'd still see this effect.
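
The pro-rata idea can be sketched as giving each segment a share of the global candidate budget proportional to its size, with a minimum floor per segment. This is an illustrative sketch only, not Lucene's exact formula; the `floor` value is an assumption made for the example:

```python
import math

def per_segment_budget(k, segment_sizes, floor=10):
    """Illustrative pro-rata candidate budget: each segment gets a share
    of the global top-k proportional to its size, but never less than
    `floor`. Not Lucene's actual formula; `floor=10` is an assumption."""
    total = sum(segment_sizes)
    return [max(floor, math.ceil(k * size / total)) for size in segment_sizes]
```

With a few large segments the budgets sum to roughly k, but with many tiny segments the floor dominates, so the total candidate count (the "buffer") stays high, which is why recall still drifts upward as segment count grows.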

msokolov avatar May 11 '25 12:05 msokolov

Why are the recall values so bad with parent-join queries (whether merging is enabled or not)? Is there a bug?

jpountz avatar May 11 '25 20:05 jpountz

This doesn't look like a problem with regular KNN vector queries, only appears with parent-join query benchmarks.

Hmm it's odd for the 500K docs case that recall is so much better with FEWER segments: .969 with 8 segments, .705 with 89 segments.

For the other two rows (1M, 2M docs), recall is a bit better with more segments.

Do we expect better KNN single-valued vector recall with more segments? Hmm, maybe not with the new optimistic knn query (https://github.com/apache/lucene/pull/14226)? @vigyasharma were your runs with #14226?

mikemccand avatar May 20 '25 13:05 mikemccand

Hmm it's odd for the 500K docs case that recall is so much better with FEWER segments: .969 with 8 segments, .705 with 89 segments.

I missed that - thanks for pointing it out. It's definitely not expected. I haven't played with parent-join vector queries, but I agree the recall looks terrible - like there must be some kind of bug. Recall less than .6-.7 is generally kind of unusable. I wonder if we introduced some kind of regression recently. Do we have any long-term metrics tracking recall stats? I don't think luceneutil does that now, although we might have discussed the need for it before. Maybe we could try this on the 10.x branch to compare?

msokolov avatar May 22 '25 19:05 msokolov

Not sure if this is relevant: https://github.com/mikemccand/luceneutil/issues/385, but if it's the same problem it will be fixed with latest luceneutil.

dungba88 avatar May 23 '25 02:05 dungba88

Not sure if this is relevant: mikemccand/luceneutil#385, but if it's the same problem it will be fixed with latest luceneutil.

Neat find! I had also seen a drop in recall when all benchmark run configs were passed in the same command. For the above runs though, I ran each of them separately. I can run the numbers for 500k docs with merging disabled again just to be sure.

vigyasharma avatar May 25 '25 05:05 vigyasharma