
TopFieldCollector mistakenly assumes that all leaves share the same index sort

Open jpountz opened this issue 9 months ago • 6 comments

TopFieldCollector caches whether the search sort is a prefix of the index sort across leaves. While IndexWriter enforces that the whole index has the same index sort, it is possible to create a MultiReader across several indexes which have different index sorts, so this cache is incorrect.
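To make the failure mode concrete, here is a minimal, self-contained sketch that does not use Lucene at all. It assumes the cached flag is populated from the first leaf visited and then reused. `Leaf`, `CachingCollector`, and `isPrefix` are hypothetical stand-ins, and a sort is simplified to an ordered list of field names (a real `Sort` also carries types and reverse flags).

```java
import java.util.List;

public class CachedSortCheckDemo {
    // Hypothetical stand-in for a leaf: just a name and its index sort.
    static final class Leaf {
        final String name;
        final List<String> indexSort;
        Leaf(String name, List<String> indexSort) {
            this.name = name;
            this.indexSort = indexSort;
        }
    }

    // True if searchSort is a prefix of indexSort (simplified to field names only).
    static boolean isPrefix(List<String> searchSort, List<String> indexSort) {
        return searchSort.size() <= indexSort.size()
            && indexSort.subList(0, searchSort.size()).equals(searchSort);
    }

    // Mimics the problematic pattern: decide once, reuse for every later leaf.
    static final class CachingCollector {
        private final List<String> searchSort;
        private Boolean cached; // null until the first leaf is seen

        CachingCollector(List<String> searchSort) {
            this.searchSort = searchSort;
        }

        boolean searchSortPartOfIndexSort(Leaf leaf) {
            if (cached == null) {
                cached = isPrefix(searchSort, leaf.indexSort);
            }
            return cached; // ignores the index sort of every leaf after the first
        }
    }

    public static void main(String[] args) {
        List<String> searchSort = List.of("timestamp");
        // Two leaves as they might appear under a MultiReader over two indexes.
        Leaf a = new Leaf("a", List.of("timestamp", "id"));
        Leaf b = new Leaf("b", List.of("id"));

        CachingCollector collector = new CachingCollector(searchSort);
        for (Leaf leaf : List.of(a, b)) {
            System.out.println(leaf.name
                + " cached=" + collector.searchSortPartOfIndexSort(leaf)
                + " actual=" + isPrefix(searchSort, leaf.indexSort));
        }
    }
}
```

In this sketch the second leaf is treated as if it were sorted by `timestamp`, even though its own index sort starts with `id`.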

jpountz avatar Mar 24 '25 21:03 jpountz

This is an interesting issue, and I don't see a good solution other than removing the cache itself. I wonder whether it would be a good idea for the Collector to know the List<LeafReaderContext>, similar to how setWeight passes the Weight to the Collector. TopFieldCollector would then be able to compute searchSortPartOfIndexSort correctly and use that information within TopFieldLeafCollector.
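A rough sketch of what computing the flag up front from the full list of leaves could look like. This is plain Java with no Lucene dependency. `allLeavesMatch` and `isPrefix` are hypothetical helpers, each leaf is reduced to its index sort as a list of field names, and an unsorted leaf is modeled as an empty list.

```java
import java.util.List;

public class UpFrontSortCheck {
    // True if searchSort is a prefix of indexSort (simplified to field names only).
    static boolean isPrefix(List<String> searchSort, List<String> indexSort) {
        return searchSort.size() <= indexSort.size()
            && indexSort.subList(0, searchSort.size()).equals(searchSort);
    }

    // Collector-wide answer: true only if every leaf's index sort starts with the search sort.
    static boolean allLeavesMatch(List<String> searchSort, List<List<String>> leafIndexSorts) {
        return leafIndexSorts.stream().allMatch(indexSort -> isPrefix(searchSort, indexSort));
    }

    public static void main(String[] args) {
        List<String> searchSort = List.of("timestamp");
        List<List<String>> uniform = List.of(
            List.of("timestamp", "id"),
            List.of("timestamp"));
        List<List<String>> mixed = List.of(
            List.of("timestamp", "id"),
            List.of("id"),
            List.of()); // unsorted leaf

        System.out.println("uniform: " + allLeavesMatch(searchSort, uniform));
        System.out.println("mixed: " + allLeavesMatch(searchSort, mixed));
    }
}
```

One consequence of this design: a single leaf with a different index sort turns the flag off for every leaf, including those where the search sort is a prefix of the index sort.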

jainankitk avatar Apr 14 '25 06:04 jainankitk

@jpountz - Any thoughts?

jainankitk avatar May 09 '25 18:05 jainankitk

I don't see an obvious solution either. My preference would be to remove the cache and make this decision on a per-segment basis, but this would require moving some methods around, e.g. Comparator#disableSkipping -> LeafComparator#disableSkipping.

jpountz avatar May 11 '25 06:05 jpountz

Would it make sense to have different collectors for the two cases, one with and one without a cache?

msokolov avatar May 11 '25 12:05 msokolov

What are the two cases that you have in mind? I don't think that having a collector with a cache makes sense, since it assumes that leaves are uniform, which may not be correct. However, we could have different LeafCollectors for the case when the search sort is a prefix of the index sort on the one hand, and the case when the search sort is not a prefix of the index sort on the other hand.
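As a sketch of that last idea, the decision can be made each time a leaf collector is requested, with nothing cached on the collector. This is plain Java, not Lucene code: `LeafCollector` here is a local hypothetical interface, and `IndexSortedLeafCollector`, `GenericLeafCollector`, and `isPrefix` are invented names for illustration, with sorts again simplified to lists of field names.

```java
import java.util.List;

public class PerLeafCollectorChoice {
    // Hypothetical, minimal leaf collector interface for this sketch only.
    interface LeafCollector {
        String describe();
    }

    // Used when the search sort is a prefix of this leaf's index sort.
    static final class IndexSortedLeafCollector implements LeafCollector {
        public String describe() {
            return "search sort is a prefix of the index sort";
        }
    }

    // Used for every other leaf, including unsorted ones.
    static final class GenericLeafCollector implements LeafCollector {
        public String describe() {
            return "search sort is not a prefix of the index sort";
        }
    }

    // True if searchSort is a prefix of indexSort (simplified to field names only).
    static boolean isPrefix(List<String> searchSort, List<String> indexSort) {
        return searchSort.size() <= indexSort.size()
            && indexSort.subList(0, searchSort.size()).equals(searchSort);
    }

    // Decided per leaf: no state is carried over from previously visited leaves.
    static LeafCollector getLeafCollector(List<String> searchSort, List<String> leafIndexSort) {
        return isPrefix(searchSort, leafIndexSort)
            ? new IndexSortedLeafCollector()
            : new GenericLeafCollector();
    }

    public static void main(String[] args) {
        List<String> searchSort = List.of("timestamp");
        List<List<String>> leafIndexSorts = List.of(
            List.of("timestamp", "id"),
            List.of("id"));
        for (List<String> indexSort : leafIndexSorts) {
            System.out.println(indexSort + " -> " + getLeafCollector(searchSort, indexSort).describe());
        }
    }
}
```

Because the choice is recomputed for each leaf, a MultiReader over indexes with different index sorts gets the right variant on every segment.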

jpountz avatar May 12 '25 11:05 jpountz

> Would it make sense to have different collectors for the two cases, one with and one without a cache?

I'm probably confused. I was thinking in a general way that perhaps we could have a decision that would allow results to be cached (and fetched from the cache) only when searching in a context where all readers' leaves share the same index sort, but I confess I don't have any clear idea of how this would be implemented.

msokolov avatar May 19 '25 15:05 msokolov