
Try to ensure we access the forward index sequentially

Open · jan-niestadt opened this issue 2 years ago · 2 comments

HitsFromQueryParallel may add documents in more or less random order, because documents from different segments are added in parallel. This can thrash the disk cache if we're sorting/grouping on context using the forward index. It would be better to do a quick sort by doc id before sorting on context.
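
For illustration, a minimal self-contained sketch of the idea (the Hit class and helper below are hypothetical stand-ins, not BlackLab's actual classes): do a cheap integer sort on doc id first, so that the expensive context sort/group that follows reads the forward index in ascending document order rather than in the effectively random order produced by parallel per-segment collection.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class Hit {
    final int docId;        // global Lucene document id
    final int start, end;   // token positions of the hit in the document

    Hit(int docId, int start, int end) {
        this.docId = docId;
        this.start = start;
        this.end = end;
    }
}

class ForwardIndexOrdering {
    /** Cheap integer pre-sort by doc id; the expensive context sort comes after. */
    static List<Hit> inDocIdOrder(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(Comparator.comparingInt((Hit h) -> h.docId));
        return copy;
    }
}
```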

jan-niestadt avatar Apr 12 '22 14:04 jan-niestadt

Note that we can reuse Hits.withAscendingLuceneDocIds() for this. That method was created because the new DocValues API requires documents to be visited in ascending doc id order, but it can be useful for the forward index as well.
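
A sketch of the intended call order, assuming withAscendingLuceneDocIds() returns an equivalent Hits object with its hits in ascending Lucene doc id order. The package paths and the sort(HitProperty) call are assumptions to verify against the actual BlackLab API; only withAscendingLuceneDocIds() is taken from the comment above.

```java
import nl.inl.blacklab.resultproperty.HitProperty;   // assumed package path
import nl.inl.blacklab.search.results.Hits;          // assumed package path

class ContextSortWithLocality {
    static Hits sortOnContext(Hits hits, HitProperty contextProperty) {
        // Cheap doc-id reordering first, so the expensive context sort reads
        // the forward index roughly front to back.
        Hits docOrdered = hits.withAscendingLuceneDocIds();
        return docOrdered.sort(contextProperty); // illustrative sort call
    }
}
```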

One thing to keep in mind is that ascending ids do not necessarily mean sequential reads from disk, because documents can be deleted from the forward index and the freed space can be reused for other documents.

All this becomes less of an issue when the forward index is integrated into the Lucene index, because then everything becomes segment-based, which leads to more locality of access anyway.

jan-niestadt avatar Apr 13 '22 10:04 jan-niestadt

The new integrated index format will help solve this, but we would still need to get rid of the global forward index interface, e.g. sort/group hits by context per segment first, then merge the sorted/grouped results from each segment. The way we do it now likely still leads to quite random disk access, unfortunately.
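
To make the per-segment idea concrete, here is an illustrative sketch in plain Java (not BlackLab code): group hits by some context key within each segment, where forward-index reads stay local to that segment, then merge the per-segment group counts into the global result.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PerSegmentGrouping {

    /** Group hits by a context key within a single segment (good locality). */
    static Map<String, Long> groupSegment(List<String> contextKeysForSegmentHits) {
        Map<String, Long> groups = new HashMap<>();
        for (String key : contextKeysForSegmentHits)
            groups.merge(key, 1L, Long::sum);
        return groups;
    }

    /** Merge the per-segment group counts into a global result. */
    static Map<String, Long> mergeSegments(List<Map<String, Long>> perSegmentGroups) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> segmentGroups : perSegmentGroups)
            segmentGroups.forEach((key, count) -> merged.merge(key, count, Long::sum));
        return merged;
    }
}
```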

jan-niestadt avatar Feb 21 '23 14:02 jan-niestadt