Try to ensure we access the forward index sequentially
HitsFromQueryParallel may add documents in more or less random order, because documents from different segments are added in parallel. If we're sorting/grouping on context using the forward index, this can thrash the disk cache. It would be better to do a cheap preliminary sort by doc id before sorting on context.
Note that we can reuse Hits.withAscendingLuceneDocIds() for this. This method was created because the new DocValues API requires this property, but it can be useful for the forward index as well.
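As a rough illustration of the idea only (not BlackLab's actual implementation), the sketch below pre-sorts hits by Lucene doc id before the expensive context-based step. The `Hit` record and the helper method are hypothetical stand-ins; in BlackLab the pre-sort would be done by reusing Hits.withAscendingLuceneDocIds().

```java
// Minimal sketch, assuming a simplified Hit type: sort hits by Lucene doc id
// first, so that subsequent forward-index lookups for the context sort touch
// the index in roughly ascending doc-id order instead of jumping around.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PreSortByDocIdSketch {

    /** Hypothetical hit: a match position inside a Lucene document. */
    record Hit(int luceneDocId, int start, int end) { }

    /** Ensure ascending doc ids before any forward-index based sorting/grouping. */
    static List<Hit> withAscendingDocIds(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt(Hit::luceneDocId));
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
                new Hit(42, 10, 12),
                new Hit(3, 5, 7),
                new Hit(17, 0, 2));

        // Cheap pre-sort by doc id...
        List<Hit> byDocId = withAscendingDocIds(hits);

        // ...then the (expensive) context sort would read the forward index
        // in ascending doc-id order.
        byDocId.forEach(System.out::println);
    }
}
```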
One thing to keep in mind is that ascending ids do not necessarily mean sequential reads from disk, because documents can be deleted from the forward index and the freed space can be reused for other documents.
All this becomes less of an issue when the forward index is integrated into the Lucene index, because then everything becomes segment-based, which leads to more locality of access anyway.
The new integrated index format will help solve this, but we would still need to get rid of the global forward index interface, e.g. sort/group hits by context per segment first, then merge the sorted/grouped results from each segment (see the sketch below). The way we do it now likely still leads to fairly random disk access, unfortunately.
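A simplified sketch of that per-segment approach, using hypothetical types rather than BlackLab classes: group hits per segment, sort each segment's hits by a (precomputed) context key, then k-way merge the per-segment results into one sorted list.

```java
// Hypothetical sketch: per-segment sort followed by a k-way merge. SegmentHit
// and its contextKey are illustrative stand-ins; in reality the context key
// would come from the (segment-local) forward index.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class PerSegmentSortMergeSketch {

    /** Hypothetical hit with its segment and a precomputed context key. */
    record SegmentHit(int segment, int docId, String contextKey) { }

    static List<SegmentHit> sortPerSegmentThenMerge(List<SegmentHit> hits) {
        // 1. Group hits per segment.
        Map<Integer, List<SegmentHit>> bySegment = hits.stream()
                .collect(Collectors.groupingBy(SegmentHit::segment, TreeMap::new, Collectors.toList()));

        // 2. Sort each segment's hits by context key (this is where the
        //    forward index would be read, now with segment-local access).
        Comparator<SegmentHit> byContext = Comparator.comparing(SegmentHit::contextKey);
        bySegment.values().forEach(list -> list.sort(byContext));

        // 3. K-way merge the sorted per-segment lists into one sorted result.
        record Cursor(List<SegmentHit> list, int pos) { }
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                Comparator.comparing((Cursor c) -> c.list().get(c.pos()), byContext));
        bySegment.values().stream()
                .filter(list -> !list.isEmpty())
                .forEach(list -> heap.add(new Cursor(list, 0)));

        List<SegmentHit> merged = new ArrayList<>(hits.size());
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.add(c.list().get(c.pos()));
            if (c.pos() + 1 < c.list().size())
                heap.add(new Cursor(c.list(), c.pos() + 1));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<SegmentHit> hits = List.of(
                new SegmentHit(0, 3, "banana"),
                new SegmentHit(1, 7, "apple"),
                new SegmentHit(0, 9, "cherry"),
                new SegmentHit(1, 2, "date"));
        sortPerSegmentThenMerge(hits).forEach(System.out::println);
    }
}
```

The point of the merge step is that each segment's hits are only ever compared against already-sorted neighbours, so the expensive forward-index reads stay confined to one segment at a time.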