tantivy icon indicating copy to clipboard operation
tantivy copied to clipboard

Intersection Performance

Open PSeitz opened this issue 1 year ago • 2 comments

The current intersection code looks like this (simplified)

fn advance(&mut self) -> DocId {
    let (left, right) = (&mut self.left, &mut self.right);
    let mut candidate = left.advance();

    loop {
        let right_doc = right.seek(candidate);
        candidate = left.seek(right_doc);
        if candidate == right_doc {
            return candidate;
        }
    }
}

left is the doc_set with the smallest size_hint in the group. It is advanced first then we seek on right. seek is not only going to the passed candidate, but searches for the next hit in that docset.

The problem with this approach is that we have Docset types with vastly different cost for seek which is not reflected here.

E.g. fast field range queries may do a full scan, if there is no next hit. phrase queries load the positions to check if a we have a hit.

Ideally we would use the "cheapest" docset as the driver for the intersection, and all others just check if they have a hit for the passed candidate. Some docsets, like a precomputed bitset are excellent drivers, but some, like fast field based range, queries aren't. Cheapness is typically related to number of docs in the docset only for same type of docsets. Between different docsets, we may have something like cost category (can be covered by e.g. using the high bits in u64).

To reflect that a cost based metric would make sense and an api that doesn't advance past target.

Proposal: Seek Exact + Cost Based

pub trait DocSet: Send {
    ...
    /// Returns a best-effort hint of the
    /// length of the docset.
    fn size_hint(&self) -> u32;

    /// Returns a best-effort hint of the
    /// cost to drive the docset.
    fn cost(&self) -> u64{
        self.size_hint()
    }

    /// Seeks to the target if necessary and checks if the target is an exact match.
    ///
    /// Some implementations may choose to advance past the target if beneficial for performance.
    /// The return value is `true` if the target is in the docset.
    fn seek_exact(&mut self, target: DocId) -> bool {
        let current_doc = self.doc();
        if current_doc < target {
            self.seek(target);
        }
        self.doc() == target
    }
   ...
}

Related: https://github.com/quickwit-oss/tantivy/issues/2266

PSeitz avatar Oct 25 '24 11:10 PSeitz

There is also the two phase approach of Lucene.

fulmicoton avatar Oct 25 '24 12:10 fulmicoton

Yes, I had a look at that. TwoPhaseIterator::approximation returns a DocIdSetIterator where

The returned DocIdSetIterator is a superset of the matching documents, and each match 
needs to be confirmed with matches() in order to know whether it matches or not.

This should work good for PhraseQueries, and it seems this was designed with them in mind, since they have kind of two phases (match on the docset + match on the positions). But for the FastFieldRangeQuery this would still forward the docset and produce the same issue as we have now.

The change and handling of the API would also probably have more overhead.

PSeitz avatar Oct 25 '24 12:10 PSeitz