tantivy
Implementing Block WAND optimization for more queries
**Is your feature request related to a problem? Please describe.**

I only have one particular query in my system: a `BooleanQuery` of `TermQuery`s for the should, must, and must not clauses, where only the should clauses contribute to scoring via BM25.

I noticed that a `BooleanQuery` of `TermQuery`s implements the Block WAND optimization within `BooleanWeight::for_each_pruning` when it only has should clauses, but it falls back to the generic `for_each_pruning_scorer` when there are any must or must not clauses.
**Describe the solution you'd like**

I have a prototype for our usage of Tantivy within Convex that is able to utilize Block WAND even with must or must not clauses, and I'd be curious whether you'd be interested in us upstreaming it!

The high-level idea is to move the `for_each_pruning` API from the weights to the scorers. First, we add an optional method to the `Scorer` trait:
```rust
pub trait Scorer: DocSet {
    // Provided method with a no-op default, so existing scorers are unaffected.
    fn adjust_threshold(&mut self, _threshold: Score) {}
}
```
This method mutates the document set, filtering out documents past the current position whose score is less than or equal to `threshold`. The caller must call this method with monotonically increasing thresholds over the lifetime of the scorer.
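To make that contract concrete, here's a toy sketch of a scorer honoring the proposed `adjust_threshold` semantics. This is not tantivy code: `ToyScorer`, its in-memory postings representation, and the method names are invented for illustration.

```rust
type DocId = u32;
type Score = f32;
const TERMINATED: DocId = u32::MAX;

struct ToyScorer {
    // (doc, score) pairs in ascending doc order.
    postings: Vec<(DocId, Score)>,
    cursor: usize,
    threshold: Score,
}

impl ToyScorer {
    fn new(postings: Vec<(DocId, Score)>) -> Self {
        let mut scorer = Self { postings, cursor: 0, threshold: Score::MIN };
        scorer.skip_low_scores();
        scorer
    }

    // Skip past documents whose score does not beat the current threshold.
    fn skip_low_scores(&mut self) {
        while let Some(&(_, score)) = self.postings.get(self.cursor) {
            if score > self.threshold {
                break;
            }
            self.cursor += 1;
        }
    }

    fn doc(&self) -> DocId {
        self.postings.get(self.cursor).map_or(TERMINATED, |&(doc, _)| doc)
    }

    fn advance(&mut self) -> DocId {
        self.cursor += 1;
        self.skip_low_scores();
        self.doc()
    }

    // The proposed method: thresholds must be monotonically increasing
    // and only affect documents past the current position.
    fn adjust_threshold(&mut self, threshold: Score) {
        debug_assert!(threshold >= self.threshold);
        self.threshold = threshold;
    }
}

fn main() {
    let mut scorer = ToyScorer::new(vec![(1, 0.5), (3, 2.0), (7, 1.0), (9, 3.0)]);
    assert_eq!(scorer.doc(), 1);
    scorer.adjust_threshold(1.5); // From now on, only scores > 1.5 match.
    assert_eq!(scorer.advance(), 3); // score 2.0 beats 1.5
    assert_eq!(scorer.advance(), 9); // doc 7 (score 1.0) is pruned
    assert_eq!(scorer.advance(), TERMINATED);
}
```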
Then, the protocol between the collector and the scorer moves from being callback-based (like `for_each_pruning_scorer` today) to alternating between advancing the document set, updating the collector, and updating the threshold.
```rust
impl Collector for TopDocs {
    fn collect_segment(...) -> crate::Result<...> {
        ...
        let mut top_n = TopNComputer::new(heap_len);
        loop {
            let doc = scorer.doc();
            if doc == TERMINATED {
                break;
            }
            top_n.push(scorer.score(), doc);
            let threshold = top_n.threshold.unwrap_or(Score::MIN);
            scorer.adjust_threshold(threshold);
            scorer.advance();
        }
        ...
    }
}
```
Next, we implement a specialized `WeakAnd` scorer that takes most of the logic from today's `block_wand.rs`.
```rust
pub struct WeakAnd {
    scorers: Vec<TermScorerWithMaxScore>,
    threshold: Score,
    // Invariant: If `scorers` is nonempty, this is present with the current
    // match.
    current_match: Option<WeakAndMatch>,
}

struct WeakAndMatch {
    len: usize,
    doc: DocId,
    score: Score,
}
```
(Full source: https://gist.github.com/sujayakar/c742f07dd99089f4c529a242c3b5ac9c)
The idea here is that we run the Block WAND algorithm for generating pivot candidates in both `DocSet::advance` and `DocSet::seek`. So, a layer above can call `WeakAnd::seek` on some `doc: DocId`, and it will then resume where it left off, finding the next pivot after `doc` that's better than the current `threshold`.
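For readers unfamiliar with the algorithm, the pivot selection at the heart of WAND can be sketched roughly as follows. This is a simplified stand-in for the logic in `block_wand.rs`: `TermCursor` and `find_pivot` are hypothetical names, and the per-block max scores are collapsed into a single per-term `max_score` for brevity.

```rust
type DocId = u32;
type Score = f32;
const TERMINATED: DocId = u32::MAX;

// A simplified term cursor carrying an upper bound on its scores.
struct TermCursor {
    docs: Vec<DocId>,
    pos: usize,
    max_score: Score,
}

impl TermCursor {
    fn doc(&self) -> DocId {
        self.docs.get(self.pos).copied().unwrap_or(TERMINATED)
    }
}

/// Sort the cursors by their current doc and return the first doc at which
/// the accumulated max scores could exceed `threshold` (the WAND pivot).
/// Documents before the pivot cannot possibly beat the threshold.
fn find_pivot(cursors: &mut [TermCursor], threshold: Score) -> Option<DocId> {
    cursors.sort_by_key(|cursor| cursor.doc());
    let mut upper_bound = 0.0;
    for cursor in cursors.iter() {
        if cursor.doc() == TERMINATED {
            break;
        }
        upper_bound += cursor.max_score;
        if upper_bound > threshold {
            return Some(cursor.doc());
        }
    }
    None
}

fn main() {
    let mut cursors = vec![
        TermCursor { docs: vec![5, 9], pos: 0, max_score: 1.0 },
        TermCursor { docs: vec![2, 7], pos: 0, max_score: 2.0 },
    ];
    // With threshold 1.5, the cursor at doc 2 alone (max 2.0) beats it.
    assert_eq!(find_pivot(&mut cursors, 1.5), Some(2));
    // With threshold 2.5, we need both cursors combined (2.0 + 1.0 = 3.0).
    assert_eq!(find_pivot(&mut cursors, 2.5), Some(5));
}
```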
Then, this approach composes well with `query::intersection::Intersection`: if we're computing the intersection of `WeakAnd` with a single `TermScorer` (which we'll assume has a smaller `size_hint`), the intersection will...

- Iterate to the next document on the `TermScorer` to get a `candidate`.
- Advance the `WeakAnd` to this `candidate`, returning `seek_doc`. This step potentially skips over many `DocId`s in the union if the intersection is more selective.
- If `seek_doc > candidate`, advance the `TermScorer` and retry. In this condition, the `WeakAnd` having a high threshold potentially lets us skip over many `DocId`s in the intersection.
- Otherwise, emit `candidate`.
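The leapfrogging loop in the steps above might look roughly like this. This is a hedged sketch, not tantivy's actual `Intersection` implementation: the `Cursor` trait and `VecCursor` are invented stand-ins for the `DocSet` operations used, and a real `WeakAnd`'s `seek` would return the next pivot beating its threshold rather than just the next stored doc.

```rust
type DocId = u32;
const TERMINATED: DocId = u32::MAX;

// Minimal stand-in for the DocSet operations used below (hypothetical).
trait Cursor {
    fn doc(&self) -> DocId;
    /// Advance to the first doc >= target and return it.
    fn seek(&mut self, target: DocId) -> DocId;
}

struct VecCursor {
    docs: Vec<DocId>,
    pos: usize,
}

impl Cursor for VecCursor {
    fn doc(&self) -> DocId {
        self.docs.get(self.pos).copied().unwrap_or(TERMINATED)
    }
    fn seek(&mut self, target: DocId) -> DocId {
        while self.doc() < target {
            self.pos += 1;
        }
        self.doc()
    }
}

/// Find the next document present in both cursors, leapfrogging between them.
fn next_intersection_doc(
    term_scorer: &mut impl Cursor,
    weak_and: &mut impl Cursor,
) -> DocId {
    let mut candidate = term_scorer.doc();
    while candidate != TERMINATED {
        // Ask the WeakAnd for the next doc >= candidate; in the real scorer
        // this can leap over many DocIds that can't beat the threshold.
        let seek_doc = weak_and.seek(candidate);
        if seek_doc == candidate {
            return candidate; // Both sides agree: emit this document.
        }
        // WeakAnd skipped past candidate; catch the TermScorer up and retry.
        candidate = term_scorer.seek(seek_doc);
    }
    TERMINATED
}

fn main() {
    let mut term = VecCursor { docs: vec![2, 5, 8], pos: 0 };
    let mut wand = VecCursor { docs: vec![1, 5, 9], pos: 0 };
    assert_eq!(next_intersection_doc(&mut term, &mut wand), 5);
}
```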
A similar approach works for `Exclude`, which we use for the must not clauses.
I haven't tested this approach rigorously yet, and I also haven't evaluated how it performs compared to the current callback-based API. But if you are interested in this work, I can clean it up and start evaluating it. I'd also be curious how this `adjust_threshold` approach works with other score combiners.