
Add a fast-field variant of TermSet

Open stuhood opened this issue 4 months ago • 5 comments

The TermSet query currently produces one Scorer/DocSet per matched term by scanning the term dictionary and then consuming posting lists. For very large term sets on a field that also has a fast field, it is faster to scan the fast-field column while intersecting with a HashSet of (encoded) term values.

Following the pattern set by the two execution modes of RangeQuery, this PR introduces a fast-field variant of TermSet and selects it when there are more than 1024 input terms (an arbitrary threshold!).
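To make the dispatch concrete, here is a minimal sketch of the threshold-based strategy selection. The names (`TermSetStrategy`, `choose_strategy`, `FAST_FIELD_THRESHOLD`) are illustrative, not tantivy's actual internals; only the 1024 cutoff comes from the PR description.

```rust
// Hypothetical dispatch sketch: pick the execution mode by term count,
// mirroring the PR's (arbitrary) 1024-term threshold.
const FAST_FIELD_THRESHOLD: usize = 1024;

#[derive(Debug)]
enum TermSetStrategy {
    /// One DocSet per term, driven by the inverted index.
    PostingLists,
    /// Single scan over the fast-field column.
    FastField,
}

fn choose_strategy(num_terms: usize) -> TermSetStrategy {
    if num_terms > FAST_FIELD_THRESHOLD {
        TermSetStrategy::FastField
    } else {
        TermSetStrategy::PostingLists
    }
}
```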

Performance is significantly improved for large TermSets of primitives.
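The core idea of the fast-field mode can be sketched as follows. This is a simplified illustration, assuming the column is just a dense slice of encoded `u64` values indexed by doc id; tantivy's real `Column` abstraction and the encoding of term values are more involved.

```rust
use std::collections::HashSet;

// Hypothetical sketch of the fast-field execution mode: instead of opening
// one posting list per term, scan the dense column once and keep every doc
// whose encoded value is a member of the term set.
fn scan_fast_field(column: &[u64], term_set: &HashSet<u64>) -> Vec<u32> {
    column
        .iter()
        .enumerate()
        .filter(|&(_, value)| term_set.contains(value))
        .map(|(doc_id, _)| doc_id as u32)
        .collect()
}
```

With N docs and T terms, this is one sequential pass with O(1) hash probes per doc, rather than T dictionary lookups plus T posting-list traversals with merging.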

stuhood · Oct 15 '25 23:10

I think a precomputed BitSet should also work well, similar to what I did in regex_phrase_weight

PSeitz · Oct 16 '25 09:10

> I think a precomputed BitSet should also work well, similar to what I did in regex_phrase_weight

Hm, interesting.

That could be implemented using either the posting lists or fast fields, I assume? So it seems like it's independent of which index we use.

For very high term counts, I think that you'll still want to scan the fast field, since it is so dense? Requires benchmarking probably.

stuhood · Oct 16 '25 16:10

>> I think a precomputed BitSet should also work well, similar to what I did in regex_phrase_weight
>
> Hm, interesting.
>
> That could be implemented using either the posting lists or fast fields, I assume? So it seems like it's independent of which index we use.
>
> For very high term counts, I think that you'll still want to scan the fast field, since it is so dense? Requires benchmarking probably.

The BitSet would be precomputed on the inverted index. I think we would need a benchmark to justify the additional code. (also good to know :)

PSeitz · Oct 26 '25 18:10

> I think we would need a benchmark to justify the additional code.

FWIW: the benchmarks that I did in ParadeDB for this PR are over here: https://github.com/paradedb/paradedb/pull/3412 ... For large term sets, consuming the posting lists uses a very large amount of memory and does a lot of seeking: this implementation is about 7 times faster in one worker process. You can ignore the "multiple workers as pessimization" aspect: that's specific to having a huge query in ParadeDB.

stuhood · Oct 26 '25 22:10

>> I think we would need a benchmark to justify the additional code.
>
> FWIW: the benchmarks that I did in ParadeDB for this PR are over here: paradedb/paradedb#3412 ... For large term sets, consuming the posting lists uses a very large amount of memory and does a lot of seeking: this implementation is about 7 times faster in one worker process. You can ignore the "multiple workers as pessimization" aspect: that's specific to having a huge query in ParadeDB.

I mean comparing the BitSet variant (via BitSetDocSet) with the fast-field variant. In the BitSet variant we would fill the BitSet one by one from the posting lists, which would fix the memory usage at num_docs/8 bytes and remove all seeking.
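A minimal sketch of that BitSet variant, assuming each posting list is just a sorted `Vec<u32>` of doc ids (tantivy's real `BitSet`/`BitSetDocSet` types are more elaborate; this only illustrates the shape of the idea):

```rust
// Hypothetical sketch: walk each term's posting list once, setting one bit
// per matching doc. Memory is fixed at roughly num_docs / 8 bytes, and the
// resulting structure supports iteration with no further seeking.
struct BitSet {
    bits: Vec<u64>,
}

impl BitSet {
    fn with_capacity(num_docs: u32) -> Self {
        BitSet { bits: vec![0; (num_docs as usize + 63) / 64] }
    }
    fn insert(&mut self, doc: u32) {
        self.bits[(doc / 64) as usize] |= 1u64 << (doc % 64);
    }
    fn contains(&self, doc: u32) -> bool {
        self.bits[(doc / 64) as usize] & (1u64 << (doc % 64)) != 0
    }
}

fn precompute_bitset(num_docs: u32, posting_lists: &[Vec<u32>]) -> BitSet {
    let mut bitset = BitSet::with_capacity(num_docs);
    for postings in posting_lists {
        for &doc in postings {
            bitset.insert(doc);
        }
    }
    bitset
}
```

The union happens eagerly at weight-construction time, so the downstream DocSet is a single forward scan over the bits regardless of how many terms contributed.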

PSeitz · Oct 27 '25 08:10