Add a fast-field variant of TermSet
The TermSet query currently produces one Scorer/DocSet per matched term by scanning the term dictionary and then consuming posting lists. When the set of terms is very large and the field has a fast-field column, it is faster to scan the fast field while intersecting its values with a HashSet of (encoded) term values.
Following the pattern set by the two execution modes of RangeQuery, this PR introduces a variant of TermSet which uses fast fields, and then uses it when there are more than 1024 input terms (an arbitrary threshold!).
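The core of the fast-field execution mode is just a column scan against a hash set. A minimal sketch of the idea, where `column` is a stand-in for a fast-field column of encoded `u64` term values (one per document) rather than tantivy's real columnar API:

```rust
use std::collections::HashSet;

// Sketch only: `column` simulates a dense fast-field column holding one
// encoded u64 value per document. The real variant would read the column
// through tantivy's columnar reader instead of a slice.
fn scan_fast_field(column: &[u64], wanted: &HashSet<u64>) -> Vec<u32> {
    column
        .iter()
        .enumerate()
        // Keep a doc id whenever its column value is in the term set.
        .filter(|&(_, value)| wanted.contains(value))
        .map(|(doc, _)| doc as u32)
        .collect()
}

fn main() {
    let column = vec![10, 42, 7, 42, 99, 10];
    let wanted: HashSet<u64> = [42, 99].into_iter().collect();
    let hits = scan_fast_field(&column, &wanted);
    println!("{:?}", hits);
}
```

Unlike the per-term Scorer approach, the cost here is one sequential pass over the column plus O(1) hash lookups, independent of how many terms are in the set.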
Performance is significantly improved for large TermSets of primitives.
I think a precomputed BitSet should also work well, similar to what I did in regex_phrase_weight.
Hm, interesting.
That could be implemented using either the posting lists or fast fields, I assume? So it seems like it's independent of which index we use.
For very high term counts, I think that you'll still want to scan the fast field, since it is so dense? Requires benchmarking probably.
The BitSet would be precomputed on the inverted index. I think we would need a benchmark to justify the additional code. (also good to know :)
I think we would need a benchmark to justify the additional code.
FWIW: The benchmarks that I did in ParadeDB for this PR are over here: https://github.com/paradedb/paradedb/pull/3412 ... for large term sets, consuming the posting lists uses a very large amount of memory, and does a lot of seeking: this implementation is about 7 times faster in one worker process. Can ignore the "multiple workers as pessimization" aspect: that's specific to having a huge query in ParadeDB.
I mean comparing the BitSet variant (via BitSetDocSet) with the fastfield variant.
In the BitSet variant we would fill the BitSet one by one from the posting lists, which would fix the memory usage to num_docs/8 bytes and remove all seeking.
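To make the memory argument concrete, here is a hedged sketch of that variant (not tantivy's actual `BitSet`/`BitSetDocSet` API): each term's posting list is drained into a single bitset of `num_docs` bits, so memory stays at num_docs/8 bytes no matter how many terms the set contains.

```rust
// Sketch only: `postings` simulates decoded posting lists (sorted doc ids
// per term). Filling one shared bitset costs num_docs/8 bytes total and
// replaces per-term seeking with sequential writes.
fn fill_bitset(num_docs: u32, postings: &[Vec<u32>]) -> Vec<u64> {
    let mut bits = vec![0u64; (num_docs as usize + 63) / 64];
    for posting in postings {
        for &doc in posting {
            bits[(doc / 64) as usize] |= 1u64 << (doc % 64);
        }
    }
    bits
}

// Membership test the DocSet iteration would be built on.
fn contains(bits: &[u64], doc: u32) -> bool {
    bits[(doc / 64) as usize] & (1u64 << (doc % 64)) != 0
}

fn main() {
    let postings = vec![vec![0, 5, 9], vec![2, 5]];
    let bits = fill_bitset(16, &postings);
    println!("doc 5 matches: {}", contains(&bits, 5));
}
```

Whether this beats the fast-field scan likely depends on term count and density, hence the benchmarking question above.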