search-benchmark-game
Add filtering commands.
These new commands allow running queries against a filter that matches 1% or 10% of documents.
Filters are interesting because some optimizations that are easy/obvious for exhaustive evaluation become more complicated when a filter is applied. Yet filters are common; think of an e-commerce search filtered by category, for instance.
Only the Lucene 10.0 engine supports filtering for now, because I'm not too familiar with Rust, but I assume it should be easy to add support to the Tantivy engine.
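As a side note on the "1% or 10% of documents" densities mentioned above: one simple way such filters can be produced is by tagging documents deterministically from their id. This is only an illustrative sketch (the `filter_tag` helper and the modulo scheme are my own, not taken from the PR):

```python
def filter_tag(doc_id: int, one_in: int) -> bool:
    # Deterministic tagging: exactly one doc in every `one_in` matches,
    # e.g. one_in=100 -> 1% density, one_in=10 -> 10% density.
    return doc_id % one_in == 0

# Over any corpus whose ids are dense integers, the match rate is exact.
density_1pct = sum(filter_tag(i, 100) for i in range(10_000)) / 10_000
density_10pct = sum(filter_tag(i, 10) for i in range(10_000)) / 10_000
```

A deterministic scheme like this keeps the benchmark reproducible across engines, since every engine indexes the exact same filter field.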
Thanks for the PR, filtering is a great addition.
I think the query side should be handled via the query list, with an added tag. The commands are more like different collectors.
This is how I started, but I would really like to see how all queries perform when a filter is applied, not just a few of them, and duplicating all queries didn't feel right (especially if duplicated twice, once for the 1% filter and again for the 10% filter). It would also make it annoying to add more queries to the benchmark.
So my second idea was to add it as another dimension to the benchmark (one dimension being the command (=collector), another the query, and another the filter density), but that felt a bit over-engineered. So I arrived at this third approach of coupling it with the command, which didn't feel great at first, but now feels to me like the least bad approach?
I think duplicating should be fine, but we could have it in code when loading the queries. This has the advantage that you can easily get an overview and compare the different results with a single run.
We may add a FILTER_TAG option or similar to filter queries with certain tags.
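To make the tag idea concrete, here is a small sketch of what tag-based query selection could look like. The query format (comma-separated tags, then the query text), the tag names, and the `FILTER_TAG` environment variable are all hypothetical, following the suggestion above rather than any existing code in the repo:

```python
import os

# Hypothetical tagged query list: (tags, query text) pairs.
RAW_QUERIES = [
    ("TOP_10,FILTER_1PCT", "the who"),
    ("TOP_10", "barack obama"),
    ("TOP_10,FILTER_10PCT", "search engine"),
]

def select_queries(queries, filter_tag):
    """Keep only queries carrying `filter_tag`; an empty tag keeps all,
    so a single run can still cover every variant."""
    if not filter_tag:
        return [text for _, text in queries]
    return [text for tags, text in queries if filter_tag in tags.split(",")]

# FILTER_TAG is the (proposed, not yet existing) env var from the thread.
selected = select_queries(RAW_QUERIES, os.environ.get("FILTER_TAG", ""))
```

Duplicating queries in code at load time, as suggested above, would amount to expanding each base query into one tagged variant per filter density before this selection step.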
Another thing that's currently missing is searching on multiple fields, which is probably the much more common use case.
@PSeitz We'd also need to make sure the query language handles it though (filters should not impact scoring). It might be a pain.
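For reference, Lucene already distinguishes these cases: a `BooleanClause.Occur.FILTER` clause must match but contributes nothing to the score. Here is a toy Python sketch of that semantic (the `score` function and its term-overlap scoring are my own illustration, not the benchmark's actual scorer):

```python
def score(query_terms, doc_terms, filter_match):
    """Toy scorer where the filter gates matching but contributes
    nothing to the score, mirroring Lucene's Occur.FILTER semantics."""
    if not filter_match:
        return None  # document excluded by the filter
    # The score comes only from the scoring clause (here: term overlap),
    # so adding a filter never changes a matching document's score.
    return len(set(query_terms) & set(doc_terms))
```

Whatever syntax the query language ends up using for filters, the key invariant is the one above: a document's score with the filter applied equals its score without it, for every document the filter lets through.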
I'd go with @jpountz solution for simplicity.
FWIW I have started looking at applying @PSeitz 's approach: https://github.com/quickwit-oss/search-benchmark-game/compare/master...jpountz:search-benchmark-game:filtered_queries?expand=1 in case you want to take a look (the query parsing bits are still missing).
Good point. I think the complexity should be the same for both: either way we need special query handling to pass in the filter. But the queries approach should be easier to use and would require just one run instead of 3 or 4 to get the full picture.