tantivy icon indicating copy to clipboard operation
tantivy copied to clipboard

New collector - `DistinctFilterCollector`

Open ChillFish8 opened this issue 11 months ago • 2 comments

Opening this issue because it is probably worth some discussion and wonder if you have any ideas or possible solutions to the issue.

Premise

Often when running queries, you may want to deduplicate returned results and exclude them from being considered in things like the TopK collector, this is especially prevalent when working with mutable data within Tantivy (where you might have a primary key), as often it is more efficient to lazily delete old documents rather than greedily executing a delete every time you receive a new document.

Unfortunately, this operation is difficult for users to implement unless they know quite a bit about how collectors work and the collector itself is quite complicated with a few issues to discuss (further down.)

The DistinctFilterCollector would act similarly to the FilterCollector but specifically only allow documents with values it hasn't already seen go through.

Possible syntax

let collector = DistinctFilterCollector::by_fields(vec![primary_key_field], TopDocs::with_limit(10));
let top_docs = searcher.search(&my_query,  &collector);

In an ideal world, I'd also like this collector to return both the Collector::Fruit of the inner collector it wraps and a set of document addresses it filtered out, this would allow users a bit more flexibility around the application of this collector (for example, removing now "old" documents.)

This would make the syntax closer to:

let collector = DistinctFilterCollector::by_fields(vec![primary_key_field], TopDocs::with_limit(10));
let (top_docs, ignored_docs) = searcher.search(&my_query,  &collector);  // (TopDocs::Fruit, Vec<DocAddress>)

Problems

  • From my understanding of collectors, this collector would only be able to work on a first-come-first-served basis.
  • If you wanted to the collector to behave by taking the most recent distinct document rather than first-come-first-served, segments would need to always be searched consistently and merges would also need to maintain this order (not sure if merges do this already or not.)

ChillFish8 avatar Jan 05 '25 21:01 ChillFish8

If you wanted to the collector to behave by taking the most recent distinct document rather than first-come-first-served, segments would need to always be searched consistently and merges would also need to maintain this order (not sure if merges do this already or not.)

I would... not care. As you mention, merges make it more or less impossible.

fulmicoton avatar Jan 06 '25 09:01 fulmicoton

It may be interesting to look at the way lucene does it

https://lucene.apache.org/core/8_0_0/grouping/org/apache/lucene/search/grouping/DistinctValuesCollector.html

https://github.com/apache/lucene/blob/7425e43fa363f6665d5d363d032d3b1cb3242508/lucene/grouping/src/java/org/apache/lucene/search/grouping/DistinctValuesCollector.java#L35

Barre avatar Feb 05 '25 21:02 Barre