FastField Filtering

Open anshulgoel27 opened this issue 6 months ago • 6 comments

I am currently filtering records with a single-term TermQuery (boost of 0, fieldnorms disabled) placed in the MUST clause of a boolean query.

When I include the MUST clause on the fast field for filtering, the query takes around 60ms on an index containing 10 million documents. However, when I remove this MUST clause, the same query executes in less than 40ms.

Is there any optimization I can apply to reduce the latency impact of the fast field filter? Ideally, I would like the filter on the fast field to have minimal or no effect on the overall response time.

anshulgoel27 avatar Jun 04 '25 17:06 anshulgoel27

~Many production Quickwit setups are using ARM and it performs well. Maybe @PSeitz has a sharper insight on how well vectorization works on columnar data.~

(Sorry, my comment was meant for https://github.com/quickwit-oss/tantivy/issues/2643; I have no idea how it ended up being posted here.)

rdettai avatar Jun 10 '25 08:06 rdettai

Can you share more information about the query? Do you use the query parser or build the query yourself? Is your MUST condition a range query?

There are two ways to apply a filter: via the fast field (columnar storage) or via the inverted index. Which one is faster depends on several factors. Some optimizations may come from this PR: https://github.com/quickwit-oss/tantivy/pull/2538.

PSeitz avatar Jun 10 '25 08:06 PSeitz

@PSeitz

This is the query I am using. The filter is a Must clause on a single term, wrapped in a Const score of 0.

    BooleanQuery {
        subqueries: [
            (Must, Const(score = 0, query = TermQuery(Term(field = 15, type = U64, 85688637)))),
            (Must, Boost(
                query = BooleanQuery {
                    subqueries: [
                        (Should, Boost(query = TermQuery(Term(field = 17, type = U64, 90815)), boost = 0.11)),
                        (Should, Boost(query = TermQuery(Term(field = 0, type = Str, "5051")), boost = 0.367)),
                        (Should, Boost(query = TermQuery(Term(field = 6, type = Str, "5051")), boost = 1.102)),
                        (Should, BooleanQuery {
                            subqueries: [
                                (Must, Boost(query = TermQuery(Term(field = 2, type = Str, "102")), boost = 1.224)),
                                (Should, Boost(query = TermQuery(Term(field = 1, type = Str, "apt")), boost = 0.157))
                            ],
                            minimum_number_should_match: 0
                        }),
                        (Should, Boost(query = TermQuery(Term(field = 3, type = Str, "5051")), boost = 1.102)),
                        (Should, Boost(query = TermQuery(Term(field = 6, type = Str, "garford")), boost = 2.4569998)),
                        (Should, Boost(query = TermQuery(Term(field = 6, type = Str, "street")), boost = 0.091)),
                        (Should, Boost(query = TermQuery(Term(field = 6, type = Str, "st")), boost = 0.091)),
                        (Should, Boost(query = TermQuery(Term(field = 6, type = Str, "long")), boost = 2.372)),
                        (Should, Boost(query = TermQuery(Term(field = 6, type = Str, "beaches")), boost = 2.622))
                    ],
                    minimum_number_should_match: 1
                },
                boost = 5.0
            ))
        ],
        minimum_number_should_match: 0
    }

anshulgoel27 avatar Jun 10 '25 11:06 anshulgoel27

Do you use the fast field or the inverted index for your filter? The inverted index seems more suitable here

PSeitz avatar Jun 10 '25 12:06 PSeitz

@PSeitz I am using the fast field for the filter. It is a read-only index with 120 million records in total and 20 million records per segment.

    (Must, Const(score = 0, query = TermQuery(Term(field = 15, type = U64, 85688637))))

    {
        "name": "admin1_ff",
        "type": "u64",
        "options": {
            "indexed": true,
            "fieldnorms": false,
            "fast": true,
            "stored": false
        }
    },

Field 15 is a u64 field with fast and indexed enabled and fieldnorms disabled.

Do you think using the inverted index will be faster?

anshulgoel27 avatar Jun 10 '25 15:06 anshulgoel27

The fast option means a columnar index is created (the name "fast" is a misnomer); the indexed option creates the inverted index. Regular term queries, such as the one in your filter, already use the inverted index.

PSeitz avatar Jun 11 '25 14:06 PSeitz
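
Editorial illustration of the distinction drawn in the last comment: a minimal sketch in plain Rust (standard library only) of how a term filter is answered by an inverted index versus a columnar field. This is a toy model and not tantivy's implementation; `ToyInvertedIndex` and `ToyColumn` are hypothetical names invented for this example, and a real engine uses compressed posting lists, skip data, and other optimizations that are not modeled here.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for an inverted index on one u64 field (hypothetical type):
/// maps each term value to the sorted list of doc ids that contain it.
struct ToyInvertedIndex {
    postings: BTreeMap<u64, Vec<u32>>,
}

/// Toy stand-in for a columnar field (hypothetical type):
/// stores exactly one value per doc id, addressed by position.
struct ToyColumn {
    values: Vec<u64>,
}

impl ToyInvertedIndex {
    /// Builds the posting lists from one value per document.
    fn build(values: &[u64]) -> Self {
        let mut postings: BTreeMap<u64, Vec<u32>> = BTreeMap::new();
        for (doc_id, &value) in values.iter().enumerate() {
            postings.entry(value).or_default().push(doc_id as u32);
        }
        Self { postings }
    }

    /// Term filter: one map lookup, then only the matching doc ids are touched.
    fn filter(&self, term: u64) -> &[u32] {
        self.postings
            .get(&term)
            .map(|docs| docs.as_slice())
            .unwrap_or(&[])
    }
}

impl ToyColumn {
    /// Term filter: in this toy model every document's value is visited.
    fn filter(&self, term: u64) -> Vec<u32> {
        self.values
            .iter()
            .enumerate()
            .filter(|&(_, &value)| value == term)
            .map(|(doc_id, _)| doc_id as u32)
            .collect()
    }
}
```

In this toy model both structures return the same doc ids for a given term; they differ only in how much data is visited to produce them, which is one of the factors mentioned earlier in the thread.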