tantivy
FastField Filtering
I am currently using a single-term TermQuery with a boost of 0 and with fieldnorms disabled to filter records through a MUST clause in a boolean query.
When I include the MUST clause on the fast field for filtering, the query takes around 60ms on an index containing 10 million documents. However, when I remove this MUST clause, the same query executes in less than 40ms.
Is there any optimization I can apply to reduce the latency impact of the fast field filter? Ideally, I would like the filter on the fast field to have minimal or no effect on the overall response time.
~Many production Quickwit setups are using ARM and it performs well. Maybe @PSeitz has a sharper insight on how well vectorization works on columnar data.~
(sorry, my comment was meant for https://github.com/quickwit-oss/tantivy/issues/2643, I have no idea how it ended up being posted here)
Can you share more information about the query? Do you use the query parser or build the query yourself? Is your MUST condition a range query?
There are two ways to apply a filter: via the fast field (columnar storage) or via the inverted index. Which one is faster depends on several factors. Some optimizations may come from this PR: https://github.com/quickwit-oss/tantivy/pull/2538.
@PSeitz
This is the query I am using. The filter is a Must clause on a single term with a const score of 0.
BooleanQuery {
  subqueries: [
    ( Must, Const( score = 0, query = TermQuery( Term(field = 15, type = U64, 85688637) ) ) ),
    ( Must, Boost(
      query = BooleanQuery {
        subqueries: [
          ( Should, Boost( query = TermQuery( Term(field = 17, type = U64, 90815) ), boost = 0.11 ) ),
          ( Should, Boost( query = TermQuery( Term(field = 0, type = Str, "5051") ), boost = 0.367 ) ),
          ( Should, Boost( query = TermQuery( Term(field = 6, type = Str, "5051") ), boost = 1.102 ) ),
          ( Should, BooleanQuery {
            subqueries: [
              ( Must, Boost( query = TermQuery( Term(field = 2, type = Str, "102") ), boost = 1.224 ) ),
              ( Should, Boost( query = TermQuery( Term(field = 1, type = Str, "apt") ), boost = 0.157 ) )
            ],
            minimum_number_should_match: 0
          } ),
          ( Should, Boost( query = TermQuery( Term(field = 3, type = Str, "5051") ), boost = 1.102 ) ),
          ( Should, Boost( query = TermQuery( Term(field = 6, type = Str, "garford") ), boost = 2.4569998 ) ),
          ( Should, Boost( query = TermQuery( Term(field = 6, type = Str, "street") ), boost = 0.091 ) ),
          ( Should, Boost( query = TermQuery( Term(field = 6, type = Str, "st") ), boost = 0.091 ) ),
          ( Should, Boost( query = TermQuery( Term(field = 6, type = Str, "long") ), boost = 2.372 ) ),
          ( Should, Boost( query = TermQuery( Term(field = 6, type = Str, "beaches") ), boost = 2.622 ) )
        ],
        minimum_number_should_match: 1
      },
      boost = 5.0
    ) )
  ],
  minimum_number_should_match: 0
}
Do you use the fast field or the inverted index for your filter? The inverted index seems more suitable here
@PSeitz I am using the fast field for the filter. It is a read-only index with 120 million records in total and 20 million records per segment.
( Must, Const( score = 0, query = TermQuery( Term(field = 15, type = U64, 85688637) ) ) )
{
"name": "admin1_ff",
"type": "u64",
"options": {
"indexed": true,
"fieldnorms": false,
"fast": true,
"stored": false
}
},
Field 15 is a u64 field with fast and indexed enabled and fieldnorms disabled.
Do you think using the inverted index will be faster?
fast means a columnar index is created (fast is a misnomer); indexed creates the inverted index. The inverted index is already used for regular term queries, so the TermQuery on field 15 is already served by the inverted index, not by the fast field.
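To make the distinction concrete, here is a toy, std-only sketch of the two access paths. It is not tantivy's implementation: the function names (build_inverted, filter_inverted, filter_column) are hypothetical, and the data is made up for illustration. It only shows that a term filter on an inverted index is a direct lookup of a precomputed doc-id list, while the same filter on a column has to visit every document's value.

```rust
use std::collections::BTreeMap;

/// Toy inverted index: value -> sorted list of doc ids holding that value.
/// (Hypothetical helper, for illustration only.)
fn build_inverted(column: &[u64]) -> BTreeMap<u64, Vec<u32>> {
    let mut postings: BTreeMap<u64, Vec<u32>> = BTreeMap::new();
    for (doc, &value) in column.iter().enumerate() {
        postings.entry(value).or_default().push(doc as u32);
    }
    postings
}

/// Term filter via the inverted index: one lookup, then the matching
/// doc ids are already materialized.
fn filter_inverted(postings: &BTreeMap<u64, Vec<u32>>, value: u64) -> Vec<u32> {
    postings.get(&value).cloned().unwrap_or_default()
}

/// Term filter via the column: every document's value is inspected.
fn filter_column(column: &[u64], value: u64) -> Vec<u32> {
    column
        .iter()
        .enumerate()
        .filter_map(|(doc, &v)| (v == value).then_some(doc as u32))
        .collect()
}

fn main() {
    // One value per document; the index in the vector is the doc id.
    let column = vec![7u64, 3, 7, 9, 3, 7];
    let postings = build_inverted(&column);

    let via_index = filter_inverted(&postings, 7);
    let via_column = filter_column(&column, 7);

    // Both paths select the same documents; only the work done differs.
    assert_eq!(via_index, via_column);
    assert_eq!(via_index, vec![0, 2, 5]);
    println!("docs matching 7: {:?}", via_index);
}
```

In this toy model the column scan costs time proportional to the number of documents, while the index lookup costs time proportional to the number of matches. Real engines add compression, skipping, and intersection with the other clauses, so which path wins in practice depends on the factors mentioned earlier in the thread.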