tantivy
Fast field on string fields
Is your feature request related to a problem? Please describe.
Basically a fast-field for strings. Columnar storage.
Describe the solution you'd like
If you have a TermDictionary for the field, you should be able to use the same code for fast-fields-uint64. You just store the term ordinal as fast-field value, and use it to get the real term string from the dictionary.
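As a rough illustration of the idea above, here is a minimal std-only sketch. It does not use tantivy's API: `StrColumn`, `build`, `ord` and `term` are hypothetical names, a sorted `Vec<String>` stands in for the TermDictionary, and a `Vec<u64>` stands in for the u64 fast field.

```rust
/// Hypothetical per-segment string column stored as term ordinals.
/// `terms` stands in for the term dictionary (sorted, deduplicated);
/// `ordinals` stands in for the u64 fast field, one entry per document.
pub struct StrColumn {
    pub terms: Vec<String>,
    pub ordinals: Vec<u64>,
}

impl StrColumn {
    /// Builds the column from exactly one string value per document.
    pub fn build(values: &[&str]) -> StrColumn {
        let mut terms: Vec<String> = values.iter().map(|v| v.to_string()).collect();
        terms.sort();
        terms.dedup();
        // Each document stores the position of its term in the sorted dictionary.
        let ordinals = values
            .iter()
            .map(|&v| terms.binary_search_by(|t| t.as_str().cmp(v)).unwrap() as u64)
            .collect();
        StrColumn { terms, ordinals }
    }

    /// Fast-field style access: the term ordinal of a document.
    pub fn ord(&self, doc: usize) -> u64 {
        self.ordinals[doc]
    }

    /// Dictionary lookup: resolves an ordinal back to the real term string.
    pub fn term(&self, ord: u64) -> &str {
        &self.terms[ord as usize]
    }
}
```

With values such as `["bob", "alice", "bob"]`, the repeated "bob" is stored once in the dictionary and twice as a small integer in the column, which is why this layout suits heavily repeated values.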
[Optional] describe alternatives you've considered
Apache Arrow (they probably do the same thing underneath).
Makes sense ?
Yes it makes sense. It's actually very little work, as the hack exists already for facets.
Note that you also have &[u8] fast fields, if your strings tend to be unique.
In order to assess what is the priority for this, do you personally have a use case for it?
This is needed to get the value from fast-fields and not use stored.
In cases where the primary key is multi-field (user_id, timestamp), this should be able to get/store user_id very efficiently (since it'll be repeated a lot).
And if you sort the index by timestamp, should do the same with it. (needs index sorting..)
But will also need something like global_ordinals that es does: https://www.elastic.co/guide/en/elasticsearch/reference/master/eager-global-ordinals.html ?
global_ordinals are extremely complicated, and have some side effects.
Also they are very likely not required for your use case.
You can still do top K using term ordinal on a per-segment basis, and merge the top-Ks by resolving term-ord -> term K times per segment.
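The per-segment approach described here can be sketched as follows. This is a std-only illustration under assumptions, not tantivy code: `Segment`, `segment_top_k` and `merged_top_k` are hypothetical names, and "top K" is taken to mean the K smallest terms in lexicographic order.

```rust
/// Hypothetical segment: a sorted term dictionary plus one term ordinal per doc.
pub struct Segment {
    pub terms: Vec<String>,
    pub ordinals: Vec<u64>,
}

/// Top K docs of one segment, ranked by term ordinal only.
/// Ordinals follow term order inside a segment, so no string is compared here;
/// only the K winners are resolved to strings at the end.
pub fn segment_top_k(seg: &Segment, k: usize) -> Vec<(String, usize)> {
    let mut docs: Vec<(u64, usize)> = seg
        .ordinals
        .iter()
        .enumerate()
        .map(|(doc, &ord)| (ord, doc))
        .collect();
    docs.sort();
    docs.truncate(k);
    docs.into_iter()
        .map(|(ord, doc)| (seg.terms[ord as usize].clone(), doc))
        .collect()
}

/// Merges the per-segment top Ks. Ordinals from different segments are not
/// comparable, so the merge compares the resolved strings instead.
/// Returns (term, segment id, doc id) tuples.
pub fn merged_top_k(segments: &[Segment], k: usize) -> Vec<(String, usize, usize)> {
    let mut all = Vec::new();
    for (seg_id, seg) in segments.iter().enumerate() {
        for (term, doc) in segment_top_k(seg, k) {
            all.push((term, seg_id, doc));
        }
    }
    all.sort();
    all.truncate(k);
    all
}
```

Each segment resolves at most K ordinals to strings, so the cost of the dictionary lookups stays bounded regardless of segment size. A real collector would likely use a bounded heap instead of a full sort.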
On average, how many times is user_id repeated in a given segment?
From full-segment with 1 user_id, to minimum ~150 same values.
Even based on your blog post that you wanted to index open-crawl, globally sorting the urls, will give you fast-field on domain,tld,subdomain,path very efficiently.
Even based on your blog post that you wanted to index open-crawl, globally sorting the urls, will give you fast-field on domain,tld,subdomain,path very efficiently.
- globally sorted urls will probably not happen any time soon. This feature is complicated and comes at a cost.
- sorting the index by a given field is a great feature. There is a ticket about that... but again, this won't happen any time soon, because we do not have a user of that scale yet.
Now let's keep the discussion focused on your feature request.
From full-segment with 1 user_id, to minimum ~150 same values.
I don't understand this sentence. What does full segment mean?
Again this feature is useful... It will likely happen eventually, but I need to assess if it should be high priority or not, and that depends on whether you have an actual use case or not.
Is your use case large enough to make &[u8] fast fields or stored fields not efficient enough ?
globally sorted urls will probably not happen any time soon. This feature is complicated and comes at a cost.
Globally sorted happens outside of tantivy (think range sharding vs hash sharding in es, while lucene doesn't care). Index sorting happens at segment-level.
sorting the index by a given field is a great feature. There is a ticket about that... but again, this won't happen any time soon, because we do not have a user of that scale yet.
Makes sense.
I don't understand this sentence. What does full segment mean?
All docs in the segment have the same user_id.
Again this feature is useful... It will likely happen eventually, but I need to assess if it should be high priority or not, and that depends on whether you have an actual use case or not. Is your use case large enough to make &[u8] fast fields or stored fields not efficient enough ?
It depends on what you intend tantivy to be, and how it should differentiate itself. If it's performance, then yes. If not performance, then no.
My idea would be to focus on performance, so some high-scale user comes in and contributes. Many scalable sites either hack lucene or build their own c++ version. Anything else, they can go lucene/es/solr.
API
STRING takes the untokenized content of a field value ("raw" tokenizer), e.g. "Cool Nice" -> "Cool Nice"
TEXT tokenizes the content of a field value with the "default" tokenizer, e.g. "Cool Nice" -> "Cool" "Nice"
I would start to add support for FAST on STRING fields, since fast fields on tokenized text is not what the user needs usually (https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html#before-enabling-fielddata)
Scope
Main usage would be aggregation on terms and sorting by a field.
Limitations
- No FAST support for any tokenizer other than the "raw" tokenizer
- In some scenarios you want to have text tokenized + FAST on STRING, e.g. book authors. The names should be tokenized, but aggregation would be on the "String" value.
Alternative API
We could introduce a Keyword type like elastic does, which would roughly equal STRING + FAST. I'm not sure if this improves the API.
Multimappings would also make sense in that context, e.g. to have Text + Keyword on one field. That would certainly be a win in some scenarios, e.g. the Author use case.
Internally
Untokenized fields would mean we have a 1:1 cardinality and we can store everything in a single value u64 fast field.
To reduce space usage on sparse string values, a null value should be assigned 0 in the fast field. That means the algorithm that assigns unordered_term_ids and term_ids should start at 1.
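A minimal sketch of this null-as-0 convention, under stated assumptions: `OptStrColumn`, `build_opt_column` and `get_term` are hypothetical names, a sorted `Vec<String>` stands in for the term dictionary, and nothing here reflects how tantivy actually assigns term ids.

```rust
/// Hypothetical single-value string column where id 0 is reserved for "no value".
pub struct OptStrColumn {
    pub terms: Vec<String>, // sorted, deduplicated dictionary
    pub ids: Vec<u64>,      // one id per doc: 0 = null, otherwise dictionary position + 1
}

/// Builds the column from one optional string per document.
pub fn build_opt_column(values: &[Option<&str>]) -> OptStrColumn {
    let mut terms: Vec<String> = values.iter().flatten().map(|v| v.to_string()).collect();
    terms.sort();
    terms.dedup();
    let ids = values
        .iter()
        .map(|v| match *v {
            // Missing values all share the reserved id 0.
            None => 0,
            // Real terms are shifted by one so that they start at 1.
            Some(s) => terms.binary_search_by(|t| t.as_str().cmp(s)).unwrap() as u64 + 1,
        })
        .collect();
    OptStrColumn { terms, ids }
}

/// Resolves a document to its term, undoing the +1 shift.
pub fn get_term(col: &OptStrColumn, doc: usize) -> Option<&str> {
    match col.ids[doc] {
        0 => None,
        id => Some(col.terms[(id - 1) as usize].as_str()),
    }
}
```

In a sparse column most ids are 0, which a bit-packed or otherwise compressed u64 column can store cheaply.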
Reuse Facets?
The Facets type in tantivy is similar to what we need, since it creates a fast field with the term ids. But it uses a multi value fast field, which makes sense for (hierarchical) facets, since they are inherently multi valued.
Multi value fast fields have two indices, the doc index pointing to the values range and the actual values.
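To make the two-index layout concrete, here is a std-only sketch. `MultiValueColumn`, `from_docs` and `get` are hypothetical names, and the offsets-plus-values layout is an assumption based on the description above, not a copy of tantivy's on-disk format.

```rust
/// Hypothetical multi-value column made of two indices.
pub struct MultiValueColumn {
    /// Doc index: doc `d` owns `values[offsets[d]..offsets[d + 1]]`.
    pub offsets: Vec<u64>,
    /// The actual values, concatenated in doc order.
    pub values: Vec<u64>,
}

impl MultiValueColumn {
    /// Builds both indices from the per-document value lists.
    pub fn from_docs(docs: &[Vec<u64>]) -> MultiValueColumn {
        let mut offsets = vec![0u64];
        let mut values = Vec::new();
        for doc_values in docs {
            values.extend_from_slice(doc_values);
            offsets.push(values.len() as u64);
        }
        MultiValueColumn { offsets, values }
    }

    /// Two lookups: first the range in `offsets`, then the slice in `values`.
    pub fn get(&self, doc: usize) -> &[u64] {
        let start = self.offsets[doc] as usize;
        let end = self.offsets[doc + 1] as usize;
        &self.values[start..end]
    }
}
```

A single-value column needs only the `values` part, indexed directly by doc id, which is the one-lookup, one-index case the cons below compare against.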
As mentioned above, typical use case is single valued, so as a tendency I would not leverage facet reuse here.
Facet Cons
- Performance overhead; two lookups instead of one
- Space overhead: two indices instead of one
- Using Facet is not straightforward, e.g. the string has to be converted to a Facet path ("/" + string)
Facet Pros
- There is existing tested code
- No special logic required for Null values
- Extending support for TEXT | FAST is easier. (Although niche use case)
Performance overhead; two lookups instead of one
Considering our docvalue codec is already dynamic couldn't we remove the extra cost? If a fastfield is multivalued according to the schema but only contains single valued, it could be encoded as single value?
Or quite the opposite, should we introduce the notion of cardinality in tantivy?
A field would be optional, required or repeated like for protobuf.
One benefit would be a better json serialization.
You did not tell how you wanted to enforce the absence of tokenization. Is it a panic at runtime? This is also a feature in quickwit. Can you open a ticket in quickwit too and try to write a similar document?
Considering our docvalue codec is already dynamic couldn't we remove the extra cost? If a fastfield is multivalued according to the schema but only contains single valued, it could be encoded as single value?
The dynamic codec is currently only dynamic one layer below: multi value fields have two dynamically encoded indices (index and data), and single value fields have one, on the values.
We could detect multi value fields that only contain single values, but there is also a behavioral difference currently: val_if_missing needs to be set upfront and only exists on single value fast fields.
There's also some overhead when reading multiple values into a vec instead of reading a single one by value directly.
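The detection mentioned above could look roughly like the following sketch. It assumes the doc index is a list of offsets where doc `d` owns the range `offsets[d]..offsets[d + 1]`; `is_single_valued` is a hypothetical helper, not an existing tantivy function.

```rust
/// Returns true if every document owns exactly one value, i.e. the
/// multi-value column could in principle be encoded as a single-value one.
/// `offsets` has one entry per doc plus a trailing end offset.
/// Note: a column with no documents trivially returns true.
pub fn is_single_valued(offsets: &[u64]) -> bool {
    offsets.windows(2).all(|w| w[1] - w[0] == 1)
}
```

A document with zero values makes the check fail, which is where the val_if_missing difference shows up: the single-value encoding would need some default for that document.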
Or quite the opposite, should we introduce the notion of cardinality in tantivy? A field would be optional, required or repeated like for protobuf. One benefit would be a better json serialization.
I like the idea, but this seems to be out of scope of this issue, since we can have multi values on a non-repeated field due to tokenization.
You did not tell how you wanted to enforce the absence of tokenization. Is it a panic at runtime?
Do you mean with FAST and indexing_options set to None? Either return an error, or activate indexing automatically with FAST.
I prefer to always have indexing with the raw tokenizer automatically.
This is also a feature in quickwit. Can you open a ticket in quickwit too and try to write a similar document?
Yes, although usage in the docmapper much depends on the api in tantivy
Implemented in https://github.com/quickwit-oss/tantivy/pull/1325