PSeitz
It would be nice to remove the `BoxTokenStream` allocation per text and use the `Tokenizer` directly, e.g. call `set_text` on the `Tokenizer` and then pull the tokens from it directly. A sketch of what I mean is below.
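For illustration, a minimal sketch of how such a reuse-oriented interface could look (the trait and method names here are assumptions, not tantivy's actual API):

```rust
/// Simplified stand-in for tantivy's `Token` (the real struct has
/// more fields, e.g. byte offsets and position).
#[derive(Default)]
pub struct Token {
    pub text: String,
}

/// Hypothetical allocation-free tokenizer interface: instead of
/// returning a fresh `BoxTokenStream` per text, the caller resets
/// the tokenizer in place and pulls tokens from it directly.
pub trait ReusableTokenizer {
    /// Point the tokenizer at a new text, reusing internal buffers.
    fn set_text(&mut self, text: &str);
    /// Advance to the next token; `None` once the text is exhausted.
    fn next_token(&mut self) -> Option<&Token>;
}

fn tokenize_all<T: ReusableTokenizer>(tokenizer: &mut T, texts: &[&str]) {
    for text in texts {
        tokenizer.set_text(text); // no per-text heap allocation
        while let Some(token) = tokenizer.next_token() {
            println!("{}", token.text);
        }
    }
}
```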
The bullet point `Natural query language` in the feature list is probably misleading. You can't phrase questions in natural language like that in tantivy (at least not without customizing or additional...
Related issue https://github.com/quickwit-oss/tantivy/issues/1041
lz4 only uses duplicates for compressing data (no Huffman or ANS entropy coding like zstd):

```bash
➜ blub git:(main) ✗ lz4 datasets/split/346cb77c09e04022aee6c49077dbc821.idx
Compressed filename will be: datasets/split/346cb77c09e04022aee6c49077dbc821.idx.lz4
Compressed 183824037 bytes into 147904079...
```
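To reproduce the comparison from Rust rather than the lz4 CLI, a small sketch (the file path is a placeholder, and the `lz4_flex`/`zstd` crate choices and compression level are my assumptions):

```rust
// Cargo.toml: lz4_flex = "0.11", zstd = "0.13"
use std::fs;

fn main() -> std::io::Result<()> {
    // Placeholder path; point this at an actual split file.
    let data = fs::read("datasets/split/example.idx")?;

    // lz4: duplicate elimination (LZ77-style matches) only.
    let lz4 = lz4_flex::compress_prepend_size(&data);
    // zstd: duplicate elimination plus entropy coding
    // (Huffman and FSE, a tabled ANS variant).
    let zst = zstd::encode_all(&data[..], 3)?;

    println!("original: {} bytes", data.len());
    println!("lz4:      {} bytes", lz4.len());
    println!("zstd:     {} bytes", zst.len());
    Ok(())
}
```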
Some more data: the percentage of 4-byte pairs, scanned in 1-byte steps. Interestingly, the same pattern (more than 10%) can be observed on `.idx`, but not on `.pos`, between github...
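For reference, a sketch of the kind of measurement I mean here (whether this exact counting produced the numbers above is an assumption):

```rust
use std::collections::HashMap;

/// Percentage of 4-byte windows (advanced one byte at a time) that
/// occur more than once in `data`.
fn duplicate_window_pct(data: &[u8]) -> f64 {
    let mut counts: HashMap<[u8; 4], u64> = HashMap::new();
    for w in data.windows(4) {
        *counts.entry([w[0], w[1], w[2], w[3]]).or_insert(0) += 1;
    }
    let total: u64 = counts.values().sum(); // = data.len() - 3 windows
    if total == 0 {
        return 0.0;
    }
    let dup: u64 = counts.values().filter(|&&c| c > 1).sum();
    100.0 * dup as f64 / total as f64
}
```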
Yes, having more datasets would be nice. Geo data is probably a little special, since it ideally belongs in its own index that allows geo queries.
You can just add the field multiple times in the `Document`:

```rust
index_writer.add_document(doc!(
    date_field => DateTime::from_timestamp_secs(1000),
    date_field => DateTime::from_timestamp_secs(1001),
))?;
```
I think we should change that to:

```rust
pub struct TokenizerManager {
    tokenizers: ArcSwap,
}
```

while `TextAnalyzer` would actually be a `TokenizerBuilder`.
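A rough sketch of what that split could look like (the `ArcSwap` payload type and the method bodies are assumptions for illustration):

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arc_swap::ArcSwap; // arc_swap crate

/// Placeholder for the existing type; per the comment above it
/// would effectively act as a builder for the tokenizer pipeline.
#[derive(Clone)]
pub struct TextAnalyzer {}

pub struct TokenizerManager {
    // Assumed payload: name -> ready-built analyzer, swapped
    // copy-on-write so lookups never take a lock.
    tokenizers: ArcSwap<HashMap<String, Arc<TextAnalyzer>>>,
}

impl TokenizerManager {
    pub fn get(&self, name: &str) -> Option<Arc<TextAnalyzer>> {
        self.tokenizers.load().get(name).cloned()
    }

    pub fn register(&self, name: &str, analyzer: TextAnalyzer) {
        let current = self.tokenizers.load();
        let mut map: HashMap<String, Arc<TextAnalyzer>> = (**current).clone();
        map.insert(name.to_string(), Arc::new(analyzer));
        self.tokenizers.store(Arc::new(map));
    }
}
```

Note that the load-clone-store in `register` can lose updates under concurrent registration; `ArcSwap::rcu` would be the safer way to do the copy-on-write update there.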
The parser doesn't handle this currently, but this should work: `cart.product_id:
Indeed, it's disabled. I don't think there's an inherent reason, apart from some missing code to handle it. @fulmicoton?