PSeitz
You need to add the [StopWordFilter](https://docs.rs/tantivy/0.21.1/tantivy/tokenizer/struct.StopWordFilter.html) to your tokenizer.
I don't think this is properly documented.
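As a minimal sketch of what such a filter does (plain Rust with hypothetical names, not tantivy's actual implementation): tokens that appear in the stop-word list are dropped from the token stream before indexing. tantivy's `StopWordFilter` plugs into the analyzer chain to do exactly this.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for a stop-word filter stage: drop every
// token that appears in the stop list, keep the rest in order.
fn stop_word_filter<'a>(tokens: Vec<&'a str>, stop_words: &HashSet<&str>) -> Vec<&'a str> {
    tokens.into_iter().filter(|t| !stop_words.contains(t)).collect()
}

fn main() {
    let stop_words: HashSet<&str> = ["the", "a", "of"].into_iter().collect();
    let tokens = vec!["the", "quick", "brown", "fox"];
    let filtered = stop_word_filter(tokens, &stop_words);
    // "the" is removed, everything else passes through.
    assert_eq!(filtered, vec!["quick", "brown", "fox"]);
    println!("{:?}", filtered);
}
```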
> This is just a ping. I'm working on it; I created a dataset and a simple application that uses several GB of memory for computing some statistics. I'm using...
> Hello, I'd like to know how to proceed with this pull request; it has been pending for a while.

Sorry, I forgot about this PR. It's fine to ping...
> Merge is done. Do you have an example benchmark that also reports memory consumption? I ran `cargo bench` but saw no indication of memory consumption. Thanks!

Memory reporting...
https://github.com/quickwit-oss/tantivy/pull/2378 is now merged
The memory consumption increased by 10% for those two benchmarks. I don't think you should pay for extended stats if you are not using them.

```
terms_many_with_avg_sub_agg Memory: 29.0 MB...
```
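The "don't pay for what you don't use" idea can be sketched as follows (illustrative Rust with hypothetical names, not tantivy's actual code): the extended statistics live behind an `Option`, so a collector that was not asked for them allocates nothing extra and skips the extra arithmetic per document.

```rust
// Extra statistics that only some callers want.
#[derive(Default, Debug)]
struct ExtendedStats {
    sum_of_squares: f64,
}

#[derive(Default, Debug)]
struct StatsCollector {
    count: u64,
    sum: f64,
    extended: Option<ExtendedStats>, // None => no extra memory or work
}

impl StatsCollector {
    fn new(with_extended: bool) -> Self {
        StatsCollector {
            extended: with_extended.then(ExtendedStats::default),
            ..Default::default()
        }
    }

    fn collect(&mut self, val: f64) {
        self.count += 1;
        self.sum += val;
        // The extended work only happens when it was requested.
        if let Some(ext) = self.extended.as_mut() {
            ext.sum_of_squares += val * val;
        }
    }
}

fn main() {
    let mut basic = StatsCollector::new(false);
    basic.collect(2.0);
    assert!(basic.extended.is_none());

    let mut full = StatsCollector::new(true);
    full.collect(2.0);
    assert_eq!(full.extended.unwrap().sum_of_squares, 4.0);
}
```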
Here's a diff on the profile. There's some noise, but we can see the alloc overhead for the array (the array itself plus multiple strings), slower deserialization, and slower `value_from_json`. `9.61% +4.74%...`

- We could avoid the `Document` alloc/free overhead by just referencing the unserialized JSON text blocks (similar to [serde_json_borrow](https://github.com/PSeitz/serde_json_borrow)). It would require lifetimes on tantivy's `Value`, though.
- ...
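The borrowing idea above can be sketched like this (hypothetical types, not tantivy's or serde_json_borrow's actual API): instead of owning a `String` per field, the value type holds `&str` slices pointing into the original JSON text, so no per-value allocation happens, at the cost of the lifetime parameter mentioned above.

```rust
// A value that borrows from the unparsed JSON buffer instead of
// owning its own String -- this is what forces the lifetime 'a.
#[derive(Debug, PartialEq)]
enum Value<'a> {
    Str(&'a str),
    U64(u64),
}

// Borrow a string value out of the source text without copying it.
fn value_from_slice(src: &str, start: usize, end: usize) -> Value<'_> {
    Value::Str(&src[start..end])
}

fn main() {
    let json = r#"{"name":"hdfs"}"#;
    // Byte range of the string contents of "name".
    let v = value_from_slice(json, 9, 13);
    assert_eq!(v, Value::Str("hdfs"));
}
```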
https://github.com/quickwit-oss/tantivy/pull/2062 removes the `Token::default` creation for each text encountered, by having a shared one on the tokenizer.

```
index-hdfs/index-hdfs-no-commit
        time:   [644.80 ms 648.41 ms 652.51 ms]
        change: [-4.1369% -2.7259% -1.3721%]...
```
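The spirit of that change can be sketched as follows (simplified, hypothetical types rather than tantivy's actual ones): the tokenizer keeps a single `Token` and clears and refills it for each term, so the token's `String` buffer is reused instead of being freshly allocated per term.

```rust
#[derive(Default, Debug)]
struct Token {
    text: String,
    position: usize,
}

#[derive(Default)]
struct Tokenizer {
    shared_token: Token, // reused across all terms
}

impl Tokenizer {
    // Refill the shared token in place; once the String's capacity is
    // large enough, no new allocation happens per term.
    fn fill(&mut self, text: &str, position: usize) -> &Token {
        self.shared_token.text.clear();
        self.shared_token.text.push_str(text);
        self.shared_token.position = position;
        &self.shared_token
    }
}

fn main() {
    let mut tokenizer = Tokenizer::default();
    {
        let tok = tokenizer.fill("hello", 0);
        assert_eq!(tok.text, "hello");
    }
    // Same Token instance, refilled with the next term.
    let tok = tokenizer.fill("world", 1);
    assert_eq!(tok.text, "world");
    assert_eq!(tok.position, 1);
}
```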