Tokenizer API allocations
Currently the tokenizer API generates a lot of allocations.
For every text value encountered, TextAnalyzer::token_stream() is called:
impl TextAnalyzer {
    /// Creates a token stream for a given `str`.
    pub fn token_stream<'a>(&self, text: &'a str) -> BoxTokenStream<'a> {
        self.tokenizer.box_token_stream(text)
    }
}
A boxed token stream typically creates a Token, and the default Token allocates a String with a capacity of 200 bytes:
impl Default for Token {
    fn default() -> Token {
        Token {
            offset_from: 0,
            offset_to: 0,
            position: usize::MAX,
            text: String::with_capacity(200),
            position_length: 1,
        }
    }
}
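To make the cost concrete, here is a minimal sketch of how one Token could be reused across texts instead of being created per stream. The `Token` below is a local stand-in that mirrors the fields shown above; `reset` and `fill_first_word` are hypothetical helpers written for this illustration, not necessarily tantivy's API. It relies only on the fact that `String::clear` keeps the allocated capacity.

```rust
/// Local stand-in mirroring the fields of the `Token` shown above.
#[derive(Debug)]
struct Token {
    offset_from: usize,
    offset_to: usize,
    position: usize,
    text: String,
    position_length: usize,
}

impl Default for Token {
    fn default() -> Token {
        Token {
            offset_from: 0,
            offset_to: 0,
            position: usize::MAX,
            // This is the heap allocation paid on every `Token::default()`.
            text: String::with_capacity(200),
            position_length: 1,
        }
    }
}

impl Token {
    /// Hypothetical helper: restores the default field values but keeps
    /// the already allocated `text` buffer (`clear` retains capacity).
    fn reset(&mut self) {
        self.offset_from = 0;
        self.offset_to = 0;
        self.position = usize::MAX;
        self.text.clear();
        self.position_length = 1;
    }
}

/// Hypothetical helper: writes the first whitespace-separated word of
/// `text` into `token`, reusing its buffer. Returns false if there is none.
fn fill_first_word(token: &mut Token, text: &str) -> bool {
    token.reset();
    let trimmed = text.trim_start();
    if let Some(word) = trimmed.split_whitespace().next() {
        token.offset_from = text.len() - trimmed.len();
        token.offset_to = token.offset_from + word.len();
        token.position = 0;
        token.text.push_str(word);
        true
    } else {
        false
    }
}
```

With this shape, a caller creates the Token once and passes it to every text, so the 200-byte buffer is allocated a single time as long as no token exceeds that capacity.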
PR https://github.com/quickwit-oss/tantivy/pull/2062 mostly fixes this.
The only remaining allocation is the BoxTokenStream per text, which could be avoided with some lifetime hacks (and probably unsafe).
Can we close this?
It would be nice to remove the per-text BoxTokenStream allocation and use the Tokenizer directly, e.g. call set_text on the Tokenizer and then get the tokens from the Tokenizer itself.
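A rough sketch of what such a set_text-style interface might look like, under stated assumptions: `WhitespaceTokenizer`, `advance`, and `token` are hypothetical names, and `Token` is a reduced local stand-in. To stay in safe Rust without lifetime parameters, this version copies the text into a buffer owned by the tokenizer, so it trades the Box allocation for a memcpy per text; a zero-copy variant would need the lifetime work mentioned above.

```rust
/// Reduced local stand-in for the token type.
#[derive(Debug, Default)]
struct Token {
    offset_from: usize,
    offset_to: usize,
    position: usize,
    text: String,
}

/// Hypothetical tokenizer that is driven directly, without a boxed stream.
#[derive(Default)]
struct WhitespaceTokenizer {
    text: String,  // owned copy of the current text, buffer reused across calls
    cursor: usize, // byte offset of the next unread character
    token: Token,  // single token instance, reused for every token
}

impl WhitespaceTokenizer {
    /// Points the tokenizer at a new text. Reuses the internal buffers,
    /// so no allocation happens once they have grown large enough.
    fn set_text(&mut self, text: &str) {
        self.text.clear();
        self.text.push_str(text);
        self.cursor = 0;
        self.token.position = usize::MAX;
    }

    /// Moves to the next token. Returns false when the text is exhausted.
    fn advance(&mut self) -> bool {
        let trimmed = self.text[self.cursor..].trim_start();
        if trimmed.is_empty() {
            self.cursor = self.text.len();
            return false;
        }
        let from = self.text.len() - trimmed.len();
        let len = trimmed.find(char::is_whitespace).unwrap_or(trimmed.len());
        let to = from + len;
        self.token.offset_from = from;
        self.token.offset_to = to;
        // usize::MAX wraps to 0 for the first token of each text.
        self.token.position = self.token.position.wrapping_add(1);
        self.token.text.clear();
        self.token.text.push_str(&self.text[from..to]);
        self.cursor = to;
        true
    }

    /// Returns the token produced by the last successful `advance`.
    fn token(&self) -> &Token {
        &self.token
    }
}
```

A caller would keep one tokenizer around and, for each text, call `set_text` followed by a `while tokenizer.advance() { ... tokenizer.token() ... }` loop.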