
Tokenizer API allocations

Open PSeitz opened this issue 2 years ago • 3 comments

Currently the tokenizer API performs a lot of allocations.

For every text encountered, text_analyzer::token_stream() is called:


impl TextAnalyzer {
    /// Creates a token stream for a given `str`.
    pub fn token_stream<'a>(&self, text: &'a str) -> BoxTokenStream<'a> {
        self.tokenizer.box_token_stream(text)
    }
}

A boxed token stream typically creates a new Token, whose Default implementation allocates a String with a capacity of 200 bytes:

impl Default for Token {
    fn default() -> Token {
        Token {
            offset_from: 0,
            offset_to: 0,
            position: usize::MAX,
            text: String::with_capacity(200),
            position_length: 1,
        }
    }
}

PSeitz avatar May 19 '23 09:05 PSeitz
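
To make the cost concrete: each `Token::default()` call allocates a fresh 200-byte buffer, whereas a token that is kept around and reset between texts allocates only once. The following is a minimal sketch of that idea, using a stand-alone copy of the struct above. The `reset` helper is hypothetical and is not claimed to be part of tantivy's API.

```rust
/// Stand-alone copy of the `Token` struct from this issue, for illustration only.
#[derive(Debug)]
pub struct Token {
    pub offset_from: usize,
    pub offset_to: usize,
    pub position: usize,
    pub text: String,
    pub position_length: usize,
}

impl Default for Token {
    fn default() -> Token {
        Token {
            offset_from: 0,
            offset_to: 0,
            position: usize::MAX,
            text: String::with_capacity(200), // heap allocation happens here
            position_length: 1,
        }
    }
}

impl Token {
    /// Hypothetical helper (an assumption, not tantivy API):
    /// restore the default field values without freeing the text buffer.
    pub fn reset(&mut self) {
        self.offset_from = 0;
        self.offset_to = 0;
        self.position = usize::MAX;
        self.text.clear(); // `String::clear` keeps the allocated capacity
        self.position_length = 1;
    }
}

fn main() {
    let mut token = Token::default(); // the only allocation
    let buffer = token.text.as_ptr();
    for text in ["hello", "world"] {
        token.reset();
        token.text.push_str(text); // fits in the existing capacity
    }
    // The same heap buffer served both texts.
    assert_eq!(token.text.as_ptr(), buffer);
    assert!(token.text.capacity() >= 200);
}
```

The sketch only holds as long as token texts fit into the existing capacity; a longer token would make the `String` grow once and then keep the larger buffer.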

This PR https://github.com/quickwit-oss/tantivy/pull/2062 mostly fixes this. The only remaining allocation is the BoxTokenStream per text, which could be avoided with some lifetime hacks (and probably unsafe).

PSeitz avatar Jun 09 '23 04:06 PSeitz

Can we close this?

fulmicoton avatar Jul 10 '23 00:07 fulmicoton

It would be nice to remove the per-text BoxTokenStream allocation and use the Tokenizer directly, e.g. call set_text on the Tokenizer and then get the tokens from the Tokenizer directly.

PSeitz avatar Jul 10 '23 01:07 PSeitz
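
The `set_text` idea from the last comment could look roughly like the sketch below. It is a stand-alone illustration under stated assumptions, not tantivy's API: `WhitespaceTokenizer`, `set_text`, `advance`, and `token` are hypothetical names, and the `Token` struct is a trimmed copy of the one in this issue. The tokenizer owns a single reusable `Token`, so switching to a new text needs neither a `Box` nor a new `String`.

```rust
/// Trimmed, stand-alone copy of `Token` (illustration only).
#[derive(Debug)]
pub struct Token {
    pub offset_from: usize,
    pub offset_to: usize,
    pub position: usize,
    pub text: String,
}

/// Hypothetical tokenizer that is driven directly instead of
/// handing out a boxed token stream per text.
pub struct WhitespaceTokenizer<'a> {
    text: &'a str,
    cursor: usize, // byte offset of the next unread character
    token: Token,  // reused for every token of every text
}

impl<'a> WhitespaceTokenizer<'a> {
    pub fn new() -> Self {
        WhitespaceTokenizer {
            text: "",
            cursor: 0,
            token: Token {
                offset_from: 0,
                offset_to: 0,
                position: usize::MAX,
                text: String::with_capacity(200), // allocated once
            },
        }
    }

    /// Point the tokenizer at a new text. No allocation happens here.
    /// Simplification: every text must outlive `'a`.
    pub fn set_text(&mut self, text: &'a str) {
        self.text = text;
        self.cursor = 0;
        self.token.position = usize::MAX;
    }

    /// Move to the next token; returns `false` when the text is exhausted.
    pub fn advance(&mut self) -> bool {
        let rest = self.text[self.cursor..].trim_start();
        if rest.is_empty() {
            return false;
        }
        let start = self.text.len() - rest.len();
        let end = start + rest.find(char::is_whitespace).unwrap_or(rest.len());
        self.token.offset_from = start;
        self.token.offset_to = end;
        self.token.position = self.token.position.wrapping_add(1);
        self.token.text.clear(); // keeps capacity
        self.token.text.push_str(&self.text[start..end]);
        self.cursor = end;
        true
    }

    /// Borrow the current token.
    pub fn token(&self) -> &Token {
        &self.token
    }
}

fn main() {
    let mut tokenizer = WhitespaceTokenizer::new();
    for text in ["hello happy tax payer", "second text"] {
        tokenizer.set_text(text);
        while tokenizer.advance() {
            let t = tokenizer.token();
            println!("{} [{}..{}] {}", t.position, t.offset_from, t.offset_to, t.text);
        }
    }
}
```

The lifetime parameter on the struct is the simplification that keeps this sketch free of unsafe code: it forces all texts to outlive the tokenizer's `'a`, which is likely too restrictive for an indexing loop that reads documents one at a time, and is presumably where the "lifetime hacks" mentioned above would come in.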