Jens Reimann comments

Results: 627 comments of Jens Reimann

[Bug]: unable to initialize

> An improvement I can imagine we could add is to look for step in the path when this message is returned to the user. If it's not found, we...

feat: add a simple way to chain two tokenizers

I applied nightly `rustfmt`.

feat: add a simple way to chain two tokenizers

My use case is to have all simple tokens plus all ngrams.
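As a rough illustration of that use case, here is a minimal std-only sketch that emits every simple token first and then every n-gram of the same text. It is not the API of the library under discussion: `simple_tokens`, `all_ngrams`, and `chained_tokens` are hypothetical names, and plain iterator chaining stands in for the chained tokenizer proposed in the PR.

```rust
/// Hypothetical stand-in for a simple tokenizer: split on
/// non-alphanumeric characters and lowercase each piece.
fn simple_tokens(text: &str) -> impl Iterator<Item = String> + '_ {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|s| !s.is_empty())
        .map(|s| s.to_lowercase())
}

/// Hypothetical stand-in for an "all n-grams" tokenizer: every character
/// n-gram of the whole text with a length in `min..=max`.
fn all_ngrams(text: &str, min: usize, max: usize) -> impl Iterator<Item = String> {
    let chars: Vec<char> = text.chars().collect();
    let mut out = Vec::new();
    for start in 0..chars.len() {
        for len in min..=max {
            if start + len > chars.len() {
                break; // longer n-grams would run past the end of the text
            }
            out.push(chars[start..start + len].iter().collect::<String>());
        }
    }
    out.into_iter()
}

/// Chain the two streams: the first is drained completely, then the second.
fn chained_tokens(text: &str, min: usize, max: usize) -> Vec<String> {
    simple_tokens(text).chain(all_ngrams(text, min, max)).collect()
}
```

Calling `chained_tokens("ab cd", 2, 3)` yields the word tokens `ab` and `cd` first, followed by the n-grams of the full string. Note that this sketch does not track token positions or offsets.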

feat: add a simple way to chain two tokenizers

I was able to incorporate most of the feedback you mentioned. It's less explicit without the `enum`, but works the same way. There was just one call to `second.advance()` missing,...

feat: add a simple way to chain two tokenizers

> If you feel like code-golfing, I think those two calls to `second.advance()` could even be merged

I like that, pushed. So, the remaining thing seems to be the position....

feat: add a simple way to chain two tokenizers

Fixed the test issue.

Ngram + Stemmer combination

> Do you have a reference for a ngram tokenizer that ends the ngram on whitespace?

The example above?

> `.filter(Stemmer::new(Language::English))` will give unexpected results

Yea, I noticed that :D...

Ngram + Stemmer combination

> I meant a reference that does the tokenization in `September`, `October` you suggested.

That's the `SimpleTokenizer` one. It gives me:

```
september
october
```

Ngram + Stemmer combination

No, it is not. I am sorry, but then I don't understand your question.

So I guess I can come close to that by somehow reversing the API:

```rust
let ngram = NgramTokenizer::all_ngrams(3, 8).unwrap();
let mut text = TextAnalyzer::builder(
    Stemmer::new(Language::English)
        .transform(LowerCaser.transform(RemoveLongFilter::limit(40).transform(SimpleTokenizer::default())))
        .chain(LowerCaser.transform(RemoveLongFilter::limit(40).transform(SimpleTokenizer::default())))
        .chain(LowerCaser.transform(ngram)),...
```