tantivy
Ngram + Stemmer combination
Using the ngram tokenizer in combination with the stemmer seems to create weird results. Consider the following setup:
Using tantivy: 0.21.0
use tantivy::tokenizer::{Language, LowerCaser, NgramTokenizer, RemoveLongFilter, Stemmer, TextAnalyzer};

// Emit all ngrams between 3 and 8 characters, then lowercase and stem each one.
let ngram = NgramTokenizer::all_ngrams(3, 8).unwrap();
let mut text = TextAnalyzer::builder(ngram)
    .filter(RemoveLongFilter::limit(40))
    .filter(LowerCaser)
    .filter(Stemmer::new(Language::English))
    .build();
Putting in the text September October turns this into:
List of tokens
sep
sept
sept
septem
septemb
septemb
ept
ept
eptem
eptemb
eptemb
eptemb
pte
ptem
ptemb
ptemb
ptember
ptember
tem
temb
temb
tember
tember
tember o
emb
emb
ember
ember
ember o
ember oc
mbe
mber
mber
mber o
mber oc
mber oct
ber
ber
ber o
ber oc
ber oct
ber octo
er
er o
er oc
er oct
er octo
er octob
r o
r oc
r oct
r octo
r octob
r octob
oc
oct
octo
octob
octob
octob
oct
octo
octob
octob
octob
cto
ctob
ctobe
ctober
tob
tobe
tober
obe
ober
ber
I would somehow expect this to be split into September and October first, and then have the processing applied to the individual tokens.
Do you have a reference for an ngram tokenizer that ends the ngram on whitespace?
RemoveLongFilter::limit(40) doesn't make sense here: an ngram token will never reach that length, since the maximum gram size is 8. .filter(Stemmer::new(Language::English)) will give unexpected results, because the stemmer is applied to ngram fragments rather than whole words.
Do you have a reference for an ngram tokenizer that ends the ngram on whitespace?
The example above?
.filter(Stemmer::new(Language::English)) will give unexpected results
Yea, I noticed that :D
Maybe the approach is wrong? Maybe I need something like:
SimpleTokenizer -> RemoveLongFilter -> (Ngram).chain(Stemmer)
I am just not sure how to get there.
I meant a reference that does the tokenization into September, October that you suggested.
I am just not sure how to get there.
I'm not sure TextAnalyzer can do that currently. You could write your own Tokenizer.
I meant a reference that does the tokenization into September, October that you suggested.
That's the SimpleTokenizer one. It gives me:
september
october
SimpleTokenizer is not an ngram tokenizer
No it is not. I am sorry, but then I don't understand your question.
So I guess I can come close to that by somehow reversing the API:
let ngram = NgramTokenizer::all_ngrams(3, 8).unwrap();
let mut text = TextAnalyzer::builder(
    Stemmer::new(Language::English)
        .transform(LowerCaser.transform(RemoveLongFilter::limit(40).transform(SimpleTokenizer::default())))
        .chain(LowerCaser.transform(RemoveLongFilter::limit(40).transform(SimpleTokenizer::default())))
        .chain(LowerCaser.transform(ngram)),
)
.build();
That still gives me things like ber octo, but mostly works.
Trying to paraphrase what PSeitz is trying to say: this is the expected behaviour of what is generally called an ngram tokenizer, i.e. it will not care about whitespace. The question for a reference was about some other system, like Lucene/Elasticsearch, which provides such an ngram tokenizer, because having a reference would a) tell us such an API is part of the state of the art and b) give us hints on how to add it here if we wanted to.
But indeed, in this case what you are looking for is to build your own Tokenizer, probably by wrapping a TextAnalyzer based on SimpleTokenizer, Stemmer, etc. and then applying NgramTokenizer to the tokens resulting from that, so that you end up with the n-grams of "septemb" and "octob" instead of those of "september october".
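For illustration, here is a rough sketch of that wrapping approach, under the assumption that you only need the token texts (a real Tokenizer would also have to produce proper positions and offsets); subword_terms is a hypothetical helper, not part of tantivy:

use tantivy::tokenizer::{
    Language, LowerCaser, NgramTokenizer, SimpleTokenizer, Stemmer, TextAnalyzer, TokenStream, Tokenizer,
};

// Hypothetical helper: stem whole words first, then ngram each stemmed word individually.
fn subword_terms(text: &str) -> Vec<String> {
    // Word-level pipeline: split on whitespace/punctuation, lowercase, stem.
    let mut words = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(LowerCaser)
        .filter(Stemmer::new(Language::English))
        .build();
    // Ngram pass applied to each stemmed word on its own.
    let mut ngram = NgramTokenizer::all_ngrams(3, 8).unwrap();

    let mut terms = Vec::new();
    let mut word_stream = words.token_stream(text);
    while word_stream.advance() {
        let word = word_stream.token().text.clone();
        let mut gram_stream = ngram.token_stream(&word);
        while gram_stream.advance() {
            terms.push(gram_stream.token().text.clone());
        }
    }
    terms
}

// subword_terms("September October") yields only ngrams of "septemb" and "octob",
// never grams spanning the whitespace such as "ber octo".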
Trying to paraphrase what PSeitz is trying to say: this is the expected behaviour of what is generally called an ngram tokenizer, i.e. it will not care about whitespace. The question for a reference was about some other system, like Lucene/Elasticsearch, which provides such an ngram tokenizer, because having a reference would a) tell us such an API is part of the state of the art and b) give us hints on how to add it here if we wanted to.
I think there is: https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
The idea we have in mind is to search for sub-words. Like having a text containing SomeHttpConnection, searching for http to find it.
I think it makes sense to combine multiple tokenizers (chaining) to first split the text into words (like the Simple one does), but then also apply ngrams and stemmers individually to those.
It feels like all of those components are there, but they are sometimes implemented as Tokenizers, sometimes as TokenFilters, and sometimes as a TextAnalyzer. It doesn't seem possible to compose the desired behavior, as their APIs don't work well together.
Ideally, I would want to create a pipeline like the one mentioned above.
The idea we have in mind is to search for sub-words. Like having a text containing SomeHttpConnection, searching for http to find it.
In that case, you might want to look at SplitCompoundWords, which will split tokens based on a user-supplied dictionary. This can be more efficient compared to the more brute-force approach of using n-grams, but its success depends entirely on the quality of the dictionary.
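For reference, a minimal sketch of how SplitCompoundWords plugs into an analyzer; the dictionary entries below are placeholders, and they are lowercased on the assumption that the filter runs after LowerCaser:

use tantivy::tokenizer::{LowerCaser, SimpleTokenizer, SplitCompoundWords, TextAnalyzer};

// Placeholder dictionary: in practice this would come from your own domain vocabulary.
let dictionary = ["some", "http", "connection"];
let mut analyzer = TextAnalyzer::builder(SimpleTokenizer::default())
    .filter(LowerCaser)
    .filter(SplitCompoundWords::from_dictionary(dictionary).unwrap())
    .build();
// "SomeHttpConnection" is lowercased to "somehttpconnection" and then split into
// "some", "http" and "connection", provided all parts are covered by the dictionary.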
(In this particular example, you might actually want to build a TokenFilter that splits camel-case identifiers, but I am not sure whether this encompasses your whole use case.)
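Such a camel-case filter is not built into tantivy, but the core splitting logic is small. A minimal sketch of just that logic (split_camel_case is a hypothetical helper; runs of consecutive capitals like "HTTPServer" would need extra handling):

// Hypothetical helper showing the splitting a camel-case TokenFilter would perform:
// "SomeHttpConnection" -> ["Some", "Http", "Connection"].
fn split_camel_case(identifier: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    for ch in identifier.chars() {
        // Start a new part at every uppercase letter.
        if ch.is_uppercase() && !current.is_empty() {
            parts.push(std::mem::take(&mut current));
        }
        current.push(ch);
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}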
In that case, you might want to look at SplitCompoundWords, which will split tokens based on a user-supplied dictionary. This can be more efficient compared to the more brute-force approach of using n-grams, but its success depends entirely on the quality of the dictionary.
Unfortunately, we don't have a dictionary, so that doesn't really work well.
(In this particular example, you might actually want to build a TokenFilter that splits camel case identifiers but I am not sure whether this encompasses your whole use case.)
I guess that would actually be one way to deal with this. I think it would be great to have more tooling around composing tokenizers and filters. I raised a PR to chain two (or more) tokenizers: https://github.com/quickwit-oss/tantivy/pull/2304 … I believe that's generic enough.