
Ngram + Stemmer combination

Open ctron opened this issue 1 year ago • 11 comments

Using ngram in combination with the stemmer seems to produce weird results. Consider the following setup:

Using tantivy: 0.21.0

use tantivy::tokenizer::{
    Language, LowerCaser, NgramTokenizer, RemoveLongFilter, Stemmer, TextAnalyzer,
};

// Emit every n-gram of length 3..=8, then lowercase and stem each one.
let ngram = NgramTokenizer::all_ngrams(3, 8).unwrap();
let mut text = TextAnalyzer::builder(ngram)
  .filter(RemoveLongFilter::limit(40))
  .filter(LowerCaser)
  .filter(Stemmer::new(Language::English))
  .build();
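
For reference, the tokens below were presumably collected by driving the analyzer's token stream along these lines:

use tantivy::tokenizer::TokenStream;

// Iterate the token stream and print each token's text.
let mut stream = text.token_stream("September October");
while let Some(token) = stream.next() {
    println!("{}", token.text);
}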

Putting in the text September October turns this into:

List of tokens
sep
sept
sept
septem
septemb
septemb
ept
ept
eptem
eptemb
eptemb
eptemb
pte
ptem
ptemb
ptemb
ptember
ptember 
tem
temb
temb
tember
tember 
tember o
emb
emb
ember
ember 
ember o
ember oc
mbe
mber
mber 
mber o
mber oc
mber oct
ber
ber 
ber o
ber oc
ber oct
ber octo
er 
er o
er oc
er oct
er octo
er octob
r o
r oc
r oct
r octo
r octob
r octob
 oc
 oct
 octo
 octob
 octob
 octob
oct
octo
octob
octob
octob
cto
ctob
ctobe
ctober
tob
tobe
tober
obe
ober
ber

I would somehow expect this to be split into September and October first, with the processing then applied to the individual tokens.

ctron · Jan 18 '24 13:01

Do you have a reference for an ngram tokenizer that ends the ngram on whitespace?

RemoveLongFilter::limit(40) doesn't make sense here, since an ngram token will never reach that length. .filter(Stemmer::new(Language::English)) will give unexpected results.
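
To illustrate why: the stemmer treats every n-gram as if it were a standalone word, so whether a fragment gets stemmed depends on the fragment alone. A minimal sketch:

use tantivy::tokenizer::{Language, LowerCaser, SimpleTokenizer, Stemmer, TextAnalyzer, TokenStream};

// Stem a few "words" that are really n-gram fragments of "september".
let mut analyzer = TextAnalyzer::builder(SimpleTokenizer::default())
    .filter(LowerCaser)
    .filter(Stemmer::new(Language::English))
    .build();
let mut stream = analyzer.token_stream("september eptember ptember");
while let Some(token) = stream.next() {
    // Snowball English yields "septemb", "eptemb", "ptember" here,
    // matching the stems visible in the token list above.
    println!("{}", token.text);
}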

PSeitz · Jan 18 '24 14:01

Do you have a reference for an ngram tokenizer that ends the ngram on whitespace?

The example above?

.filter(Stemmer::new(Language::English)) will give unexpected results

Yea, I noticed that :D

Maybe the approach is wrong? Maybe I need something like:

SimpleTokenizer -> RemoveLongFilter -> (Ngram).chain(Stemmer)

I am just not sure how to get there.

ctron · Jan 18 '24 14:01

I meant a reference that does the tokenization into September, October as you suggested.

I am just not sure how to get there.

I'm not sure TextAnalyzer can do that currently. You could write your own Tokenizer.
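
For example, a minimal sketch of such a custom tokenizer against the 0.21 Tokenizer trait (WordNgramTokenizer is just an illustrative name, and byte offsets are left out for brevity):

use tantivy::tokenizer::{Token, TokenStream, Tokenizer};

// Split on whitespace first, then emit all n-grams per word,
// so that no n-gram ever spans a word boundary.
#[derive(Clone)]
pub struct WordNgramTokenizer {
    min_gram: usize,
    max_gram: usize,
}

impl Tokenizer for WordNgramTokenizer {
    type TokenStream<'a> = WordNgramTokenStream;

    fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a> {
        let mut tokens = Vec::new();
        for (position, word) in text.split_whitespace().enumerate() {
            let chars: Vec<char> = word.chars().collect();
            for start in 0..chars.len() {
                let max_len = self.max_gram.min(chars.len() - start);
                for len in self.min_gram..=max_len {
                    tokens.push(Token {
                        text: chars[start..start + len].iter().collect(),
                        position,
                        // Byte offsets are left at their defaults in this sketch.
                        ..Token::default()
                    });
                }
            }
        }
        WordNgramTokenStream { tokens, index: 0 }
    }
}

pub struct WordNgramTokenStream {
    tokens: Vec<Token>,
    index: usize,
}

impl TokenStream for WordNgramTokenStream {
    fn advance(&mut self) -> bool {
        self.index += 1;
        self.index <= self.tokens.len()
    }
    fn token(&self) -> &Token {
        &self.tokens[self.index - 1]
    }
    fn token_mut(&mut self) -> &mut Token {
        &mut self.tokens[self.index - 1]
    }
}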

PSeitz · Jan 18 '24 14:01

I meant a reference that does the tokenization in September, October you suggested.

That's the SimpleTokenizer one. It gives me:

september
october

ctron · Jan 18 '24 14:01

SimpleTokenizer is not an ngram tokenizer

PSeitz · Jan 18 '24 14:01

No it is not. I am sorry, but then I don't understand your question.

ctron · Jan 18 '24 14:01

So I guess I can come close to that by somehow reversing the API:

let ngram = NgramTokenizer::all_ngrams(3, 8).unwrap();
let mut text = TextAnalyzer::builder(
    Stemmer::new(Language::English)
        .transform(LowerCaser.transform(RemoveLongFilter::limit(40).transform(SimpleTokenizer::default())))
        .chain(LowerCaser.transform(RemoveLongFilter::limit(40).transform(SimpleTokenizer::default())))
        .chain(LowerCaser.transform(ngram)),
)
.build();

That still gives me things like ber octo (the last chained branch still runs the plain ngram tokenizer across the whole text), but mostly works.

ctron · Jan 18 '24 15:01

Trying to paraphrase what PSeitz is trying to say: this is the expected behaviour of what is generally called an ngram tokenizer, i.e. it will not care about whitespace. The question for a reference was about some other system, like Lucene/Elasticsearch, which provides such an ngram tokenizer, because having a reference would a) tell us that such an API is part of the state of the art and b) give us hints on how to add it here if we wanted to.

But indeed, in this case what you are looking for is building your own Tokenizer. Probably by wrapping a TextAnalyzer based on SimpleTokenizer, Stemmer, etc. and then applying NgramTokenizer to the tokens resulting from that, so that you end up with the n-grams of "septemb" and "octob" instead of those of "september october".
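
Something along these lines, for instance (a sketch; stemmed_ngrams is a hypothetical helper, not a tantivy API):

use tantivy::tokenizer::{
    Language, LowerCaser, NgramTokenizer, SimpleTokenizer, Stemmer, TextAnalyzer, TokenStream,
    Tokenizer,
};

// Stem word by word first, then n-gram each stem, so the result is the
// n-grams of "septemb" and "octob" rather than of "september october".
fn stemmed_ngrams(text: &str) -> Vec<String> {
    let mut words = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(LowerCaser)
        .filter(Stemmer::new(Language::English))
        .build();
    let mut ngram = NgramTokenizer::all_ngrams(3, 8).unwrap();

    let mut out = Vec::new();
    let mut word_stream = words.token_stream(text);
    while let Some(word) = word_stream.next() {
        let stem = word.text.clone();
        let mut gram_stream = ngram.token_stream(&stem);
        while let Some(gram) = gram_stream.next() {
            out.push(gram.text.clone());
        }
    }
    out
}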

adamreichold · Jan 18 '24 15:01

Trying to paraphrase what PSeitz is trying to say: this is the expected behaviour of what is generally called an ngram tokenizer, i.e. it will not care about whitespace. The question for a reference was about some other system, like Lucene/Elasticsearch, which provides such an ngram tokenizer, because having a reference would a) tell us that such an API is part of the state of the art and b) give us hints on how to add it here if we wanted to.

I think there is: https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html

The idea we have in mind is to search for sub-words, like having a text containing SomeHttpConnection and searching for http to find it.

I think it makes sense to combine multiple tokenizers (chaining) so that the text is first split into words (like the Simple one does), with ngrams and stemmers then applied individually to those words.

It feels like all of those components are there, but they are sometimes implemented as Tokenizers, sometimes as TokenFilters, and sometimes as part of the TextAnalyzer. It doesn't seem possible to compose the desired behavior, as the APIs don't work well together.

Ideally, I would want to create some pipeline like the one mentioned above.

ctron · Jan 18 '24 16:01

The idea we have in mind is to search for sub-words, like having a text containing SomeHttpConnection and searching for http to find it.

In that case, you might want to look at SplitCompoundWords, which will split based on a user-supplied dictionary. This can be more efficient compared to the more brute-force approach of using n-grams, but its success depends entirely on the quality of the dictionary.
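
Usage is roughly like this (a sketch; the dictionary contents are purely illustrative):

use tantivy::tokenizer::{LowerCaser, SimpleTokenizer, SplitCompoundWords, TextAnalyzer};

// Splits "somehttpconnection" into "some", "http", "connection" --
// but only because the token fully decomposes into dictionary entries.
let mut analyzer = TextAnalyzer::builder(SimpleTokenizer::default())
    .filter(LowerCaser)
    .filter(SplitCompoundWords::from_dictionary(["some", "http", "connection"]).unwrap())
    .build();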

(In this particular example, you might actually want to build a TokenFilter that splits camel-case identifiers, but I am not sure whether this encompasses your whole use case.)
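
The core splitting logic for such a filter could look like this (a sketch; wiring it into tantivy's TokenFilter trait is left out, and acronym runs like "HTTPConnection" would need extra handling):

fn split_camel_case(ident: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    for ch in ident.chars() {
        // Start a new part at each uppercase letter.
        if ch.is_uppercase() && !current.is_empty() {
            parts.push(std::mem::take(&mut current));
        }
        current.extend(ch.to_lowercase());
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}

// split_camel_case("SomeHttpConnection") == ["some", "http", "connection"]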

adamreichold · Jan 18 '24 17:01

In that case, you might want to look at SplitCompoundWords, which will split based on a user-supplied dictionary. This can be more efficient compared to the more brute-force approach of using n-grams, but its success depends entirely on the quality of the dictionary.

Unfortunately, we don't have a dictionary, so that doesn't really work well.

(In this particular example, you might actually want to build a TokenFilter that splits camel-case identifiers, but I am not sure whether this encompasses your whole use case.)

I guess that would actually be one way to deal with this. I think it would be great to have more tooling around composing tokenizers and filters. I raised a PR to chain two (or more) tokenizers: https://github.com/quickwit-oss/tantivy/pull/2304 … I believe that's generic enough.

ctron · Jan 19 '24 07:01