
Add shingle token filter or token n-grams

fmassot opened this issue on Nov 14 '21 · 3 comments

I thought this was already present in tantivy, but for now there is only an NgramTokenizer, which splits words into character n-grams.

Lucene offers a ShingleFilter which creates shingles, i.e. token n-grams: it produces combinations of consecutive tokens rather than of letters.
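To illustrate the distinction, here is a minimal standalone Rust sketch of what a shingle filter over a token stream would emit. The `shingles` helper is hypothetical and not part of tantivy's (or Lucene's) API; it only shows the intended output shape.

```rust
// Hypothetical helper: build token n-grams ("shingles") of a given size
// by joining consecutive tokens, rather than consecutive characters.
fn shingles(tokens: &[&str], size: usize) -> Vec<String> {
    tokens
        .windows(size)
        .map(|window| window.join(" "))
        .collect()
}

fn main() {
    let tokens = ["please", "divide", "this", "sentence"];

    // Token 2-grams (shingles):
    // ["please divide", "divide this", "this sentence"]
    println!("{:?}", shingles(&tokens, 2));

    // By contrast, a character n-gram tokenizer like tantivy's NgramTokenizer
    // would emit fragments such as "pl", "le", "ea", ... for the word "please".
}
```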

For example, this dataset publishes token n-grams, and it would be interesting to index it with tantivy rather than keeping it as an SQL dump.

fmassot commented Nov 14 '21

Hi, I was also trying to implement a shingle filter. I opened a PR, but it is incomplete; I tried to explain where I got stuck in the description.

mocobeta commented Nov 14 '21

@fmassot I am not sure I understand how the shingle filter could help with the ngram dataset.

fulmicoton commented Nov 15 '21

@fulmicoton ah yes, that was not clear. My idea was to process the article contents directly, not the n-gram dataset, which only exists because of legal constraints.

fmassot commented Nov 19 '21