tantivy
Add shingle token filter or token n-grams
I thought this was already present in tantivy, but for now there is only a NgramTokenizer,
which splits words into character n-grams.
Lucene offers a ShingleFilter, which creates shingles, i.e. token n-grams: it combines tokens rather than letters.
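To make the distinction concrete, here is a minimal sketch in plain Rust (no tantivy dependency; the function names are hypothetical, not part of any API) contrasting the character n-grams that a NgramTokenizer produces with the shingles (token n-grams) that a ShingleFilter-style component would produce:

```rust
/// Character n-grams of a single word, size `n`
/// (what an NgramTokenizer-style component emits).
fn char_ngrams(word: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    if chars.len() < n {
        return vec![];
    }
    (0..=chars.len() - n)
        .map(|i| chars[i..i + n].iter().collect())
        .collect()
}

/// Shingles (token n-grams) of size `n` over a token sequence
/// (what a ShingleFilter-style component emits).
fn shingles(tokens: &[&str], n: usize) -> Vec<String> {
    if tokens.len() < n {
        return vec![];
    }
    (0..=tokens.len() - n)
        .map(|i| tokens[i..i + n].join(" "))
        .collect()
}

fn main() {
    // Character n-grams split letters inside a single word:
    println!("{:?}", char_ngrams("search", 3));
    // → ["sea", "ear", "arc", "rch"]

    // Shingles combine whole tokens:
    println!("{:?}", shingles(&["full", "text", "search"], 2));
    // → ["full text", "text search"]
}
```

The key difference is the unit being combined: letters within one token versus adjacent tokens in the stream.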
For example, this dataset publishes token n-grams, and it would be interesting to index it with tantivy instead of relying on a SQL dump.
Hi, I was also trying to implement a shingle filter. I left a PR, but it's incomplete; I tried to explain where I got stuck in the description.
@fmassot I am not sure I understand how the shingle filter could help with the ngram dataset.
@fulmicoton ah yes, that was not clear. My idea was to process article contents directly, not the ngram dataset, which only exists because of legal constraints.