stronglink
Stemming plugins
Our search indexer currently uses the Porter stemming algorithm from SQLite FTS3. We've already tweaked it to ignore underscores, but it still has several other limitations, mainly with languages other than English and with certain search terms (such as proper names that end in "s", and some other words that it stems badly).
The ideal solution would be automatically detecting the language of each word and stemming according to that language's grammar rules, but I don't know of such an algorithm that is publicly (and freely) available.
I think the practical approach is to let the user choose a custom stemmer for each repository. By default we could try to include the best stemmer for each natural language.
That still isn't ideal for bilingual users, of course.
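As a rough illustration of the per-repository idea, here is a minimal sketch in Python. Everything in it is hypothetical: the `stemmer` config key, the `STEMMERS` registry, and the toy `naive_english_stemmer` (a stand-in, not the Porter algorithm) are assumptions made for the example, not part of the existing indexer.

```python
from typing import Callable, Dict

# A stemmer maps one word to its stem. (Hypothetical type for this sketch.)
Stemmer = Callable[[str], str]


def identity_stemmer(word: str) -> str:
    """No stemming at all, only case folding."""
    return word.lower()


def naive_english_stemmer(word: str) -> str:
    """Toy suffix stripper standing in for a real English stemmer.

    This is NOT the Porter algorithm; it only shows the plugin shape.
    """
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters of stem.
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w


# Hypothetical registry of available stemmers, keyed by name.
STEMMERS: Dict[str, Stemmer] = {
    "none": identity_stemmer,
    "english": naive_english_stemmer,
}

# Assumed default when a repository does not choose a stemmer.
DEFAULT_STEMMER = "english"


def stemmer_for_repo(config: dict) -> Stemmer:
    """Look up the stemmer named in a repository's (hypothetical) config."""
    name = config.get("stemmer", DEFAULT_STEMMER)
    if name not in STEMMERS:
        raise ValueError(f"unknown stemmer: {name!r}")
    return STEMMERS[name]
```

A repository that sets `{"stemmer": "none"}` would index words unchanged, while one with no setting would fall back to the default. Note that the toy English stemmer also turns "James" into "jame", which is the same kind of proper-name problem described above.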
I think SQLite FTS3 does its stemming inside its tokenizer interface and already ships with a few other tokenizers, so if we stick to that interface we can support other stemmers quite easily.
See also https://sqlite.org/fts3.html#tokenizer
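To show what switching tokenizers looks like in practice, here is a small sketch using Python's standard `sqlite3` module. It assumes the underlying SQLite library was built with FTS3 enabled, and the helper name `matching_rows` and the sample documents are made up for the example. It compares the built-in `porter` and `simple` tokenizers on the same data.

```python
import sqlite3
from typing import List

# Only built-in FTS3 tokenizer names are accepted here, because the name
# is interpolated into the CREATE VIRTUAL TABLE statement.
ALLOWED_TOKENIZERS = {"simple", "porter"}


def matching_rows(tokenizer: str, docs: List[str], query: str) -> List[str]:
    """Index docs in an in-memory FTS3 table and return the rows matching query."""
    if tokenizer not in ALLOWED_TOKENIZERS:
        raise ValueError(f"unsupported tokenizer: {tokenizer!r}")
    conn = sqlite3.connect(":memory:")
    try:
        # The tokenizer is chosen per table when the table is created.
        conn.execute(
            f"CREATE VIRTUAL TABLE docs USING fts3(body, tokenize={tokenizer})"
        )
        conn.executemany(
            "INSERT INTO docs(body) VALUES (?)", [(d,) for d in docs]
        )
        rows = conn.execute(
            "SELECT body FROM docs WHERE docs MATCH ? ORDER BY rowid", (query,)
        )
        return [row[0] for row in rows]
    finally:
        conn.close()


if __name__ == "__main__":
    sample = ["running shoes", "a short run"]
    print("porter:", matching_rows("porter", sample, "run"))
    print("simple:", matching_rows("simple", sample, "run"))
```

With `porter`, a search for "run" also finds "running shoes", because both words reduce to the same stem. With `simple`, only the exact word matches. Since the tokenizer is fixed when the table is created, a per-repository choice would presumably mean one index per repository, and changing the stemmer would require reindexing.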