lunr-languages
Problem in Spanish: search doesn't work if the word is typed without its accent mark.
Let's say I have an index created. The Spanish word "Respiración" is stemmed as "respir".
That's correct.
Now I run a search, but the user leaves out the accent mark and types "respiracion" (no accent on the last "o"). lunr won't stem that word and leaves it as "respiracion", so no matches are found.
I know that stemming assumes the word is correctly spelled, BUT since almost no user types accents correctly when searching, this makes lunr useless for many words.
I made a workaround: removing accents before the stemmer in the pipeline (I remove them with normalize-strings).
But this also removes a lot of the benefit of stemming, because those words will never be stemmed.
var lunr = require('lunr');
var normalize = require('normalize-strings');

var normalizeLunrPlugin = function(builder, stemmer) {
  // Strip accents from each token before it reaches the stemmer
  var pipelineFunction = function(token) {
    return token.update(function(word) {
      return normalize(word);
    });
  };

  // Register the pipeline function so the index can be serialised
  lunr.Pipeline.registerFunction(pipelineFunction, 'normalizeLunrPlugin');

  // Add the pipeline function to both the indexing pipeline and the
  // searching pipeline
  builder.pipeline.before(stemmer, pipelineFunction);
  builder.searchPipeline.before(stemmer, pipelineFunction);
};
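For anyone who wants to try the same accent removal without the normalize-strings dependency, here is a minimal sketch. It assumes a runtime that supports String.prototype.normalize; the helper name stripAccents is made up for illustration and is not part of lunr or lunr-languages.

```javascript
// Hypothetical helper, not part of lunr or lunr-languages.
// Decomposes each character (NFD) and drops the combining marks,
// so "ó" becomes "o". Note that this also folds "ñ" into "n".
function stripAccents(word) {
  return word.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}
```

It could be called in place of normalize(word) inside token.update in the plugin above.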
My suggestion is to run two stemmers in the pipeline, one for accented words and one for unaccented words, so that a word like "respiracion", which the first stemmer leaves intact, is picked up by the second one and stemmed correctly.
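To make the idea concrete, here is a toy sketch of the fallback behaviour. It is not the real Spanish stemmer, which has many more rules; the suffix lists and the names stripSuffix and stemWithFallback are invented purely for illustration.

```javascript
// Toy illustration only: all names and suffix lists are hypothetical
// and cover just a couple of endings.
var ACCENTED_SUFFIXES = ['ación', 'ución'];
// The same suffixes as users tend to type them, without accent marks
var UNACCENTED_SUFFIXES = ['acion', 'ucion'];

// Remove the first matching suffix, or return the word unchanged
function stripSuffix(word, suffixes) {
  for (var i = 0; i < suffixes.length; i++) {
    var suffix = suffixes[i];
    if (word.length > suffix.length && word.slice(-suffix.length) === suffix) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// First pass: accented rules. Second pass: only if the first pass
// left the word intact, try the unaccented rules.
function stemWithFallback(word) {
  var lower = word.normalize('NFC').toLowerCase();
  var stemmed = stripSuffix(lower, ACCENTED_SUFFIXES);
  if (stemmed !== lower) {
    return stemmed;
  }
  return stripSuffix(lower, UNACCENTED_SUFFIXES);
}
```

With this toy version both "Respiración" and "respiracion" end up with the same stem, which is the behaviour I would like from the real pipeline.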