lunr-languages

Problem in Spanish: search doesn't work if the word is typed without its accent mark.

Open jigarzon opened this issue 5 years ago • 2 comments

Let's say I have an index already built. The Spanish word "Respiración" is stemmed as "respir".

That's correct.

Now I run a search, but the user leaves out the accent mark and types "respiracion" (no accent on the last "o"). Lunr won't stem that word; it leaves it as "respiracion", so no matches are found.

I know stemming assumes that the word is spelled correctly, but since almost no user types accents correctly when searching, this makes lunr useless for many words.

jigarzon, Oct 07 '19 16:10

I made a workaround: removing accents before the stemmer in the pipeline (I remove accents using normalize-strings).

But this also throws away many of the benefits of stemming, because the words that lost their accent will never be stemmed.

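var lunr = require('lunr');
var normalize = require('normalize-strings');

// Plugin: strips accents from every token before it reaches the stemmer
var normalizeLunrPlugin = function(builder, stemmer) {
  var pipelineFunction = function(token) {
    // token.update replaces the token's string with the returned value
    return token.update(function(word) {
      return normalize(word);
    });
  };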

  // Register the pipeline function so the index can be serialised
  lunr.Pipeline.registerFunction(pipelineFunction, 'normalizeLunrPlugin');

  // Add the pipeline function to both the indexing pipeline and the
  // searching pipeline
  builder.pipeline.before(stemmer, pipelineFunction);
  builder.searchPipeline.before(stemmer, pipelineFunction);
};
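
For anyone who wants to try the same accent removal without the extra dependency, here is a minimal sketch using only built-in string methods. The `stripAccents` name is made up for this example, and this is not necessarily what normalize-strings does internally:

```javascript
// Strip diacritics using only built-in string methods.
// NFD splits "ó" into "o" + U+0301 (combining acute accent);
// the regex then drops every combining mark in U+0300..U+036F.
function stripAccents(word) {
  return word.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(stripAccents('Respiración')); // "Respiracion"
console.log(stripAccents('respiracion')); // "respiracion"
```

Note that this also turns "ñ" into "n" and "ü" into "u", which may or may not be what you want for Spanish.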

jigarzon, Oct 07 '19 16:10

My suggestion is to run two stemmers in the pipeline, one for accented words and one for unaccented words, so that "respiracion" without the accent, which the first stemmer leaves intact, is picked up by the second one and stemmed correctly.
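
A rough sketch of how that two-pass idea could look, under some loud assumptions: the suffix table, `restoreSuffixAccent`, `stemWithFallback` and `toyStem` are all hypothetical names for this example, the table has only a few illustrative entries, and `toyStem` is a toy stand-in, not the real Spanish stemmer. Wiring this into the lunr pipeline (which works on tokens, not plain strings) is not shown:

```javascript
// Hypothetical table: unaccented suffix -> accented spelling.
// Only a few illustrative entries; a real table would have to cover
// every accented suffix the stemmer knows about.
var ACCENTED_SUFFIXES = [
  ['acion', 'ación'],
  ['ucion', 'ución'],
  ['logia', 'logía']
];

// Put the accent back on a known suffix; other words pass through unchanged.
function restoreSuffixAccent(word) {
  for (var i = 0; i < ACCENTED_SUFFIXES.length; i++) {
    var plain = ACCENTED_SUFFIXES[i][0];
    var accented = ACCENTED_SUFFIXES[i][1];
    if (word.length > plain.length && word.slice(-plain.length) === plain) {
      return word.slice(0, -plain.length) + accented;
    }
  }
  return word;
}

// First pass: stem the word as typed.
// Second pass: only if the first pass changed nothing, restore the
// suffix accent and stem again.
function stemWithFallback(word, stem) {
  var stemmed = stem(word);
  if (stemmed !== word) return stemmed;
  return stem(restoreSuffixAccent(word));
}

// Toy stand-in for the stemmer, for demonstration only.
function toyStem(word) {
  return word.replace(/ación$/, '');
}

console.log(stemWithFallback('respiración', toyStem)); // "respir"
console.log(stemWithFallback('respiracion', toyStem)); // "respir"
console.log(stemWithFallback('casa', toyStem));        // "casa"
```

With something like this, correctly accented words would still go through the normal stemmer untouched, and only the words it leaves intact would get the second attempt.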

jigarzon, Oct 07 '19 16:10