lunr-languages

Not able to search for just numbers in lunr.de

Open cadamini opened this issue 3 years ago • 5 comments

Problem

In my German and English test documents I have content with the term Port 1234, but searching for 1234 does not work.

Has anyone seen the same or a similar problem? Any ideas?

More tests

  • Searching for Port 1234 works fine.
  • Searching for 1234 in an English document works fine, using the base lunr.js version 2.3.8
  • Using this.use(lunr.de) makes it possible to find German umlauts, but numbers can no longer be found.

Test Code

// The required JS files are correctly included in the site's head

var idx = lunr(function () {
  this.use(lunr.de)
  this.ref('id')
  this.field('text')

  this.add({
    id: 1,
    text: "Port 1234 is a good port for testing a problem"
  })
})

console.log(idx.search('1234'));
console.log(idx.search('Port 1234'));

Result

(Screenshot, 2020-09-12 at 23:58:43)

cadamini, Sep 12 '20 21:09

My Russian docs have the same problem (I use MkDocs, if that matters).

andrewzola, Oct 06 '20 11:10

Related to the trimmer. If I remove the trimmer completely, it works.
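
A minimal sketch of that workaround, assuming lunr and lunr.de are loaded as in the test code above. The helper name buildGermanIndexWithoutTrimmer is hypothetical; pipeline.remove is lunr's own pipeline method. Without the trimmer, tokens keep any surrounding punctuation, so this is a diagnostic aid rather than a polished fix.

```javascript
// Hypothetical helper: builds a German index but drops lunr.de.trimmer,
// so numeric tokens such as "1234" are not trimmed to an empty string.
// Assumes `lunr` already has the stemmer support and lunr.de registered.
function buildGermanIndexWithoutTrimmer(lunr, docs) {
  return lunr(function () {
    this.use(lunr.de);
    // lunr.de installs trimmer, stop word filter and stemmer; remove the first.
    this.pipeline.remove(lunr.de.trimmer);
    this.ref("id");
    this.field("text");
    docs.forEach(function (doc) {
      this.add(doc);
    }, this);
  });
}

// Usage, with lunr loaded on the page:
// var idx = buildGermanIndexWithoutTrimmer(lunr, [
//   { id: 1, text: "Port 1234 is a good port for testing a problem" }
// ]);
// console.log(idx.search("1234"));
```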

The word characters defined in line 74 are really strange:

lunr.de.wordCharacters = "A-Za-z\xAA\xBA\xC0-\xD6\xD8-\xF6\xF8-\u02B8\u02E0-\u02E4\u1D00-\u1D25\u1D2C-\u1D5C\u1D62-\u1D65\u1D6B-\u1D77\u1D79-\u1DBE\u1E00-\u1EFF\u2071\u207F\u2090-\u209C\u212A\u212B\u2132\u214E\u2160-\u2188\u2C60-\u2C7F\uA722-\uA787\uA78B-\uA7AD\uA7B0-\uA7B7\uA7F7-\uA7FF\uAB30-\uAB5A\uAB5C-\uAB64\uFB00-\uFB06\uFF21-\uFF3A\uFF41-\uFF5A";

Translates to:

ʸˠ ˤᴀ ᴥᴬ ᵜᵢ ᵥᵫ ᵷᵹ ᶾḀ ỿⁱⁿₐ ₜKÅℲⅎⅠ ↈⱠ ⱿꜢ ꞇꞋ ꞭꞰ ꞷꟷ ꟿꬰ ꭚꭜ ꭤff stA Za z

Potential solution:

lunr.de.wordCharacters = "A-Za-züÜÄäÖöß0-9";
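
To illustrate why a character class without digits drops pure numbers, here is a standalone sketch of a trimmer that strips every leading and trailing character not listed in wordCharacters. This is how lunr.trimmerSupport.generateTrimmer appears to work. The makeTrimmer helper and the shortened character list are illustrative assumptions, not the library's code.

```javascript
// Hypothetical helper: mimics a trimmer that strips leading/trailing
// characters not listed in `wordCharacters` (a regex character-class body).
function makeTrimmer(wordCharacters) {
  const startRegex = new RegExp("^[^" + wordCharacters + "]+");
  const endRegex = new RegExp("[^" + wordCharacters + "]+$");
  return function (token) {
    return token.replace(startRegex, "").replace(endRegex, "");
  };
}

// Shortened stand-in for the original list: letters only, no digits.
const lettersOnly = makeTrimmer("A-Za-z\xC0-\xD6\xD8-\xF6");
// The proposed fix: German letters plus digits.
const lettersAndDigits = makeTrimmer("A-Za-züÜÄäÖöß0-9");

console.log(JSON.stringify(lettersOnly("1234")));      // "" (the whole token is trimmed away)
console.log(JSON.stringify(lettersOnly("Port")));      // "Port"
console.log(JSON.stringify(lettersAndDigits("1234"))); // "1234"
```

A token that is trimmed to an empty string never reaches the index, which would explain why searching for 1234 alone returns nothing.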

cadamini, Dec 08 '20 12:12

lunr.de.wordCharacters = "A-Za-züÜÄäÖöß0-9";

I noticed that the German support was also breaking * wildcard support. This change fixes that as well.

khawkins98, May 26 '21 10:05

I was facing the same issue: no results for numeric searches. I then found that appending '\0-9' to the end of line 74 enables numeric searching.

lunr.es.wordCharacters = "A-Za-z\xAA\xBA\xC0-\xD6\xD8-\xF6\xF8-\u02B8\u02E0-\u02E4\u1D00-\u1D25\u1D2C-\u1D5C\u1D62-\u1D65\u1D6B-\u1D77\u1D79-\u1DBE\u1E00-\u1EFF\u2071\u207F\u2090-\u209C\u212A\u212B\u2132\u214E\u2160-\u2188\u2C60-\u2C7F\uA722-\uA787\uA78B-\uA7AD\uA7B0-\uA7B7\uA7F7-\uA7FF\uAB30-\uAB5A\uAB5C-\uAB64\uFB00-\uFB06\uFF21-\uFF3A\uFF41-\uFF5A\0-9";

I think it could be a config option in the future.
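
One caveat with the '\0-9' suffix is worth checking: in a JavaScript string literal, \0 is the NUL character, so the resulting class holds the range U+0000 to U+0039, not only the digits. Numbers are kept, but spaces and most ASCII punctuation also count as word characters. A small standalone check, with illustrative variable names:

```javascript
// "\0-9" in a string literal is NUL, "-", "9": a range from U+0000 to "9".
const nulToNine = new RegExp("^[" + "\0-9" + "]$");
// "0-9" is the plain digit range.
const digitsOnly = new RegExp("^[" + "0-9" + "]$");

// Prints each character with the result of both tests.
for (const ch of ["7", "!", ".", " ", ":"]) {
  console.log(JSON.stringify(ch), nulToNine.test(ch), digitsOnly.test(ch));
}
```

If only digits are wanted, appending plain 0-9 (as in the lunr.de suggestion above) is the narrower choice.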

pizaranha, Aug 19 '22 19:08

As of ES6, regular expressions in JavaScript support the unicode flag, so I am fairly sure this could be used to simplify the trimmer function for all languages when creating the search index. Some of the language implementations seem to use the trimmer during search too, so it may not work for those. Here is an example on regex101 (https://regex101.com/r/pQMvFL/1); it works for languages with Latin and non-Latin characters. I have just implemented this in an Angular 14 site to clean the start and end of the search term before executing the search. I also hardcoded it into the lunr.trimmerSupport.generateTrimmer function and ran the tests; all of them seem to pass, which is a good sign that it will work.

@MihaiValentin I can put this into a PR if you like, but since it relies on ES6 it is probably not as backwards compatible as what is currently there.
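
The unicode-flag idea can be sketched roughly as follows. The pattern is an assumption and not necessarily the one behind the regex101 link; unicodeTrim is a hypothetical name. Note that the u flag arrived with ES2015 (ES6), while the \p{...} property escapes used here were added in ES2018, so the compatibility cost is somewhat higher than ES6 alone.

```javascript
// Sketch of a language-agnostic trimmer using Unicode property escapes.
// \p{L} matches any letter, \p{N} any number; both require the `u` flag.
const startRegex = /^[^\p{L}\p{N}]+/u;
const endRegex = /[^\p{L}\p{N}]+$/u;

// Hypothetical helper: strips non-letter, non-number characters from both ends.
function unicodeTrim(term) {
  return term.replace(startRegex, "").replace(endRegex, "");
}

console.log(unicodeTrim("1234"));      // 1234
console.log(unicodeTrim("«Straße»"));  // Straße
console.log(unicodeTrim("...Порт!"));  // Порт
```

Because the class is defined by Unicode categories rather than hand-written ranges, digits and non-Latin scripts are covered without a per-language wordCharacters list.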

blackwidow207, Nov 17 '22 02:11