
Non-ASCII letters not recognised

Open premasagar opened this issue 6 years ago • 2 comments

For this simple query in Spanish, with empty stopwords (or with stopwords; it doesn't matter):

rake.generate("Cuantos años tienes?", {stopwords: []})

I get the error:

TypeError: Cannot read property 'forEach' of null
    at phraseList.forEach
    at Array.forEach
    at Rake.calculatePhraseScores

If I omit the stopwords, then there is no error, but the word "años" is incorrectly split up:

rake.generate("Cuantos años tienes?")

=> [ 'ños tienes', 'Cuantos' ]

I think the code is treating the ñ as a word-break character. In the second example that splits the word. In the first example it leads to the single character ñ being used as a whole phrase in the function calculatePhraseScores, which causes the error. The wordList regex seems to accept only 0-9a-z as word characters, which is incomplete for any text with non-ASCII letters.
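A minimal sketch of the suspected mechanism, not the library's actual code: the pattern `asciiWords` below is a hypothetical stand-in for an ASCII-only word regex, and `unicodeWords` is one possible Unicode-aware replacement. It shows both symptoms in isolation: the word is split around ñ, and a phrase made only of ñ produces `null` from `String.prototype.match`, so a later `.forEach` on that result would throw.

```javascript
// Hypothetical ASCII-only tokenizer (an assumption, standing in for the wordList regex).
const asciiWords = /[a-z0-9]+/gi;
// Hypothetical Unicode-aware tokenizer: letters, combining marks, ASCII digits.
// Property escapes such as \p{L} require the `u` flag.
const unicodeWords = /[\p{L}\p{M}0-9]+/gu;

const query = "Cuantos a\u00f1os tienes?"; // "Cuantos años tienes?"

const asciiTokens = query.match(asciiWords);
const unicodeTokens = query.match(unicodeWords);
console.log(asciiTokens);   // [ 'Cuantos', 'a', 'os', 'tienes' ]
console.log(unicodeTokens); // [ 'Cuantos', 'años', 'tienes' ]

// With the `g` flag, match() returns null (not an empty array) when nothing matches,
// so a phrase consisting only of "ñ" gives null under the ASCII-only pattern.
const loneLetter = "\u00f1".match(asciiWords);
console.log(loneLetter); // null
```

Guarding the result with something like `(phrase.match(pattern) || [])` would avoid the TypeError, but it would only hide the splitting problem rather than fix it.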

premasagar, May 08 '18 16:05

The same happens when the text contains any accented vowels (áéíóú).

mmanriquezl, Aug 24 '18 15:08

Late to the party, but on my own fork I'm testing changing this line https://github.com/waseem18/node-rake/blob/123894eec17250810c8ef738e17416254d85376f/index.js#L43 as indicated by @premasagar to this:

phrase.match(/[,.!?;:/‘’“”]|\b[\p{L}\p{M}']+\b/giu);

This supports any Unicode letter and some Unicode marks, basically making this code work with any language. See Regex Unicode

(Edit: forgot the backslash in \p)
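One caveat worth checking before relying on the pattern above: in JavaScript, `\b` is defined in terms of `\w`, which stays ASCII-based even with the `u` flag, so a word that starts or ends with a non-ASCII letter may be truncated. The sketch below compares the proposed pattern with a hypothetical variant (`variant` is not from the thread) that drops `\b` and instead allows apostrophes only between letters.

```javascript
// Pattern as proposed in the comment above.
const proposed = /[,.!?;:/‘’“”]|\b[\p{L}\p{M}']+\b/giu;

// Hypothetical variant (an assumption, not from the thread): no \b;
// runs of letters/marks, optionally joined by internal apostrophes.
const variant = /[,.!?;:/‘’“”]|[\p{L}\p{M}]+(?:'[\p{L}\p{M}]+)*/gu;

const text = "\u00e9l comi\u00f3 a\u00f1os"; // "él comió años"

// \b sees no boundary before "é" or after "ó", because neither is a \w character.
const withBoundaries = text.match(proposed);
const withoutBoundaries = text.match(variant);

console.log(withBoundaries);    // [ 'l', 'comi', 'años' ]
console.log(withoutBoundaries); // [ 'él', 'comió', 'años' ]
```

Because the character class is greedy and matching proceeds left to right, each match is already a maximal run of letters, so the explicit boundaries are not needed to keep words whole.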

fmalk, Mar 29 '23 03:03