classifier

Add Cyrillic Support

Open kolarski opened this issue 11 years ago • 3 comments

Sadly, the classifier does not handle non-ASCII (e.g. Cyrillic) text. The problem lies here:

getWords : function(doc) {
  if (_(doc).isArray()) {
    return doc;
  }
  var words = doc.split(/\W+/);
  return _(words).uniq();
}

doc.split(/\W+/);

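does not work for non-ASCII text: in JavaScript, \W matches every character outside [A-Za-z0-9_], so Cyrillic letters are treated as separators.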

Here is an example in a language written in Cyrillic (like Russian):

"Надежда за обич еп.36 Тест".split(/\W+/);

This returns:

[ "", "36", "" ]

Instead, it should return something like this:

[ "Надежда", "за", "обич", "еп", "36", "Тест"]
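
To check this behaviour locally, here is a minimal sketch (the variable names are illustrative and not part of the library):

```javascript
// \w covers only ASCII letters, digits and underscore in JavaScript,
// so a Cyrillic letter is matched by \W and acts as a separator.
const sample = "Надежда за обич еп.36 Тест";

const latinIsWord = /\w/.test("a");
const cyrillicIsWord = /\w/.test("Н");

// Everything except "36" is consumed as separator text.
const tokens = sample.split(/\W+/);

console.log(latinIsWord, cyrillicIsWord, tokens);
```

The empty strings at both ends appear because the input starts and ends with separator characters.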

A fix is provided below:

Replace

/\W+/

with

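/[^a-zA-ZА-Яа-яЁё0-9_]+/

for Cyrillic support. The А-Я and а-я ranges must be typed with Cyrillic letters, not the look-alike Latin A and a; Ё and ё are listed separately because they fall outside those ranges.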
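
As a rough sketch of how the replacement could look in context, here is a standalone version of getWords. It is an assumption-laden illustration, not the library's code: it swaps the underscore helpers for `Array.isArray` and `Set`, writes the Cyrillic ranges as `\u` escapes to avoid look-alike characters, and additionally filters out empty strings.

```javascript
// Hypothetical standalone getWords; SEPARATORS is an illustrative name.
// \u0410-\u044F is the range А-я; \u0401 and \u0451 are Ё and ё,
// which lie outside that range and are listed explicitly.
const SEPARATORS = /[^a-zA-Z0-9_\u0401\u0410-\u044F\u0451]+/;

function getWords(doc) {
  if (Array.isArray(doc)) {
    return doc;
  }
  // split() yields empty strings when the text starts or ends
  // with a separator, so drop them before de-duplicating.
  const words = doc.split(SEPARATORS).filter((w) => w.length > 0);
  // A Set keeps the first occurrence of each word, in order.
  return Array.from(new Set(words));
}

console.log(getWords("Надежда за обич еп.36 Тест"));
```

This only covers the basic Russian and Bulgarian alphabets plus ASCII; other scripts would still be split apart.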

Jul 16 '14 15:07 kolarski

@kolarski While this fixes your concrete problem, it would be far more scalable to switch to xregexp: https://github.com/slevithan/xregexp#unicode, where you have proper "letter" classes.
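
For comparison, a similar "letter class" approach is available without a library in engines that support ES2018 Unicode property escapes. This is a sketch of that native alternative, not of XRegExp's API; `NON_WORD` and `tokenize` are hypothetical names.

```javascript
// Split on anything that is not a letter (\p{L}), a number (\p{N})
// or an underscore, in any script. The `u` flag is required for \p{...}.
const NON_WORD = /[^\p{L}\p{N}_]+/u;

function tokenize(text) {
  // Drop empty strings produced by leading or trailing separators.
  return text.split(NON_WORD).filter((w) => w.length > 0);
}

console.log(tokenize("Надежда за обич еп.36 Тест"));
console.log(tokenize("日本語 テスト"));
```

Combining marks belong to `\p{M}` rather than `\p{L}`, so text with decomposed accents would need that class added as well.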

Jul 16 '14 16:07 tomayac

Agree, probably the best solution

Jul 17 '14 07:07 kolarski

Thanks for the pull request.

I'm no longer actively maintaining this repo. Try natural's Bayesian classifier for an alternative.

Jul 21 '14 08:07 harthur
