classifier
Add Cyrillic Support
Sadly, it does not support non-ASCII (UTF-8) text. The problem lies here:
getWords : function(doc) {
  if (_(doc).isArray()) {
    return doc;
  }
  var words = doc.split(/\W+/);
  return _(words).uniq();
}
doc.split(/\W+/);
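does not work for non-ASCII text: in JavaScript, \W matches every character outside [A-Za-z0-9_], so Cyrillic letters are treated as separators.
Here is an example with a Cyrillic-script language (like Russian):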
"Надежда за обич еп.36 Тест".split(/\W+/);
This returns:
[ "", "36", "" ]
Instead, it should return something like this:
[ "Надежда", "за", "обич", "еп", "36", "Тест"]
The fix is provided below:
Replace
/\W+/
with
/[^a-zA-ZА-Яа-я0-9_]+/
for Cyrillic support. (Here А-Я and а-я are the Cyrillic ranges U+0410 to U+044F; Ё and ё fall outside them.)
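As a rough illustration of the replacement, here is a minimal standalone sketch. It is not the library's actual code: the name getWordsCyrillic is hypothetical, plain JavaScript stands in for the underscore calls so the snippet runs on its own, and the Cyrillic range is written with \u escapes (plus Ё and ё) so the endpoints are unambiguous. It also drops the empty strings that split() can leave at the edges, which the original getWords does not do.

```javascript
// Sketch only: a standalone tokenizer, not the library's actual getWords.
// \u0410-\u044F covers Cyrillic А-Я and а-я; \u0401 and \u0451 add Ё and ё.
var WORD_SEPARATOR = /[^a-zA-Z0-9_\u0410-\u044F\u0401\u0451]+/;

// Hypothetical helper name, chosen for this example.
function getWordsCyrillic(doc) {
  if (Array.isArray(doc)) {
    return doc;
  }
  var seen = {};
  return doc.split(WORD_SEPARATOR).filter(function (word) {
    // Drop empty strings left by leading/trailing separators, and duplicates.
    if (word === "" || Object.prototype.hasOwnProperty.call(seen, word)) {
      return false;
    }
    seen[word] = true;
    return true;
  });
}

console.log(getWordsCyrillic("Надежда за обич еп.36 Тест"));
```

With the example string from this issue, this should log the six words Надежда, за, обич, еп, 36 and Тест, with no empty entries.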
@kolarski While this fixes your concrete problem, it would be far more scalable to switch to xregexp: https://github.com/slevithan/xregexp#unicode, where you have proper "letter" classes.
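For comparison, the "letter class" idea can also be sketched without XRegExp, assuming a runtime that supports ES2018 Unicode property escapes (the u flag with \p{...}). This swaps in the native feature rather than XRegExp's API, and the name getWordsUnicode is hypothetical.

```javascript
// Assumption: the runtime supports ES2018 Unicode property escapes.
// \p{L} = any letter, \p{M} = combining marks, \p{N} = any number.
const UNICODE_SEPARATOR = /[^\p{L}\p{M}\p{N}_]+/u;

// Hypothetical helper name, chosen for this example.
function getWordsUnicode(doc) {
  if (Array.isArray(doc)) {
    return doc;
  }
  // Split on runs of non-word characters and drop empty edge entries.
  const words = doc.split(UNICODE_SEPARATOR).filter((w) => w.length > 0);
  // A Set keeps first-seen order while removing duplicates.
  return Array.from(new Set(words));
}

console.log(getWordsUnicode("Надежда за обич еп.36 Тест"));
```

Because the class is defined by Unicode category rather than by a hand-written range, the same pattern should cover other scripts without further changes.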
Agreed, that is probably the best solution.
Thanks for the pull request.
I'm no longer actively maintaining this repo. Try natural's Bayesian classifier for an alternative.