classifier-reborn
In some languages, such as Chinese, words no longer than two characters are very common, so I think this is a very strong assumption (and sometimes a wrong one for languages other than English).
https://github.com/jekyll/classifier-reborn/blob/4e807496e69cbb33ce2663564ef287f167915879/lib/classifier-reborn/extensions/hasher.rb#L30
We could probably make this configurable. I’ll happily review a PR for this.
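For illustration, here is a minimal sketch of what a configurable length threshold could look like. It is not the library's actual code: `word_counts` and its `min_length` parameter are hypothetical names, and the default of 3 is an assumption chosen to mirror the "length not bigger than 2" behavior described above.

```ruby
# Hypothetical helper, not part of classifier-reborn: counts words and
# skips any word shorter than min_length characters.
# The default of 3 is an assumption that mimics dropping words of
# length 2 or less.
def word_counts(words, min_length: 3)
  counts = Hash.new(0)
  words.each do |word|
    # String#length counts characters, so "分类" has length 2.
    next if word.length < min_length
    counts[word] += 1
  end
  counts
end

english = %w[an important classifier]
chinese = %w[分类 器 很 重要]

word_counts(english)                 # drops only "an"
word_counts(chinese)                 # drops every word
word_counts(chinese, min_length: 1)  # keeps all four words
```

With the default threshold, the Chinese sample loses every token, which is the problem reported here. Lowering the threshold per language would avoid that.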
@Christophy We just merged https://github.com/jekyll/classifier-reborn/pull/162, which allows for custom tokenizers. Could you let us know if this helps?