ml-classify-text-js
ml-classify-text-js copied to clipboard
Tokenize spaces when explicitly requested and reduce iterations in occurrence counting loop
The original tokenize
method assembled a sequence
of words using the space character as a delimiter, then counted tokens using split()
on the space character. This fails if there are spaces in the words passed to the tokenizer. This is not normally the case when used with a string or array of words, but in the case where you want to use single characters as your tokens, spaces should be supported. This allows cosine comparisons to be done on character arrays and ensures results match other cosine distance implementations.
The new test added demonstrates this (passes with this new code, fails with the original).
Additionally, the original loop that counts occurrences of each nGram looped through all words in every segment. This could be quite inefficient. This version stops looping once nGramMax tokens have been counted and is up to 10x faster when comparing arrays of around 100 words.
This is my first time using github to fork and make a pull request, so please let me know if I have done anything wrong or if you have any questions about the changes.
Hi @colin-brown, just updating this to let you know your pull request has been noted and is very appreciated! I'm currently working on a new version of the library with some other updates to the tokenizer and will be reviewing your contribution and try to merge this with the new version as soon as possible.
If I have any questions while reviewing and/or merging with my other updates, I will let you know in this thread!
Once again, thank you for your contribution, it's incredibly fun to see how much interest this project has generated, something I didn't expect while building the first version!
Hi @andreekeberg ah great, let me know if I can be of any assistance. The unit tests I added should tell you whether the support for spaces is working.