CherubNLP
                                
                                
                                
Tokenizing "Hello-world"
Hi All,
I am comparing how CherubNLP tokenizes the string "Hello-world" with the following NLP libraries:
- OpenNLP
- Google Natural Language (Cloud)
- nltk (default)
- nltk (WordPunctTokenizer)
 
I am just trying to learn more about CherubNLP and the approach it follows. Is there any parameter that would make CherubNLP emit 3 tokens, like Google and OpenNLP's EnglishRuleBasedTokenizer?
CherubNLP
I get back a single token: Hello-world
OpenNLP
I am using the class OpenNLP.Tools.Tokenize.EnglishRuleBasedTokenizer, and it gave me 3 tokens:
- Hello
- "-"
- world
 
Google NLP
https://cloud.google.com/natural-language/ Google gives me 3 tokens.
nltk
nltk.word_tokenize("Hello-world")
['Hello-world']
nltk WordPunctTokenizer
nltk.tokenize.WordPunctTokenizer().tokenize("Hello-world")
['Hello', '-', 'world']
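For anyone who wants to reproduce the 3-token result without installing nltk, here is a minimal sketch using only the Python standard library. The pattern is the one nltk documents for WordPunctTokenizer; the helper name `word_punct_tokenize` is hypothetical and is not part of nltk or CherubNLP.

```python
import re

# Pattern documented for nltk's WordPunctTokenizer: a run of word characters,
# or a run of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Hypothetical helper: split text into word and punctuation tokens."""
    return WORD_PUNCT.findall(text)


print(word_punct_tokenize("Hello-world"))  # ['Hello', '-', 'world']
```

Because the hyphen is not a word character, it becomes its own token, which matches the Google and OpenNLP output above.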
                                    
                                    
                                    
                                
Which tokenizer are you using: RegexTokenizer or TreebankTokenizer?
https://github.com/SciSharp/CherubNLP/tree/master/CherubNLP/Tokenize
I tried TreebankTokenizer.
RegexTokenizer throws an ArgumentNull exception. I guess I am not using it the right way.
Can you run this unit test? https://github.com/SciSharp/CherubNLP/tree/master/CherubNLP.UnitTest/Tokenize
The RegexTokenizer was able to split "hello-world".
Unfortunately, it also split 50,000 in the sentence "this will cost 50,000" into 50 and 000.
Nevertheless, your efforts are commendable. I think I am asking too much at this moment.
Support for keeping numbers with commas together can be added easily. You are welcome to implement it and open a PR.
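One way to approach this is to try a number pattern before the generic word/punctuation split. The following is a minimal Python sketch of that idea, not CherubNLP's actual RegexTokenizer code; the pattern and the function name `tokenize` are assumptions for illustration, and the regex would still need to be ported to the C# tokenizer.

```python
import re

# Alternatives are tried left to right:
#   1. digits with optional comma-separated groups of three (e.g. 50,000),
#      accepted only when not immediately followed by another word character
#   2. a run of word characters
#   3. a run of punctuation (anything that is neither word character nor space)
TOKEN = re.compile(r"\d+(?:,\d{3})*(?!\w)|\w+|[^\w\s]+")


def tokenize(text):
    """Hypothetical helper: split on punctuation but keep 50,000 intact."""
    return TOKEN.findall(text)


print(tokenize("this will cost 50,000"))  # ['this', 'will', 'cost', '50,000']
print(tokenize("Hello-world"))            # ['Hello', '-', 'world']
```

Putting the number alternative first means the comma is only kept when it sits between digit groups; a comma after a word still becomes its own token.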
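This project is really practical, but there is so little documentation, and my English is not good. How can I learn about it in more detail? I have it running successfully, but I don't know how to get the results I want.
Please refer to the unit tests.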
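Thanks, boss. Could you share your contact information? I ran all the methods in the unit tests and most of them work, but I don't know exactly what functionality each one implements. My English is not good, so I can't really guess. I also could not download the wordvec_enu.bin file. I have looked at a lot of NLP code, and yours is the most powerful, the most complete, and the best fit for me. I am eager to understand all of it, but without documentation I can't figure it out in a short time.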