TextBlob Tokenization incorrectly splits "gonna" into "gon" and "na"

Open whosken opened this issue 9 years ago • 1 comment

Verified that this occurs in 0.10.0 :sob:

>>> import textblob
>>> textblob.TextBlob('gonna do this').words
WordList(['gon', 'na', 'do', 'this'])

Oct 23 '15 05:10 whosken

@whosken this is the standard NLTK (Treebank) tokenization: the Treebank tokenizer's contraction rules split "gonna" into "gon" and "na". You might wanna use NLTK directly for other options.

Feb 03 '17 07:02 ghost
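
For anyone who just needs "gonna" kept whole, here is a minimal sketch of a tokenizer that skips the Treebank contraction rules. It uses only the standard library. `simple_words` and `_WORD_RE` are hypothetical names made up for illustration and are not part of TextBlob or NLTK. It assumes plain ASCII apostrophes and makes no attempt to match Treebank behaviour for punctuation.

```python
import re

# Hypothetical helper, not part of TextBlob or NLTK.
# Matches runs of word characters, allowing internal apostrophes ("don't").
_WORD_RE = re.compile(r"\w+(?:'\w+)*")


def simple_words(text):
    """Return word tokens without applying Treebank contraction splitting."""
    return _WORD_RE.findall(text)


print(simple_words("gonna do this"))
# ['gonna', 'do', 'this']
```

If you would rather stay inside NLTK, its `WhitespaceTokenizer` and `RegexpTokenizer` classes in `nltk.tokenize` are the kind of "other options" meant above; neither applies the Treebank contraction rules.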