Warning: Careful using a custom tokenizer...

Open PetrochukM opened this issue 4 years ago • 0 comments

I tried the spaCy tokenizer, nltk word_tokenize, sacremoses MosesTokenizer, nltk TreebankWordTokenizer, and nltk TweetTokenizer.

For the example "inch BBL, unquote, cost $29.95.", they all output ['inch', 'BBL', ',', 'unquote', ',', 'cost', '$', '29.95', '.']. This tokenization is incompatible with normalise: given these tokens, it predicts "inch B B L, unquote, cost $twenty nine point nine five.".
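The token list above can be reproduced without installing any of those libraries. The sketch below is only an illustration: `split_punct_tokenize` and `whitespace_tokenize` are hypothetical helpers, and the regex is a stand-in that happens to split this sentence the same way, not the actual logic of spaCy, nltk, or sacremoses. It shows that the punctuation-splitting style separates "$" from "29.95", while plain whitespace splitting keeps them in one token. Whether normalise handles the whitespace-split form correctly is not checked here.

```python
import re

text = "inch BBL, unquote, cost $29.95."


def split_punct_tokenize(s):
    """Hypothetical stand-in for a punctuation-splitting tokenizer.

    Decimal numbers stay whole, runs of word characters stay whole,
    and every other non-space symbol becomes its own token.
    """
    return re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", s)


def whitespace_tokenize(s):
    """Split on whitespace only, so punctuation stays attached to words."""
    return s.split()


# Same token list as reported above: '$' and '29.95' are separate tokens.
print(split_punct_tokenize(text))

# Here the currency symbol stays attached to the amount: '$29.95.'
print(whitespace_tokenize(text))
```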

PetrochukM, Sep 15 '20 02:09