normalise
Warning: Be careful when using a custom tokenizer.
I tried the spaCy tokenizer, nltk word_tokenize, sacremoses MosesTokenizer, nltk TreebankWordTokenizer, and nltk TweetTokenizer.
For this example, "inch BBL, unquote, cost $29.95", they all output ['inch', 'BBL', ',', 'unquote', ',', 'cost', '$', '29.95', '.']. This output is incompatible with normalise: given these tokens, it predicts "inch B B L, unquote, cost $twenty nine point nine five.". Note that the currency symbol has been split from the amount, so the price is read as a bare decimal number instead of a sum of money.
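One possible workaround, sketched here and not tested against normalise itself, is to post-process the tokenizer output so that a currency symbol is rejoined with the number that follows it. The helper name merge_currency, the CURRENCY set, and the NUMBER pattern are all hypothetical and only for illustration. Whether normalise then reads the amount as money is an assumption to verify on your own data.

```python
import re

# Hypothetical helper, not part of normalise or any tokenizer library.
# Assumption: normalise handles a money amount better when the currency
# symbol and the number arrive as a single token such as "$29.95".
CURRENCY = {"$", "£", "€", "¥"}
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def merge_currency(tokens):
    """Rejoin a currency symbol with the numeric token that follows it."""
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        has_next = i + 1 < len(tokens)
        if tok in CURRENCY and has_next and NUMBER.fullmatch(tokens[i + 1]):
            # Glue "$" and "29.95" back together and skip both tokens.
            merged.append(tok + tokens[i + 1])
            i += 2
        else:
            merged.append(tok)
            i += 1
    return merged


# Token list exactly as reported above for the third-party tokenizers.
tokens = ['inch', 'BBL', ',', 'unquote', ',', 'cost', '$', '29.95', '.']
print(merge_currency(tokens))
# ['inch', 'BBL', ',', 'unquote', ',', 'cost', '$29.95', '.']
```

The merged list can then be passed to normalise in place of the raw tokenizer output. Other splits that a tokenizer introduces (for example around abbreviations or units) would need similar handling.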