BertTokenizers
Strings with Linux line endings break the tokenizer
This causes an infinite loop:
var _tokenizer = new BertUncasedBaseTokenizer();
var sentence = "Linux\nline\nendings";
var tokens = _tokenizer.Tokenize(sentence);
The problem is that the TokenizeSentence method does not include '\n' in its list of valid split tokens. Its whitespace handling is also incomplete: it recognizes a single space and a run of three spaces, but not two, for example.
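Until the split list is fixed in the library, one caller-side workaround is to normalize all whitespace before tokenizing. This is only a sketch against the same API as the snippet above; `Regex.Replace` is the standard .NET regular-expression API, and the assumption is that the tokenizer handles single spaces correctly:

```csharp
using System.Text.RegularExpressions;

var _tokenizer = new BertUncasedBaseTokenizer();
var sentence = "Linux\nline\nendings";

// Collapse every run of whitespace (including '\n' and '\r')
// into a single space, so TokenizeSentence only ever sees a
// separator it is known to handle.
var normalized = Regex.Replace(sentence, @"\s+", " ").Trim();

var tokens = _tokenizer.Tokenize(normalized);
```

This avoids the infinite loop at the cost of discarding the original line-break information, which is usually irrelevant to BERT tokenization anyway.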