BertTokenizers: Strings with Linux line endings break the tokenizer

Open palenshus opened this issue 1 year ago • 0 comments

This causes an infinite loop:

var _tokenizer = new BertUncasedBaseTokenizer();
var sentence = "Linux\nline\nendings";
var tokens = _tokenizer.Tokenize(sentence);

The problem is that the TokenizeSentence method doesn't include '\n' among the strings it splits on. That list is also inconsistent in other ways: it has one space and three spaces, but not two, for example.
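Until that is fixed in the library, one possible caller-side workaround is to normalize whitespace before calling Tokenize. The sketch below is not a patch to BertTokenizers. WhitespaceNormalizer is a hypothetical helper name, and it assumes, as described above, that a single space is handled correctly.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of BertTokenizers.
public static class WhitespaceNormalizer
{
    // Collapse every run of whitespace (spaces, tabs, "\n", "\r\n")
    // into a single space and trim the ends.
    public static string Normalize(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        var sentence = "Linux\nline\nendings";
        var normalized = WhitespaceNormalizer.Normalize(sentence);
        Console.WriteLine(normalized); // prints "Linux line endings"

        // Assumed usage with the tokenizer from the report above:
        // var tokens = new BertUncasedBaseTokenizer().Tokenize(normalized);
    }
}
```

This also covers the two-space case, since any run of whitespace becomes one space. Whether it avoids the infinite loop for every input has not been verified here.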

Jun 23 '23 00:06 palenshus
