BertTokenizers
[BUG] BertUncasedBaseTokenizer ran forever with input "SixGe1−xH"
The tokenizer does not work with the input text "SixGe1−xH".
I have looked into the source code. The while loop inside TokenizeSubwords never terminates for this input.
It can be reproduced by running the following unit test.
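[Fact]
public void Tokenize_sentence()
{
    // The dash in this string is a non-ASCII character, not an ASCII hyphen-minus.
    var sentence = "SixGe1−xH";

    // This call never returns: the test hangs here instead of failing.
    var tokens = _tokenizer.Tokenize(sentence);

    Assert.Equal(3, tokens.Count);
}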
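
For comparison, here is a minimal sketch of a greedy longest-match subword loop that cannot hang. It is not the library's actual TokenizeSubwords. The class name WordPieceSketch, the method signature, the toy vocabulary and the "[UNK]" fallback are all assumptions made for illustration. The loop tracks a start index that moves forward on every match, and it returns the unknown token as soon as no vocabulary entry fits, so every iteration either makes progress or exits.

```csharp
using System;
using System.Collections.Generic;

public static class WordPieceSketch
{
    // Hypothetical helper for illustration only; not the BertTokenizers implementation.
    // Splits a single word into subword pieces by greedy longest match.
    public static List<string> TokenizeSubwords(
        string word, ISet<string> vocabulary, string unknownToken = "[UNK]")
    {
        var tokens = new List<string>();
        int start = 0;

        while (start < word.Length)
        {
            int end = word.Length;
            bool found = false;
            string piece = "";

            // Try the longest candidate first, then shrink it from the right.
            while (end > start)
            {
                piece = word.Substring(start, end - start);
                if (start > 0)
                {
                    piece = "##" + piece; // continuation pieces carry the "##" prefix
                }
                if (vocabulary.Contains(piece))
                {
                    found = true;
                    break;
                }
                end--;
            }

            if (!found)
            {
                // Nothing in the vocabulary matches at this position:
                // give up on the whole word instead of looping again.
                return new List<string> { unknownToken };
            }

            tokens.Add(piece);
            start = end; // end > start here, so start strictly increases
        }

        return tokens;
    }
}
```

With a toy vocabulary such as { "un", "##aff", "##able" }, calling WordPieceSketch.TokenizeSubwords("unaffable", vocabulary) yields the three pieces un, ##aff and ##able, while a word containing a character that is absent from the vocabulary, such as "SixGe1−xH", comes back as a single [UNK] rather than blocking.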