BertTokenizers

[BUG] BertUncasedBaseTokenizer ran forever with input "SixGe1−xH"

Open darren-zdc opened this issue 8 months ago • 2 comments

The tokenizer hangs on the input text "SixGe1−xH". Note that the "−" in this string is the Unicode minus sign (U+2212), not an ASCII hyphen.

I looked into the source code: the while loop inside TokenizeSubwords never terminates for this input.

It can be reproduced by running the following unit test:

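[Fact]
public void Tokenize_sentence()
{
    var sentence = "SixGe1−xH";

    // _tokenizer is a field of the test class (a BertUncasedBaseTokenizer instance).
    // This call never returns: it hangs inside TokenizeSubwords.
    var tokens = _tokenizer.Tokenize(sentence);
    Assert.Equal(3, tokens.Count);
}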
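
For comparison, here is a minimal sketch of a greedy longest-match subword loop that cannot spin forever. It is not the library's TokenizeSubwords; the class name `WordPieceSketch`, the method signature, and the `"[UNK]"` fallback are assumptions made for illustration. The point is that the loop is driven by an index that strictly increases on every iteration, and that a word with no matching piece is mapped to the unknown token instead of being retried.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only, not the BertTokenizers implementation.
public static class WordPieceSketch
{
    // Splits a single word into subword pieces using greedy longest-match.
    // Continuation pieces are looked up with a "##" prefix.
    public static List<string> TokenizeSubwords(string word, ISet<string> vocabulary)
    {
        var tokens = new List<string>();
        var start = 0;

        while (start < word.Length)
        {
            var found = false;
            var end = word.Length;

            // Try the longest candidate first, then shrink it one character at a time.
            while (end > start)
            {
                var piece = word.Substring(start, end - start);
                if (start > 0)
                {
                    piece = "##" + piece;
                }

                if (vocabulary.Contains(piece))
                {
                    tokens.Add(piece);
                    found = true;
                    break;
                }

                end--;
            }

            if (!found)
            {
                // No piece matches at this position: give up on the whole word
                // rather than looping again on the same remaining text.
                return new List<string> { "[UNK]" };
            }

            // A match implies end > start, so start strictly increases
            // and the outer loop is guaranteed to terminate.
            start = end;
        }

        return tokens;
    }
}
```

With a `HashSet<string>` as the vocabulary, lookups use the default ordinal string comparison, so a character such as "−" is only matched if that exact character is in the vocabulary. Calling `WordPieceSketch.TokenizeSubwords("SixGe1−xH", vocabulary)` therefore always returns, either with a list of pieces or with a single `"[UNK]"`.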

darren-zdc · Jun 20 '24 20:06