BertTokenizers Words surrounded by backwards quotation marks causing inaccurate tokenization results

Open rghavimi opened this issue 2 years ago • 0 comments

It seems that backwards quotation marks (“end“) in the text cause different tokenization results compared to Python implementations. This is the only inconsistency I've run into so far. Curious if anyone else has seen similar issues.

Example: “ends -> tokenizes to ##end and ##s instead of ##ends
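
For comparison, here is a minimal Python sketch of the punctuation-splitting step that BERT-style basic tokenizers typically run before WordPiece. It is not the code of any particular library; the helper names `is_punctuation` and `split_on_punctuation` are made up for illustration. It assumes punctuation is detected via the Unicode general category, under which the backwards quotation mark (U+201C, category Pi) counts as punctuation and is split off as its own piece. Whether a difference in this step explains the mismatch above is an assumption, not something confirmed.

```python
import unicodedata
from typing import List


def is_punctuation(ch: str) -> bool:
    """Return True for ASCII symbols and any character in a Unicode 'P*' category."""
    cp = ord(ch)
    # Non-alphanumeric printable ASCII ranges are treated as punctuation.
    if (33 <= cp <= 47) or (58 <= cp <= 64) or (91 <= cp <= 96) or (123 <= cp <= 126):
        return True
    # U+201C is category "Pi" and U+201D is "Pf", so both match here.
    return unicodedata.category(ch).startswith("P")


def split_on_punctuation(text: str) -> List[str]:
    """Split text so that every punctuation character becomes its own piece."""
    pieces: List[str] = []
    current: List[str] = []
    for ch in text:
        if is_punctuation(ch):
            if current:
                pieces.append("".join(current))
                current = []
            pieces.append(ch)
        else:
            current.append(ch)
    if current:
        pieces.append("".join(current))
    return pieces


print(unicodedata.category("\u201c"))        # Pi
print(split_on_punctuation("\u201cends"))     # ['“', 'ends']
print(split_on_punctuation("\u201cend\u201c"))  # ['“', 'end', '“']
```

Under this sketch the quotation mark is separated before any subword splitting, so the word that follows is handed to WordPiece as a standalone word rather than as the continuation of a longer token.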

Feb 16 '23 02:02 rghavimi
