
Chinese character encoding result incorrect

Open: an9xyz opened this issue 2 years ago • 1 comment

Example: 你好,我叫凯文。
Expected: [19526, 254, 25001, 121, 171, 120, 234, 22755, 239, 20998, 104, 49035, 107, 23877, 229, 16764]
Actual: [8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423]

an9xyz, Feb 09 '23 08:02

Yes, too many 8423s.

xixixi2000, Apr 14 '23 12:04
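
The thread does not identify a cause for the repeated 8423. One step that may be worth checking is the byte-level preprocessing that GPT-2 tokenization relies on: the text is encoded as UTF-8, and each byte is mapped to a printable character before any BPE merges are looked up. The sketch below shows that step in isolation. It is not this library's code, and the class and method names (`ByteLevelSketch`, `toByteLevel`) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of gpt2-tokenizer-java.
public class ByteLevelSketch {
    // Lookup from byte value (0..255) to the printable character GPT-2 uses for it.
    private static final char[] BYTE_TO_CHAR = buildTable();

    // Bytes in these ranges are represented by the character with the same code point.
    private static boolean keepsOwnCodePoint(int b) {
        return (b >= 33 && b <= 126) || (b >= 161 && b <= 172) || (b >= 174 && b <= 255);
    }

    private static char[] buildTable() {
        char[] table = new char[256];
        int extra = 0;
        for (int b = 0; b < 256; b++) {
            // All other bytes are assigned code points 256, 257, ... in ascending byte order.
            table[b] = keepsOwnCodePoint(b) ? (char) b : (char) (256 + extra++);
        }
        return table;
    }

    // Encode as UTF-8 explicitly, then map every byte through the table.
    public static String toByteLevel(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(utf8.length);
        for (byte b : utf8) {
            sb.append(BYTE_TO_CHAR[b & 0xFF]); // mask so bytes >= 0x80 are not negative
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u4f60\u597d" is 你好; each character is three UTF-8 bytes.
        String mapped = toByteLevel("\u4f60\u597d");
        System.out.println(mapped.length()); // 6
    }
}
```

If a tokenizer skips this mapping, relies on the platform default charset instead of UTF-8, or indexes the table with a signed byte, non-ASCII input will not match any vocabulary entry, so comparing the intermediate string against this sketch could help narrow the problem down.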