gpt2-tokenizer-java
Incorrect encoding result for Chinese characters
Example: 你好,我叫凯文。
Expected: [19526, 254, 25001, 121, 171, 120, 234, 22755, 239, 20998, 104, 49035, 107, 23877, 229, 16764]
Actual: [8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423, 8423]
Yes, the output contains far too many 8423s.
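One observation that may help narrow this down: the actual output has 24 ids, while the expected output has 16. The example string is 8 characters long, and each of them should take 3 bytes in UTF-8, so 24 looks like one id per byte rather than one per merged token. The sketch below only checks that byte count; it does not call the tokenizer. It assumes the comma in the example is the fullwidth U+FF0C (the expected ids 171, 120, 234 would fit its bytes EF BC 8C in the standard GPT-2 byte table), and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByteCount {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The example sentence written with Unicode escapes so the source
        // file encoding cannot affect the result.
        // Assumption: the comma is the fullwidth U+FF0C, not ASCII ','.
        String text = "\u4F60\u597D\uFF0C\u6211\u53EB\u51EF\u6587\u3002";

        System.out.println("chars: " + text.length());          // chars: 8
        System.out.println("UTF-8 bytes: " + utf8Length(text)); // UTF-8 bytes: 24
    }
}
```

If the byte count matches the number of ids returned, a reasonable place to look next would be the byte-to-symbol mapping or the vocabulary lookup that follows it, since every byte appears to end up at the same id. That is an inference from the numbers above, not something confirmed in the library's code.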