openai-java

The TikTokensUtil.tokens method does not support Chinese characters

Open Richard66666666 opened this issue 2 years ago • 2 comments

The TikTokensUtil.tokens method does not support Chinese characters. For example, "你好" should be 4 tokens, but TikTokensUtil.tokens(TikTokensUtil.ModelEnum.GPT_3_5_TURBO.getName(), "你好") returns only 2.
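
For context, a token count does not have to equal the character count. Tokenizers in the tiktoken family are byte-level BPE, so they start from the UTF-8 bytes of the text and merge them according to the encoding's vocabulary. The sketch below uses only the Java standard library (no tokenizer involved) to show the two bounds for this string; the class and method names are illustrative, not part of openai-java.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByteCount {
    // Number of UTF-8 bytes in the text; a byte-level BPE tokenizer starts from these bytes.
    public static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points in the text.
    public static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        // "\u4F60\u597D" is the string 你好, escaped so the source file encoding does not matter.
        String text = "\u4F60\u597D";
        System.out.println("code points: " + codePoints(text)); // prints "code points: 2"
        System.out.println("UTF-8 bytes: " + utf8Bytes(text));  // prints "UTF-8 bytes: 6"
    }
}
```

Each of the two characters takes three UTF-8 bytes, so a byte-level tokenizer could in principle report anywhere from 1 to 6 tokens for this string. The exact number depends on which encoding's merge table is applied, so it is worth checking that both counts being compared use the same encoding.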

Richard66666666 avatar Aug 08 '23 02:08 Richard66666666

https://github.com/knuddelsgmbh/jtokkit

aaronuu avatar Aug 17 '23 12:08 aaronuu

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncoding(EncodingType.P50K_BASE);
List<Integer> encoded = enc.encode("你好?");
encoded.size();

aaronuu avatar Aug 18 '23 00:08 aaronuu