openai-java
The TikTokensUtil.tokens method does not support Chinese characters
The TikTokensUtil.tokens method does not count tokens correctly for Chinese text. For example, "你好" is expected to be 4 tokens, but calling TikTokensUtil.tokens(TikTokensUtil.ModelEnum.GPT_3_5_TURBO.getName(), "你好") returns only 2.
https://github.com/knuddelsgmbh/jtokkit
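import java.util.List;
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
// P50K_BASE is not the encoding gpt-3.5-turbo uses (that is CL100K_BASE), so counts may differ.
Encoding enc = registry.getEncoding(EncodingType.P50K_BASE);
List<Integer> encoded = enc.encode("你好?");
encoded.size(); // number of tokens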
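One possible source of differing counts is that tiktoken-style encodings work on UTF-8 bytes rather than characters, so the number of tokens for "你好" depends on which byte sequences a given encoding has learned to merge. The sketch below uses only the JDK (no tokenizer library) to show how far apart the code point and byte counts are for this string. The class name Utf8Lengths and its two helper methods are hypothetical, and this says nothing about how TikTokensUtil counts internally.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Number of UTF-8 bytes in the text; byte-level BPE starts from these bytes.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points in the text.
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        // "你好" written with escapes so the source file encoding does not matter.
        String text = "\u4f60\u597d";
        System.out.println("code points: " + codePoints(text));
        System.out.println("UTF-8 bytes: " + utf8Bytes(text));
    }
}
```

With 2 code points spread over 6 bytes, a byte-level tokenizer could plausibly produce anywhere from 2 to 6 tokens for this input, so a count of 2 under one encoding and 4 under another is not necessarily a contradiction.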