Spark's large language model token calculation error
Self Checks
- [X] This is only for bug report, if you would like to ask a question, please head to Discussions.
- [X] I have searched for existing issues, including closed ones.
- [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [X] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
- [X] Please do not modify this template :) and fill in all the required fields.
Dify version
0.7.2
Cloud or Self Hosted
Self Hosted (Source)
Steps to reproduce
- Step 1: Create a chat app.
- Step 2: Send a prompt asking the model to write an 800-word essay about my mother.
- Step 3: Check the number of tokens reported for the response.
✔️ Expected Behavior
The reported token count should be several hundred, consistent with an 800-word response.
❌ Actual Behavior
The number of tokens calculated by the platform is only 18
Hello @13006319571! I'm here to assist you with any bugs, questions, or contributions you have for the repository. Let's work together to solve any issues you're encountering.
I found a similar unsolved discussion that might be relevant to your issue: Max Token Limits Incorrect?, which was last updated on 2024-08-14T04:25:50Z [1].
Regarding the token calculation in Dify, the platform uses an approximation based on the GPT-2 tokenizer. The method get_num_tokens sums up the token counts for each text in the input list by calling _get_num_tokens_by_gpt2. Here is the relevant code snippet:
```python
def get_num_tokens(self, model: str, credentials: dict, texts: list[str]) -> int:
    """
    Approximate number of tokens for given messages using GPT2 tokenizer

    :param model: model name
    :param credentials: model credentials
    :param texts: texts to embed
    :return:
    """
    return sum(self._get_num_tokens_by_gpt2(text) for text in texts)
```
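For a rough sense of scale, here is a minimal standalone sketch of that approximation. It assumes the Hugging Face `gpt2` tokenizer behaves like the bundled tokenizer behind `_get_num_tokens_by_gpt2`; `approx_num_tokens` is a hypothetical helper for illustration, not Dify code:

```python
# Minimal sketch, assuming the Hugging Face "gpt2" tokenizer approximates
# the one Dify uses internally for _get_num_tokens_by_gpt2.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def approx_num_tokens(texts: list[str]) -> int:
    # Mirror get_num_tokens: sum the per-text GPT-2 token counts.
    return sum(len(tokenizer.encode(text)) for text in texts)

# Even a one-line prompt is already on the order of ten tokens, so a reported
# total of 18 for an 800-word essay cannot come from this path working as intended.
print(approx_num_tokens(["Write an 800-word essay about my mother."]))
```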
Additionally, Dify provides a method for token calculation in the BaichuanTokenizer class, which can be used for texts containing both Chinese and English characters. This method estimates the number of tokens by counting Chinese characters and English words, applying a specific formula for the estimation:
```python
import re


class BaichuanTokenizer:
    @classmethod
    def count_chinese_characters(cls, text: str) -> int:
        return len(re.findall(r'[\u4e00-\u9fa5]', text))

    @classmethod
    def count_english_vocabularies(cls, text: str) -> int:
        # remove all non-alphanumeric characters but keep spaces and other symbols like !, ., etc.
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        # count the number of words not characters
        return len(text.split())

    @classmethod
    def _get_num_tokens(cls, text: str) -> int:
        # tokens = number of Chinese characters + number of English words * 1.3
        # (for estimation only, subject to actual return)
        return int(cls.count_chinese_characters(text) + cls.count_english_vocabularies(text) * 1.3)
```
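As a quick illustration of that formula, using a made-up mixed Chinese/English string (not text from this issue):

```python
text = "我的妈妈 is my mother"
print(BaichuanTokenizer.count_chinese_characters(text))    # 4 Chinese characters
print(BaichuanTokenizer.count_english_vocabularies(text))  # 3 English words
print(BaichuanTokenizer._get_num_tokens(text))             # int(4 + 3 * 1.3) = 7
```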
These methods approximate the token count by considering all elements of the input, including any tool or function calls that may be present [2][3][4].
It seems there might be an issue with the token calculation in your specific case. You might want to check if the text is being processed correctly and if the tokenizer is applied as expected.
To continue talking to Dosu, mention @dosu.
It is indeed incorrect.
There may be an error here: https://github.com/langgenius/dify/blob/0cef25ef8c3892e50fb47395a16b8096e7685511/api/core/model_runtime/model_providers/spark/llm/llm.py#L50
Related issue: https://github.com/langgenius/dify/issues/6887