Differences in Length Calculation Between English and Chinese Characters in the Knowledge Base
Self Checks
- [x] This is only for bug reports; if you would like to ask a question, please head to Discussions.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
- [x] Please do not modify this template :) and fill in all the required fields.
Dify version
v0.15.2
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
Hello,
While maintaining knowledge-base content through the API (specifically the create_by_text method, with content updated under the hierarchical_model doc form), I found that the lengths of Chinese and English content are calculated differently. I am not sure whether this is a bug, but I have prepared complete test cases for both scenarios.
With max_tokens set to 2000, Chinese content is truncated during upload whereas English content is not. Since I measure length at the character level with Python's len(), the content should in theory never be truncated unless a single Chinese character is being split and counted as more than one token.
✔️ Expected Behavior
sub_block = 'length 10\n'
len(sub_block)  # 10
full_block = sub_block * 190
print(f"full_block length: {len(full_block)}")  # full_block length: 1900

import requests

headers = {
    'Authorization': 'Bearer dataset-AWrQfx5NXyAwmI3VRhiGxaNP',
    'Content-Type': 'application/json'
}
base_url = 'http://ip/v1/datasets/7bb72914-d1ed-4422-957c-56bb25ac8170'


# parent_mode
def create_document_by_text(name):
    url = f'{base_url}/document/create_by_text'
    data = {
        'name': name,
        'text': full_block,
        'indexing_technique': 'high_quality',
        'doc_form': 'hierarchical_model',  # parent_mode
        'process_rule': {
            'mode': 'hierarchical',
            'rules': {
                'pre_processing_rules': [
                    {'id': 'remove_extra_spaces', 'enabled': False},
                    {'id': 'remove_urls_emails', 'enabled': False}
                ],
                'segmentation': {
                    'separator': '<s>',
                    'max_tokens': 2000
                },
                'parent_mode': 'paragraph',
                'subchunk_segmentation': {
                    'separator': '\n',
                    'max_tokens': 200
                }
            }
        }
    }
    response = requests.post(url, json=data, headers=headers)
    return response.json()


# Test the API
rc = create_document_by_text('test-doc')
document_id = rc['document']['id']
document_id
❌ Actual Behavior
sub_block = '一二三四五六七八九\n'
len(sub_block)  # 10
full_block = sub_block * 190
print(f"full_block length: {len(full_block)}")  # full_block length: 1900

import requests

headers = {
    'Authorization': 'Bearer dataset-AWrQfx5NXyAwmI3VRhiGxaNP',
    'Content-Type': 'application/json'
}
base_url = 'http://ip/v1/datasets/7bb72914-d1ed-4422-957c-56bb25ac8170'


# parent_mode
def create_document_by_text(name):
    url = f'{base_url}/document/create_by_text'
    data = {
        'name': name,
        'text': full_block,
        'indexing_technique': 'high_quality',
        'doc_form': 'hierarchical_model',  # parent_mode
        'process_rule': {
            'mode': 'hierarchical',
            'rules': {
                'pre_processing_rules': [
                    {'id': 'remove_extra_spaces', 'enabled': False},
                    {'id': 'remove_urls_emails', 'enabled': False}
                ],
                'segmentation': {
                    'separator': '<s>',
                    'max_tokens': 2000
                },
                'parent_mode': 'paragraph',
                'subchunk_segmentation': {
                    'separator': '\n',
                    'max_tokens': 200
                }
            }
        }
    }
    response = requests.post(url, json=data, headers=headers)
    return response.json()


# Test the API
rc = create_document_by_text('test-doc')
document_id = rc['document']['id']
document_id
The issue you're experiencing with the different handling of Chinese and English content in Dify's create_by_text method might be related to how tokens are calculated for different languages. In many natural language processing systems, tokenization can vary significantly between languages due to differences in character sets and word boundaries.
For English, tokenization often aligns closely with word boundaries, while for Chinese, tokenization might involve more complex segmentation due to the lack of spaces between characters. This can lead to differences in how the max_tokens parameter is applied, potentially causing Chinese content to be truncated more frequently than English content.
The max_tokens parameter is designed to limit the number of tokens processed, and if the tokenization method used by Dify treats Chinese characters differently, this could explain the truncation you're observing. Unfortunately, I couldn't find specific details in the Dify documentation about how tokenization is handled for different languages, including Chinese and English.
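For illustration only, counting tokens with an off-the-shelf byte-pair-encoding tokenizer (here the GPT-2 encoding via the tiktoken library, an assumption made purely for demonstration, not necessarily what Dify uses internally) shows how token counts need not track character counts the same way for the two scripts:

```python
# Illustrative only: the GPT-2 BPE encoding from the tiktoken library is used
# here purely as an example; it is not necessarily the tokenizer Dify uses.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

english = 'length 10\n' * 190           # 1900 characters
chinese = '一二三四五六七八九\n' * 190    # 1900 characters

for label, text in (("english", english), ("chinese", chinese)):
    # A subword/byte-level tokenizer may count a character as less than,
    # exactly, or more than one token, so token count != character count.
    print(label, len(text), len(enc.encode(text)))
```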
To address this, you might consider:
- Checking Tokenization Settings: Review any available settings or configurations in Dify that might affect tokenization, especially for Chinese text.
- Adjusting max_tokens: Experiment with different max_tokens values to see if a higher limit reduces truncation for Chinese content.
- Custom Tokenization: If possible, implement a custom tokenization strategy that better suits your needs for Chinese text (a client-side sketch is included below).
If these steps don't resolve the issue, further investigation into the specific tokenization library or method used by Dify might be necessary.
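As a purely client-side workaround sketch (the pre_chunk helper below is hypothetical, and the 2x tokens-per-character safety factor is an assumption rather than a documented Dify constant), the text could be pre-split before upload so that each piece stays within the token budget even if every character were counted as two tokens:

```python
# Hypothetical client-side helper (pre_chunk is not a Dify API): split the text
# so each piece stays within max_tokens even if every character were counted as
# roughly two tokens. The 2.0 safety factor is an assumption, not a Dify constant.
def pre_chunk(text: str, max_tokens: int = 2000, tokens_per_char: float = 2.0) -> list[str]:
    budget = int(max_tokens / tokens_per_char)   # maximum characters per chunk
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > budget:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks

pieces = pre_chunk('一二三四五六七八九\n' * 190)
print([len(p) for p in pieces])   # [1000, 900] with the defaults above
```

Each resulting piece could then be uploaded as its own document, or joined with the '<s>' parent separator already used in the process_rule above.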
@dosu A key point is that the number of tokens should be less than or equal to the number of characters, regardless of the tokenization strategy. My understanding is that the length after tokenization will never exceed the original character length; in other words, the character length should be the upper bound of the tokenized length.
In Dify, the tokenization process for Chinese and English text is handled differently, which can affect the number of tokens calculated. Specifically, the BaichuanTokenizer class in Dify calculates tokens by counting Chinese characters and English words separately. The method _get_num_tokens adds the number of Chinese characters to 1.3 times the number of English words to estimate the total number of tokens [1].
This means that for English text, the token count might exceed the character count due to the multiplication factor applied to English words. For Chinese text, each character is typically counted as a single token, which aligns more closely with the character count.
Therefore, the tokenized length can sometimes exceed the original character length, especially for English text, due to the way tokens are estimated. This is why the character length is not always the upper limit of the tokenized length.
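A rough sketch of the estimation logic described above (illustrative only; the regular expressions are assumptions and this is not Dify's actual code):

```python
import re

def estimate_num_tokens(text: str) -> int:
    # Rough sketch of the estimation described above: count Chinese characters
    # directly and weight English words by 1.3. Not Dify's exact implementation.
    chinese_chars = len(re.findall(r'[\u4e00-\u9fff]', text))
    english_words = len(re.findall(r'[A-Za-z]+', text))
    return int(chinese_chars + 1.3 * english_words)

print(estimate_num_tokens('一二三四五六七八九\n' * 190))  # 1710: one per Chinese character
print(estimate_num_tokens('length 10\n' * 190))           # 247: 190 English words * 1.3
```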
@dosu Interestingly, the problem is the opposite of what one might expect. In my test, 1900 Chinese characters ('一二三四五六七八九\n' * 190) were truncated. As shown in my post, the segment length retained by the Dify system was 1249, which suggests the effective length was computed at roughly 1.6 times the original character count. This is quite puzzling.
The behavior you're observing, where 1900 Chinese characters are truncated to 1249, suggests that the tokenization process in Dify might be applying a multiplier or additional logic that results in a higher token count than expected. While the _get_num_tokens method in the BaichuanTokenizer class calculates tokens by counting Chinese characters directly, there might be other factors or methods influencing this outcome.
One possibility is that there are additional processing steps or conditions in the Dify codebase that affect the token count for Chinese text, leading to the observed 1.6x increase. Unfortunately, I couldn't find specific details in the repository about such logic or conditions that would explain this exact behavior.
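For what it's worth, the numbers reported in this thread are internally consistent with such a per-character multiplier; this is simple arithmetic on the reported values, not a confirmed explanation:

```python
max_tokens = 2000
retained_chars = 1249                 # parent-chunk length reported by Dify
print(max_tokens / retained_chars)    # ~1.60 tokens per Chinese character
print(int(max_tokens / 1.6))          # 1250: roughly the observed cut-off
```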
To further investigate, you might want to:
- Review Related Code: Check for any additional methods or classes interacting with BaichuanTokenizer that might influence token count calculations.
- Examine Configuration: Look for any configuration settings that might affect tokenization, especially for Chinese text.
- Debugging: Consider adding logging or debugging statements in the tokenization process to trace how the token count is being calculated.
These steps might help identify the source of the discrepancy in token count for Chinese text.
Currently, I am indeed very puzzled. I have tested two more cases. In both, I used the bge-large-zh model for the tokenization test, and in both the text was tokenized at the character level. However, after uploading to the Dify knowledge base, one case had a normal token count of 1900, while the other surged to 4180. Has anyone encountered a similar issue? (A local tokenizer check is sketched after the examples below.)
Correct example:
Bad example:
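For reference, a minimal sketch of how the character-level token count could be checked locally, assuming the BAAI/bge-large-zh checkpoint on Hugging Face (the exact model revision used in the test above is not specified):

```python
# Sketch of a local check, assuming the Hugging Face checkpoint "BAAI/bge-large-zh";
# the exact model revision used in the test above is not specified in this thread.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-zh")
text = '一二三四五六七八九\n' * 190

token_ids = tokenizer.encode(text, add_special_tokens=False)
# With per-character tokenization the token count tracks the character count
# (whitespace such as '\n' may be dropped by the tokenizer).
print(len(text), len(token_ids))
```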
Hi, @tigflanker. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.
Issue Summary
- Discrepancy in tokenization between English and Chinese content uploaded via create_by_text.
- BaichuanTokenizer calculates tokens differently for Chinese, causing unexpected truncation.
- You noted a 1.6x increase in token count for Chinese text and provided test cases.
- Seeking further insights or similar experiences from others.
Next Steps
- Is this issue still relevant to the latest version of the Dify repository? If so, please comment to keep the discussion open.
- Otherwise, this issue will be automatically closed in 15 days.
Thank you for your understanding and contribution!