
Differences in Length Calculation Between English and Chinese Characters in the Knowledge Base

Open tigflanker opened this issue 10 months ago • 7 comments

Self Checks

  • [x] This is only for bug report, if you would like to ask a question, please head to Discussions.
  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [x] [FOR CHINESE USERS] Please submit issues in English, otherwise they will be closed. Thank you! :)
  • [x] Please do not modify this template :) and fill in all the required fields.

Dify version

v0.15.2

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Hello,

While maintaining knowledge-base content through the API (specifically using the create_by_text method and uploading documents with doc_form set to hierarchical_model), I noticed that the lengths of Chinese and English content are calculated differently. I am not sure whether this is a bug, but I have prepared complete test cases for both scenarios.

With max_tokens set to 2000, Chinese content gets truncated during upload, whereas English content does not. Since I measure length at the character level with Python's len() function, the text should not be truncated unless a Chinese character is being counted as more than one unit.

✔️ Expected Behavior

sub_block = 'length 10\n'
len(sub_block) # 10

full_block = sub_block*190
print(f"full_block lenght:{len(full_block)}") # full_block lenght:1900

import requests

headers = {
    'Authorization': 'Bearer dataset-AWrQfx5NXyAwmI3VRhiGxaNP',
    'Content-Type': 'application/json'
}

base_url = f'http://ip/v1/datasets/7bb72914-d1ed-4422-957c-56bb25ac8170'

# parent_mode
def create_document_by_text(name):
    url = f'{base_url}/document/create_by_text'
    data = {
        'name': name,
        'text': full_block,
        'indexing_technique': 'high_quality',
        'doc_form': 'hierarchical_model',  # parent_mode
        'process_rule': {
            'mode': 'hierarchical',
            'rules': {
                "pre_processing_rules": [
                    {"id": "remove_extra_spaces", "enabled": False},
                    {"id": "remove_urls_emails", "enabled": False}
                ],
                'segmentation': {
                    'separator': '<s>',
                    'max_tokens': 2000
                },
                'parent_mode': 'paragraph',  # parent chunks at paragraph level (vs. 'full-doc')
                'subchunk_segmentation': {
                    'separator': '\n',
                    'max_tokens': 200,
                }
            }
        }
    }

    response = requests.post(url, json=data, headers=headers)
    return response.json()

# Test API
rc = create_document_by_text('test-doc')
document_id = rc['document']['id']
document_id

Image

❌ Actual Behavior

sub_block = '一二三四五六七八九\n'
len(sub_block) # 10

full_block = sub_block*190
print(f"full_block lenght:{len(full_block)}") # full_block lenght:1900

(The upload code is identical to the Expected Behavior example above; only full_block differs, now containing the Chinese text defined here. The same create_document_by_text('test-doc') call was used to upload it.)

Image

tigflanker avatar Feb 06 '25 03:02 tigflanker

The issue you're experiencing with the different handling of Chinese and English content in Dify's create_by_text method might be related to how tokens are calculated for different languages. In many natural language processing systems, tokenization can vary significantly between languages due to differences in character sets and word boundaries.

For English, tokenization often aligns closely with word boundaries, while for Chinese, tokenization might involve more complex segmentation due to the lack of spaces between characters. This can lead to differences in how the max_tokens parameter is applied, potentially causing Chinese content to be truncated more frequently than English content.

The max_tokens parameter is designed to limit the number of tokens processed, and if the tokenization method used by Dify treats Chinese characters differently, this could explain the truncation you're observing. Unfortunately, I couldn't find specific details in the Dify documentation about how tokenization is handled for different languages, including Chinese and English.
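
As a small illustration of that mechanism, the snippet below compares character counts with token counts for the two test blocks using the third-party tiktoken library's GPT-2 encoding. The choice of tokenizer is an assumption made purely for demonstration; whether Dify's splitter uses a comparable byte-level BPE tokenizer is not confirmed here.

import tiktoken  # third-party; used only as an illustrative byte-level BPE tokenizer

enc = tiktoken.get_encoding("gpt2")

chinese = '一二三四五六七八九\n' * 190   # 1900 characters, len() == 1900
english = 'length 10\n' * 190            # 1900 characters, len() == 1900

print('zh:', len(chinese), 'chars ->', len(enc.encode(chinese)), 'tokens')
print('en:', len(english), 'chars ->', len(enc.encode(english)), 'tokens')
# Under byte-level BPE a CJK character typically costs 2-3 tokens, so the Chinese
# block's token count can exceed 2000 even though len() reports only 1900.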

To address this, you might consider:

  1. Checking Tokenization Settings: Review any available settings or configurations in Dify that might affect tokenization, especially for Chinese text.

  2. Adjusting max_tokens: Experiment with different max_tokens values to see if a higher limit reduces truncation for Chinese content.

  3. Custom Tokenization: If possible, implement a custom tokenization strategy that better suits your needs for Chinese text (see the sketch below).

If these steps don't resolve the issue, further investigation into the specific tokenization library or method used by Dify might be necessary.
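
Building on suggestion 3, one rough client-side workaround is to pre-split the text before calling create_by_text, sizing each piece by a pessimistic tokens-per-character ratio instead of len() alone. The helper name and the 3.0 ratio below are illustrative assumptions, not Dify constants or APIs.

# Hypothetical client-side pre-splitter: keeps each piece under the server-side
# max_tokens limit even if every CJK character costs up to ~3 tokens (assumption).
def split_for_upload(text: str, max_tokens: int = 2000, tokens_per_char: float = 3.0) -> list[str]:
    max_chars = max(1, int(max_tokens / tokens_per_char))
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

full_block = '一二三四五六七八九\n' * 190
pieces = split_for_upload(full_block)
print([len(p) for p in pieces])  # each piece would then be sent as its own 'text' payload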


dosubot[bot] avatar Feb 06 '25 03:02 dosubot[bot]

@dosu A key point is that the number of tokens should be less than or equal to the number of characters, regardless of the tokenization strategy. My understanding is that the length after tokenization will always be at most the original character length; in other words, the character length should be an upper bound on the tokenized length.

tigflanker avatar Feb 06 '25 03:02 tigflanker

In Dify, the tokenization process for Chinese and English text is handled differently, which can affect the number of tokens calculated. Specifically, the BaichuanTokenizer class in Dify calculates tokens by counting Chinese characters and English words separately. The method _get_num_tokens adds the number of Chinese characters to 1.3 times the number of English words to estimate the total number of tokens [1].

This means that for English text, the token count might exceed the character count due to the multiplication factor applied to English words. For Chinese text, each character is typically counted as a single token, which aligns more closely with the character count.

Therefore, the tokenized length can sometimes exceed the original character length, especially for English text, due to the way tokens are estimated. This is why the character length is not always the upper limit of the tokenized length.
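
For reference, here is a minimal sketch of the estimation heuristic as described above; the regexes and rounding are assumptions, and this is not a copy of Dify's BaichuanTokenizer.

import re

def estimate_tokens(text: str) -> int:
    # Heuristic described above: Chinese characters count as 1, English words as 1.3.
    chinese_chars = len(re.findall(r'[\u4e00-\u9fff]', text))
    english_words = len(re.findall(r'[A-Za-z]+', text))
    return int(chinese_chars + english_words * 1.3)

print(estimate_tokens('一二三四五六七八九\n' * 190))  # 1710, i.e. below the 2000 limit
print(estimate_tokens('length 10\n' * 190))           # 247
# Under this heuristic neither test block would reach max_tokens = 2000, so this
# formula alone does not account for the truncation reported here.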


dosubot[bot] avatar Feb 06 '25 03:02 dosubot[bot]

@dosu Interestingly, the problem is the opposite of what one might expect. In my test, 1900 Chinese characters ("一二三四五六七八九\n" * 190) were truncated. As shown in the screenshot in my earlier post, the segment length retained by Dify was 1249 characters. It appears the effective length was calculated at roughly 1.6 times the character count, which is quite puzzling.
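
A quick back-of-the-envelope check of that 1.6x figure, using only the numbers above and assuming truncation happens exactly when the estimated token count reaches max_tokens = 2000:

max_tokens = 2000   # segmentation limit from the repro
chars_kept = 1249   # segment length Dify actually retained for the Chinese block
print(round(max_tokens / chars_kept, 2))  # ~1.6 tokens per Chinese character
# 1249 chars * ~1.6 tokens/char is roughly 2000 tokens, so the cut-off is consistent
# with each Chinese character being counted as more than one token.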

tigflanker avatar Feb 06 '25 03:02 tigflanker

The behavior you're observing, where 1900 Chinese characters are truncated to 1249, suggests that the tokenization process in Dify might be applying a multiplier or additional logic that results in a higher token count than expected. While the _get_num_tokens method in the BaichuanTokenizer class calculates tokens by counting Chinese characters directly, there might be other factors or methods influencing this outcome.

One possibility is that there are additional processing steps or conditions in the Dify codebase that affect the token count for Chinese text, leading to the observed 1.6x increase. Unfortunately, I couldn't find specific details in the repository about such logic or conditions that would explain this exact behavior.

To further investigate, you might want to:

  1. Review Related Code: Check for any additional methods or classes interacting with BaichuanTokenizer that might influence token count calculations.
  2. Examine Configuration: Look for any configuration settings that might affect tokenization, especially for Chinese text.
  3. Debugging: Consider adding logging or debugging statements in the tokenization process to trace how the token count is being calculated.

These steps might help identify the source of the discrepancy in token count for Chinese text.


dosubot[bot] avatar Feb 06 '25 03:02 dosubot[bot]

Currently, I am quite puzzled. I have tested two more cases. In both, I used the bge-large-zh model for tokenization tests, and both texts were tokenized at the character level. However, after uploading to the Dify knowledge base, one case had the expected token count of 1900, while the other surged to 4180. Has anyone encountered a similar issue?

Right example:

Image

Bad example:

Image
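
For anyone who wants to reproduce the character-level check described above, here is a hedged sketch using the Hugging Face tokenizer for bge-large-zh. The model ID BAAI/bge-large-zh-v1.5 is an assumption; substitute the exact variant configured in your instance.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')  # assumed model ID

text = '一二三四五六七八九\n' * 190
ids = tok.encode(text)  # includes [CLS]/[SEP] special tokens; newlines are dropped
print(len(text), len(ids))
# BERT-style Chinese vocabularies treat each CJK character as a single token, so the
# count here is roughly one token per character -- nowhere near the 4180 reported
# after upload.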

tigflanker avatar Feb 06 '25 06:02 tigflanker

Hi, @tigflanker. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary

  • Discrepancy in tokenization between English and Chinese using create_by_text.
  • BaichuanTokenizer calculates tokens differently for Chinese, causing unexpected truncation.
  • You noted a 1.6x increase in token count for Chinese text and provided test cases.
  • Seeking further insights or similar experiences from others.

Next Steps

  • Is this issue still relevant to the latest version of the Dify repository? If so, please comment to keep the discussion open.
  • Otherwise, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

dosubot[bot] avatar Mar 09 '25 16:03 dosubot[bot]