Unicode Escape Characters in LLM Prompt Increase Token Consumption
Self Checks
- [x] This is only for bug reports; if you would like to ask a question, please head to Discussions.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please use English to submit your issue, otherwise it will be closed. Thank you! :)
- [x] Please do not modify this template :) and fill in all the required fields.
Dify version
v0.15.2
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
I followed the tutorial to set up a chatflow that supports document uploads. After uploading a file, the document parser correctly processed the Chinese text in the document, and the display was correct.
The test file was a DOCX document containing 540 Chinese characters and a total of 1213 characters.
However, when the parsed result was passed via the system prompt to the next node (LLM), the content was converted into Unicode escape sequences.
Although the LLM can understand these escape characters, this significantly increases the token consumption (2137 tokens).
I compared this with another AI workflow tool, FastGPT, which parsed the same document but sent the actual Chinese characters to the LLM node. Its token consumption was much lower (514 tokens, only 24% of Dify's).
I also tested a PDF document with over 6000 Chinese characters and over 16000 characters in total. Dify's consumption with Unicode escapes was 29364 tokens, while FastGPT's was 7991 (27.2% of Dify's).
I have not tested other non-ASCII languages and am unsure if there is a similar issue with other languages.
I browsed through other similar closed issues (#5010 #7040 #8056), which mention Unicode escape problems in other areas, but I'm unsure whether the same issue occurs in other parts of the system.
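For illustration, here is a minimal sketch (not Dify code) of how the default JSON escaping inflates the token count of Chinese text. It assumes the `tiktoken` package for counting; Dify's actual tokenizer may differ, but the effect is the same:

```python
import json

import tiktoken  # assumption: used here only to demonstrate token counts

text = "这是一段中文测试文本。"  # a short run of Chinese characters

# json.dumps escapes every non-ASCII character to \uXXXX by default,
# so each Chinese character becomes a six-character ASCII sequence.
escaped = json.dumps(text)                        # "\u8fd9\u662f..."
preserved = json.dumps(text, ensure_ascii=False)  # "这是一段中文测试文本。"

enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(escaped)))    # many tokens: each \uXXXX escape splits into several
print(len(enc.encode(preserved)))  # far fewer: CJK text tokenizes compactly
```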
✔️ Expected Behavior
Like FastGPT, the system should send the actual Chinese characters instead of Unicode escape characters, to reduce token consumption.
❌ Actual Behavior
No response
@LinLin00000000 Hello there! I'm here to help you with any bugs, questions, or contributions you have in mind. Let's work together to solve the issue you've encountered.
I found a similar open issue related to Chinese character encoding in Dify. The suggested solution is to ensure that JSON encoding and decoding operations handle non-ASCII characters correctly by using the `ensure_ascii=False` parameter in the `json.dumps` function. This parameter ensures that the JSON output includes non-ASCII characters as they are, rather than escaping them. Additionally, ensure that all text fields in configuration files support Chinese characters [1].
In your case, the conversion to Unicode escape characters is significantly increasing token consumption. You might want to check whether the `ensure_ascii` parameter is being used in your setup and adjust it accordingly to prevent the conversion of Chinese characters to Unicode escape sequences. This could help reduce token consumption when processing prompts with non-ASCII characters.
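For reference, a minimal sketch of the suggested fix. `build_prompt` is a hypothetical stand-in for whichever code path serializes node output into the system prompt; the point is only the `ensure_ascii=False` argument:

```python
import json

def build_prompt(parsed_document: dict) -> str:
    """Hypothetical stand-in for the code that serializes parsed
    document content into the LLM system prompt."""
    # ensure_ascii=False keeps Chinese characters as-is instead of
    # expanding each one into a six-character \uXXXX escape.
    return json.dumps(parsed_document, ensure_ascii=False)

print(build_prompt({"content": "中文内容"}))
# {"content": "中文内容"}  rather than  {"content": "\u4e2d\u6587\u5185\u5bb9"}
```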
Bro, have you solved this yet?
Not yet; we'll have to wait for an official fix.
This not only increases the total input tokens, but also the processing time of the LLM. A rough estimate is that it takes 2-3 times longer than before.
Not only that: Chinese characters are sometimes turned into Unicode escapes when passed as a variable to the HTTP request node. This complicates things in many situations.
Almost a month has passed. Has this been resolved?
Nope. I have migrated to v1.0.0, yet it's the same old story...
#13860 #14673
Two new issues have been raised regarding this problem. 🥲
Guys, thanks to the Dify team, they fixed this problem. Check this: https://github.com/langgenius/dify/pull/15318
Hi, @LinLin00000000. I'm Dosu, and I'm helping the Dify team manage their backlog and am marking this issue as stale.
Issue Summary
- You reported an issue with Unicode escape characters in LLM prompts in Dify v0.15.2.
- This led to increased token consumption and affected processing time and HTTP requests.
- I suggested using `ensure_ascii=False` in JSON operations.
- Related issues #13860 and #14673 were mentioned.
- The issue has been resolved in a recent pull request, as confirmed by allensavage.
Next Steps
- Please confirm if this issue is still relevant to the latest version of Dify. If so, you can keep the discussion open by commenting here.
- Otherwise, this issue will be automatically closed in 15 days.
Thank you for your understanding and contribution!