Unicode Escape Characters in LLM Prompt Increase Token Consumption
Self Checks
- [x] This is only for bug reports; if you would like to ask a question, please head to Discussions.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please use English to submit your issue, otherwise it will be closed. Thank you! :)
- [x] Please do not modify this template :) and fill in all the required fields.
Dify version
v0.15.2
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
I followed the tutorial to set up a chatflow that supports document uploads. After uploading a file, the document parser correctly processed the Chinese text in the document, and the display was correct.
The test file was a DOCX document containing 540 Chinese characters and a total of 1213 characters.
However, when the parsed result was passed via the system prompt to the next node (LLM), the content was converted into Unicode escape sequences.
Although the LLM can understand these escape characters, this significantly increases the token consumption (2137 tokens).
I compared this with another AI workflow tool, FastGPT, which parsed the same document but sent the actual Chinese characters to the LLM node. Its token consumption was much lower (514 tokens, only 24% of Dify's).
I also tested a PDF document with over 6000 Chinese characters and over 16000 characters in total. Dify's consumption with Unicode escapes was 29364 tokens, while FastGPT's was 7991 (27.2% of Dify's).
I have not tested other non-ASCII languages and am unsure if there is a similar issue with other languages.
I browsed through other similar closed issues (#5010 #7040 #8056), which mention Unicode escape problems in other areas, but I'm unsure whether the same issue occurs in other parts of the system.
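For illustration, here is a minimal sketch (not Dify code) of how the default JSON escaping inflates the token count of Chinese text. It assumes the `tiktoken` package for counting; Dify's actual tokenizer may differ, but the effect is the same:

```python
import json

import tiktoken  # assumption: used here only to demonstrate token counts

text = "这是一段中文测试文本。"  # a short run of Chinese characters

# json.dumps escapes every non-ASCII character to \uXXXX by default,
# so each Chinese character becomes a six-character ASCII sequence.
escaped = json.dumps(text)                        # "\u8fd9\u662f..."
preserved = json.dumps(text, ensure_ascii=False)  # "这是一段中文测试文本。"

enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(escaped)))    # many tokens: each \uXXXX escape splits into several
print(len(enc.encode(preserved)))  # far fewer: CJK text tokenizes compactly
```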
✔️ Expected Behavior
Like FastGPT, the system should send the actual Chinese characters instead of Unicode escape characters, to reduce token consumption.
❌ Actual Behavior
No response
@LinLin00000000 Hello there! I'm here to help you with any bugs, questions, or contributions you have in mind. Let's work together to solve the issue you've encountered.
I found a similar open issue related to Chinese character encoding in Dify. The suggested solution is to ensure that JSON encoding and decoding operations handle non-ASCII characters correctly by using the `ensure_ascii=False` parameter in the `json.dumps` function. This parameter ensures that the JSON output includes non-ASCII characters as they are, rather than escaping them. Additionally, ensure that all text fields in configuration files support Chinese characters [1].
In your case, the conversion to Unicode escape characters is significantly increasing token consumption. You might want to check whether the `ensure_ascii` parameter is being used in your setup and adjust it accordingly to prevent the conversion of Chinese characters to Unicode escape sequences. This could help reduce token consumption when processing prompts with non-ASCII characters.
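For reference, a minimal sketch of the suggested fix. `build_prompt` is a hypothetical stand-in for whichever code path serializes node output into the system prompt; the point is only the `ensure_ascii=False` argument:

```python
import json

def build_prompt(parsed_document: dict) -> str:
    """Hypothetical stand-in for the code that serializes parsed
    document content into the LLM system prompt."""
    # ensure_ascii=False keeps Chinese characters as-is instead of
    # expanding each one into a six-character \uXXXX escape.
    return json.dumps(parsed_document, ensure_ascii=False)

print(build_prompt({"content": "中文内容"}))
# {"content": "中文内容"}  rather than  {"content": "\u4e2d\u6587\u5185\u5bb9"}
```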
Bro, have you solved this yet?
Not yet; we'll have to wait for an official fix.
This not only increases the total input tokens, but also the processing time of the LLM. A rough estimate is that it takes 2-3 times longer than before.
Not only that: Chinese characters are sometimes turned into Unicode escapes when passed as a variable to the HTTP request node. This complicates things in many situations.
Almost a month has passed. Has this been resolved?
Nope. I have migrated to v1.0.0, yet it's the same old story...
#13860 #14673
Two new issues have been raised regarding this problem. 🥲
Guys, thanks to the Dify team, they fixed this problem. Check this: https://github.com/langgenius/dify/pull/15318
Hi, @LinLin00000000. I'm Dosu, and I'm helping the Dify team manage their backlog and am marking this issue as stale.
Issue Summary
- You reported an issue with Unicode escape characters in LLM prompts in Dify v0.15.2.
- This led to increased token consumption and affected processing time and HTTP requests.
- I suggested using `ensure_ascii=False` in JSON operations.
- Related issues #13860 and #14673 were mentioned.
- The issue has been resolved in a recent pull request, as confirmed by allensavage.
Next Steps
- Please confirm if this issue is still relevant to the latest version of Dify. If so, you can keep the discussion open by commenting here.
- Otherwise, this issue will be automatically closed in 15 days.
Thank you for your understanding and contribution!