The LLM does not get the contents of the document extractor correctly
Self Checks
- [x] This is only for bug report, if you would like to ask a question, please head to Discussions.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
- [x] Please do not modify this template :) and fill in all the required fields.
Dify version
1.0
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
The LLM does not receive the document extractor's output correctly. The user message passed to the model looks like this:
{ "role": "user", "text": "文章内容如下:\n["\u5b89\u5168\u751f\u4ea7\u884c\u653f\u6267\u6cd5\u6587\u4e66\r\n\r\n\u8be2\u95ee\u7b14\u5f55\r\n\r\n\u8be2\u95ee\u65f6\u95f4\uff1a2025\u5e741\u67087\u65e5 15\u65f6 30\u5206\u81f316\u65f650\u5206 \u7b2c \r\n\r\n\u95ee\uff1a\u4f60\u770b\u4e00\u4e0b\uff0c\u548c\u4f60\u8bf4\u7684\u662f\u5426\u4e00\u81f4\uff0c\u5982\u4e00\u81f4\u8bf7\u7b7e\u540d\u3002 \r\n\r\n\u7b54\uff1a "]", "files": [] } ], "model_provider": "langgenius/ollama/ollama", "model_name": "deepseek-r1:14b" }
✔️ Expected Behavior
No response
❌ Actual Behavior
No response
@Hunter6324 Hello there! I'm here to help you with bugs, questions, and becoming a contributor. Let's squash those bugs together! 🤖
I found a similar open issue that might be related to your problem. The issue suggests that the problem could be linked to the text extraction process using the pypdfium2 library. It recommends checking the integrity of the PDF file to ensure it's not corrupted. Additionally, it advises reviewing the _extract_text_from_pdf function in the DocumentExtractorNode class to see if any improvements or custom handling can be applied for your specific use case [1].
The document extractor's output reaches the LLM as Unicode escape sequences (for example `\u5b89\u5168`) instead of readable text, so the LLM fails to parse the user-uploaded document.
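A minimal sketch of how this symptom could arise, assuming the extractor output is a list of strings that gets serialized with Python's `json.dumps` at its default `ensure_ascii=True` before being placed in the prompt. This is a guess based on the `["\u5b89..."]` shape in the payload, not confirmed against the Dify source; the variable names are hypothetical.

```python
import json

# Hypothetical stand-in for the document extractor output: a list of strings.
extracted = ["安全生产"]

# Default serialization escapes every non-ASCII character as \uXXXX,
# which matches what appears in the user message above.
escaped = json.dumps(extracted)

# Passing ensure_ascii=False keeps the original characters readable.
readable = json.dumps(extracted, ensure_ascii=False)

# Alternatively, join the list into plain text before building the prompt,
# so no JSON serialization of the document text happens at all.
as_text = "\n".join(extracted)

print(escaped)
print(readable)
print(as_text)
```

The escaped form is still valid JSON and decodes back to the same text, so the document content is not lost; it is only presented to the model in a form that smaller models may not interpret reliably.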