
The LLM does not get the contents of the document extractor correctly

Open Hunter6324 opened this issue 10 months ago • 2 comments

Self Checks

  • [x] This is only for bug report, if you would like to ask a question, please head to Discussions.
  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [x] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • [x] Please do not modify this template :) and fill in all the required fields.

Dify version

1.0

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

The LLM does not get the contents of the document extractor correctly

{ "role": "user", "text": "文章内容如下:\n["\u5b89\u5168\u751f\u4ea7\u884c\u653f\u6267\u6cd5\u6587\u4e66\r\n\r\n\u8be2\u95ee\u7b14\u5f55\r\n\r\n\u8be2\u95ee\u65f6\u95f4\uff1a2025\u5e741\u67087\u65e5 15\u65f6 30\u5206\u81f316\u65f650\u5206 \u7b2c \r\n\r\n\u95ee\uff1a\u4f60\u770b\u4e00\u4e0b\uff0c\u548c\u4f60\u8bf4\u7684\u662f\u5426\u4e00\u81f4\uff0c\u5982\u4e00\u81f4\u8bf7\u7b7e\u540d\u3002 \r\n\r\n\u7b54\uff1a "]", "files": [] } ], "model_provider": "langgenius/ollama/ollama", "model_name": "deepseek-r1:14b" }
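To check whether the extracted text in the payload above is intact, the escapes can be decoded by hand. This is a minimal sketch, assuming the bracketed fragment inside `text` is a JSON array of strings; the sample copies only the first part of the escaped text from the log, and the variable names are illustrative.

```python
import json

# Shortened copy of the escaped fragment from the logged prompt.
# The raw string keeps the backslashes exactly as they appear in the log.
escaped = r'["\u5b89\u5168\u751f\u4ea7\u884c\u653f\u6267\u6cd5\u6587\u4e66\r\n\r\n\u8be2\u95ee\u7b14\u5f55"]'

# json.loads turns each \uXXXX sequence back into the character it encodes.
decoded = json.loads(escaped)

print(decoded[0])
```

If this prints readable Chinese text, the extraction itself worked, and the problem lies in how the result is serialized before it is placed into the prompt.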

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

Hunter6324 · Mar 02 '25 11:03

@Hunter6324 Hello there! I'm here to help you with bugs, questions, and becoming a contributor. Let's squash those bugs together! 🤖

I found a similar open issue that might be related to your problem. The issue suggests that the problem could be linked to the text extraction process using the pypdfium2 library. It recommends checking the integrity of the PDF file to ensure it's not corrupted. Additionally, it advises reviewing the _extract_text_from_pdf function in the DocumentExtractorNode class to see if any improvements or custom handling can be applied for your specific use case [1].


dosubot[bot] · Mar 02 '25 11:03

The document extractor's output reaches the LLM as Unicode escape sequences (for example \u5b89\u5168) instead of readable text, so the LLM fails to parse the user-uploaded document.
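One plausible way to end up with this symptom is serializing a list of strings with Python's `json.dumps` at its default `ensure_ascii=True`. The sketch below only reproduces that behavior in isolation. The thread does not confirm that this is what Dify does, and `extracted` is a hypothetical stand-in for the document extractor output.

```python
import json

# Hypothetical stand-in for the extractor output: a list with one string.
extracted = ["询问笔录\r\n\r\n问:"]

# Default ensure_ascii=True escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(extracted)

# ensure_ascii=False keeps the original characters in the output.
readable = json.dumps(extracted, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data, so the content is not lost. The escaped form is just much harder for a model to read when it is pasted into a prompt as plain text.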

Hunter6324 · Mar 02 '25 11:03