The implementation of knowledge base PDF parsing using pypdfium2 to e…

Open 309299817 opened this issue 8 months ago • 0 comments

…xtract text mainly has the following issues:

Limited text extraction capability, with insufficient support for tables and images
No processing optimized for Chinese text
No document structure analysis
No document quality assessment

Suggested optimization plan:
Use pdfplumber instead of pypdfium2
Add OCR support
Optimize Chinese processing logic
Add document structure analysis
Implement intelligent table recognition
Add caching mechanism
Optimize large file processing
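
As an illustration of the caching and large-file items above, here is a minimal sketch using only the Python standard library. It is not the code in this PR: the names `file_digest` and `extract_with_cache`, the JSON cache layout, and the `extractor` callable (which could wrap pdfplumber, pypdfium2, or an OCR step) are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in fixed-size chunks so a large PDF is never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_with_cache(path: Path, extractor: Callable[[Path], str], cache_dir: Path) -> str:
    """Return cached text for an unchanged file; otherwise run the extractor and store the result.

    `extractor` is a hypothetical hook: any function that takes a PDF path and returns its text.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache on file content, so a renamed file hits and an edited file misses.
    cache_file = cache_dir / f"{file_digest(path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))["text"]
    text = extractor(path)
    # ensure_ascii=False keeps Chinese text readable in the cache file.
    cache_file.write_text(json.dumps({"text": text}, ensure_ascii=False), encoding="utf-8")
    return text
```

Keying on a content hash rather than the file name means re-uploading the same document under a different name skips parsing, at the cost of one sequential read of the file.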

Summary

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

[!Tip]
Close issue syntax: Fixes #<issue number> or Resolves #<issue number>, see documentation for more details.

Screenshots

Before	After
...	...

Checklist

[!IMPORTANT]
Please review the checklist below before submitting your pull request.

[ ] This change requires a documentation update, included: Dify Document
[x] I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
[x] I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
[x] I've updated the documentation accordingly.
[x] I ran dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods

Apr 29 '25 03:04 309299817