The implementation of knowledge base PDF parsing using pypdfium2 to extract text mainly has the following issues:

Open 309299817 opened this issue 8 months ago • 0 comments

  1. Limited text extraction capability and insufficient support for tables and images
  2. No processing optimized specifically for Chinese text
  3. No document structure analysis
  4. No document quality assessment

Suggested optimization plan:

  1. Use pdfplumber instead of pypdfium2
  2. Add OCR support
  3. Optimize Chinese processing logic
  4. Add document structure analysis
  5. Implement intelligent table recognition
  6. Add a caching mechanism
  7. Optimize large file processing
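
To make a few of the plan items concrete, here is a minimal sketch of page-by-page extraction with pdfplumber (text plus tables) combined with a content-hash cache. It is an illustration only, not the code in this PR: `iter_pages_pdfplumber` and `ParseCache` are hypothetical names, the cache is in-memory, and OCR, Chinese-specific handling, and structure analysis are left out. The only pdfplumber calls assumed are `pdfplumber.open`, `pdf.pages`, `page.extract_text()`, and `page.extract_tables()`.

```python
import hashlib
from typing import Callable, Dict, Iterator, List, Tuple

# One parsed page: (text, tables). Each table is a list of rows of cells.
Page = Tuple[str, list]


def iter_pages_pdfplumber(path: str) -> Iterator[Page]:
    """Yield (text, tables) one page at a time so callers can stream large files."""
    # Third-party dependency; imported lazily so the cache below works without it.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # Image-only pages may yield no text; normalise to an empty string.
            text = page.extract_text() or ""
            tables = page.extract_tables()
            yield text, tables


class ParseCache:
    """Hypothetical in-memory cache keyed by the SHA-256 of the file bytes."""

    def __init__(self, extractor: Callable[[str], Iterator[Page]]) -> None:
        self._extractor = extractor
        self._store: Dict[str, List[Page]] = {}
        self.misses = 0  # number of times the extractor actually ran

    @staticmethod
    def _digest(path: str) -> str:
        # Hash in 1 MiB chunks so hashing itself does not load the whole file.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def parse(self, path: str) -> List[Page]:
        key = self._digest(path)
        if key not in self._store:
            self.misses += 1
            self._store[key] = list(self._extractor(path))
        return self._store[key]
```

Usage would look like `ParseCache(iter_pages_pdfplumber).parse("doc.pdf")`. Keying on the file hash rather than the file name means a re-uploaded, unchanged document is not parsed twice. A real deployment would likely need a persistent or size-bounded store instead of a plain dict.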

Summary

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

[!Tip] Close issue syntax: Fixes #<issue number> or Resolves #<issue number>, see documentation for more details.

Screenshots

Before After
... ...

Checklist

[!IMPORTANT]
Please review the checklist below before submitting your pull request.

  • [ ] This change requires a documentation update, included: Dify Document
  • [x] I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • [x] I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • [x] I've updated the documentation accordingly.
  • [x] I ran dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods

309299817 Apr 29 '25 03:04