dify
The current knowledge base PDF parsing, which uses pypdfium2 to extract text, has the following main issues:
- Limited text extraction capability and insufficient support for tables and images
- Lack of specialized Chinese processing optimization
- No document structure analysis
- Lack of document quality assessment

Suggested optimization plan:
- Use pdfplumber instead of pypdfium2
- Add OCR support
- Optimize Chinese processing logic
- Add document structure analysis
- Implement intelligent table recognition
- Add a caching mechanism
- Optimize large file processing
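
A minimal sketch of how three of the points above (pdfplumber extraction, table recognition, and caching) could fit together. This is an illustration under stated assumptions, not the code in this PR: the helper names `table_to_markdown`, `iter_pages`, `file_digest`, and `extract_cached`, and the one-text-file-per-hash cache layout, are all hypothetical. Only `pdfplumber.open`, `pdf.pages`, `page.extract_text()`, and `page.extract_tables()` are real pdfplumber calls.

```python
import hashlib
from pathlib import Path
from typing import Callable, Iterable, Iterator, Optional, Sequence


def table_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render a table (rows of cells, the shape extract_tables() returns) as Markdown."""
    if not rows:
        return ""

    def clean(cell: Optional[str]) -> str:
        # Empty cells come back as None; multi-line cells contain newlines.
        return (cell or "").replace("\n", " ").strip()

    header, *body = rows
    lines = [
        "| " + " | ".join(clean(c) for c in header) + " |",
        "|" + "---|" * len(header),
    ]
    lines += ["| " + " | ".join(clean(c) for c in row) + " |" for row in body]
    return "\n".join(lines)


def iter_pages(path: str) -> Iterator[str]:
    """Yield text plus Markdown tables one page at a time (hypothetical helper)."""
    import pdfplumber  # third-party; imported lazily so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            tables = [table_to_markdown(t) for t in page.extract_tables()]
            yield "\n\n".join([text, *tables]).strip()


def file_digest(path: str) -> str:
    """SHA-256 of the file, read in 1 MiB chunks so large PDFs are not loaded at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def extract_cached(
    path: str,
    cache_dir: str,
    extractor: Callable[[str], Iterable[str]] = iter_pages,
) -> str:
    """Return cached text for this file's content hash, extracting only on a miss."""
    cache_file = Path(cache_dir) / f"{file_digest(path)}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = "\n\n".join(extractor(path))
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

Keying the cache on a content hash rather than the file name means a re-uploaded, unchanged document is not parsed twice, while an edited file with the same name is. The `extractor` parameter is injectable so an OCR fallback could be swapped in later.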
Summary
Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.
[!Tip] Close issue syntax:
`Fixes #<issue number>` or `Resolves #<issue number>`, see documentation for more details.
Screenshots
| Before | After |
|---|---|
| ... | ... |
Checklist
[!IMPORTANT]
Please review the checklist below before submitting your pull request.
- [ ] This change requires a documentation update, included: Dify Document
- [x] I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
- [x] I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
- [x] I've updated the documentation accordingly.
- [x] I ran `dev/reformat` (backend) and `cd web && npx lint-staged` (frontend) to appease the lint gods