ragflow
[Bug]: Parts of content missing for PDF
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Branch name
deepdoc
Commit ID
sdf234cdwsdfwer
Other environment information
No response
Actual behavior
Some content is missing from the text recognized in the PDF:
(整理)兰炭工程施工的难点识别及监理实施办法.(1).pdf
Expected behavior
Parsing failure
Steps to reproduce
dfasdfds
Additional information
No response
You could try it with the 'manual' or 'laws' chunking method.
I tried it with 'General' method on the demo. It works fine.
I'm using the General method with a chunk size of 2048. When the file above is split into chunks, some content is lost. Please use the file above for testing.
Just remove garbage_layouts in layout_recognizer.py. The author filters out certain layout types.
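To illustrate why that helps, here is a minimal sketch of how a layout-type filter can silently drop content. It is not the actual code from layout_recognizer.py: the function `filter_layouts`, the region dictionaries, and the contents of `GARBAGE_LAYOUTS` are all assumptions made for the example.

```python
# Hypothetical stand-in for the garbage_layouts list; the real contents may differ.
GARBAGE_LAYOUTS = ["footer", "header", "reference"]


def filter_layouts(regions, garbage_layouts):
    """Keep only the regions whose layout type is not in garbage_layouts."""
    return [r for r in regions if r["type"] not in garbage_layouts]


# Made-up recognizer output: each region has a layout type and its text.
regions = [
    {"type": "header", "text": "Chapter 3"},
    {"type": "text", "text": "Body paragraph."},
    {"type": "reference", "text": "[1] Some citation"},
]

# With the filter active, anything classified as a garbage layout is lost.
kept_default = filter_layouts(regions, GARBAGE_LAYOUTS)

# With an empty list (the effect of removing garbage_layouts), everything is kept.
kept_all = filter_layouts(regions, [])

print([r["type"] for r in kept_default])
print([r["type"] for r in kept_all])
```

If a body paragraph is misclassified as one of the filtered types, it disappears from the parsed output, which would match the missing content reported here. Emptying the list keeps those regions, at the cost of also keeping real headers and footers.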