[Bug]: Deepdoc doesn’t handle tri-column formatted PDFs very well
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly ( 非英文标题的提交将会被直接关闭 ) (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
RAGFlow workspace code commit ID
5d5dbb3
RAGFlow image version
0.18
Other environment information
Actual behavior
- The content from the three columns is intermingled.
- Please refer to the snapshot.
Expected behavior
- Correctly recognize content in three-column layouts.
Steps to reproduce
- Upload the PDF.
- Parse it with the 'General' chunk method using DeepDoc.
Additional information
The issue you're experiencing with tri-column formatted PDFs in RAGFlow's deepdoc component might be related to how the RAGFlowPdfParser handles multi-column layouts. The parser uses the LayoutRecognizer to identify and process different layout types, but it seems that tri-column layouts are not being handled correctly, leading to intermingled content [1].
To address this, you might want to explore the following options:
- Check for Updates: Ensure you are using the latest version of RAGFlow, as updates might include improvements or fixes for handling complex layouts.
- Adjust LayoutRecognizer Settings: Although specific parameters for improving tri-column parsing aren't documented, experimenting with the LayoutRecognizer settings might help. This could involve adjusting how text boxes are sorted and merged based on their spatial relationships.
- OCR and Table Structure Recognition: Ensure that OCR and table structure recognition are properly configured, as these components play a crucial role in accurately extracting and organizing content from complex layouts [1].
- Custom Model Training: If the built-in models are insufficient, consider training a custom model to better handle tri-column layouts, especially if your documents have unique formatting characteristics.
These steps might help improve the parsing of tri-column PDFs. If the issue persists, further investigation into the specific implementation details of the RAGFlowPdfParser might be necessary.
Hi, just want to know if you have tested the other chunk methods. Do all of them have the same problem?
yeah same problem!
- paper
- manual
Could you share this file with us? [email protected]
sure. It is an open-domain document.
FYI @KevinHuSh
@KevinHuSh Could this be caused by the function "sort_Y_firstly"? It does not take double-column or multi-column reading order into account. https://github.com/infiniflow/ragflow/blob/bf7f7c7027d8b3eca4d01c9c36347dffdefc4da6/deepdoc/vision/recognizer.py#L55
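To illustrate the suspicion above, here is a minimal sketch of why a purely Y-first sort can intermingle columns, and what a column-aware ordering might look like. This is not the deepdoc implementation: `sort_y_first`, `sort_column_first`, `make_box`, and the `min_gap` parameter are hypothetical names for illustration, and the box keys (`x0`, `x1`, `top`, `bottom`, `text`) are only assumed to resemble what the parser uses.

```python
def make_box(text, x0, top, width=150.0, height=20.0):
    """Build a hypothetical text box; the key names are an assumption."""
    return {"text": text, "x0": x0, "x1": x0 + width,
            "top": top, "bottom": top + height}


def sort_y_first(boxes):
    """Naive reading order: top to bottom, then left to right.

    On a multi-column page this reads across columns line by line.
    """
    return sorted(boxes, key=lambda b: (b["top"], b["x0"]))


def sort_column_first(boxes, min_gap=10.0):
    """Group boxes into columns by horizontal gaps, then read each column
    top to bottom, columns left to right."""
    if not boxes:
        return []
    by_x = sorted(boxes, key=lambda b: b["x0"])
    columns = [[by_x[0]]]
    right_edge = by_x[0]["x1"]
    for box in by_x[1:]:
        if box["x0"] - right_edge > min_gap:
            # Clear horizontal gap: start a new column.
            columns.append([box])
            right_edge = box["x1"]
        else:
            columns[-1].append(box)
            right_edge = max(right_edge, box["x1"])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b["top"]))
    return ordered


# Three columns (A, B, C) with two lines each, given in arbitrary order.
boxes = [
    make_box("B2", 220, 130), make_box("A1", 50, 100), make_box("C1", 390, 100),
    make_box("A2", 50, 130), make_box("C2", 390, 130), make_box("B1", 220, 100),
]
print([b["text"] for b in sort_y_first(boxes)])       # ['A1', 'B1', 'C1', 'A2', 'B2', 'C2']
print([b["text"] for b in sort_column_first(boxes)])  # ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
```

Note that this gap-based grouping is deliberately simple: a full-width title or figure spanning all three columns would bridge the gaps and collapse everything into one column, so a real fix would likely need to handle such blocks separately.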
Hi @KevinHuSh, just checking if there's any update? Running into the same thing here. Any pointers appreciated! 👍
I am also getting the same problem. I use DeepDoc with the Paper option.