[Bug]: Deepdoc doesn’t handle tri-column formatted PDFs very well

Open flowerljl opened this issue 7 months ago • 6 comments

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

RAGFlow workspace code commit ID

5d5dbb3

RAGFlow image version

0.18

Other environment information


Actual behavior

  • The content from the three columns is intermingled
  • Please refer to the snapshots under "Additional information".

Expected behavior

  • Correctly recognize content in three-column layouts.

Steps to reproduce

  • Upload the PDF.
  • Parse it with the 'General' chunk method using DeepDoc.

Additional information

Image

Image

flowerljl avatar May 14 '25 08:05 flowerljl

Hi, have you tested with other chunk methods? Do all the other methods have the same problem?

lamhktommy avatar May 14 '25 08:05 lamhktommy

The issue you're experiencing with tri-column formatted PDFs in RAGFlow's deepdoc component might be related to how the RAGFlowPdfParser handles multi-column layouts. The parser uses the LayoutRecognizer to identify and process different layout types, but it seems that tri-column layouts are not being handled correctly, leading to intermingled content [1].

To address this, you might want to explore the following options:

  1. Check for Updates: Ensure you are using the latest version of RAGFlow, as updates might include improvements or fixes for handling complex layouts.

  2. Adjust LayoutRecognizer Settings: Although specific parameters for improving tri-column parsing aren't documented, experimenting with the LayoutRecognizer settings might help. This could involve adjusting how text boxes are sorted and merged based on their spatial relationships.

  3. OCR and Table Structure Recognition: Ensure that OCR and table structure recognition are properly configured, as these components play a crucial role in accurately extracting and organizing content from complex layouts [1].

  4. Custom Model Training: If the built-in models are insufficient, consider training a custom model to better handle tri-column layouts, especially if your documents have unique formatting characteristics.

These steps might help improve the parsing of tri-column PDFs. If the issue persists, further investigation into the specific implementation details of the RAGFlowPdfParser might be necessary.
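As a rough diagnostic for the "spatial relationships" point in item 2, the sketch below checks whether the text boxes on a page fall into several columns by merging their horizontal extents and counting the gutters between them. It is only an illustration: `find_column_spans`, `min_gutter`, and the box dictionaries with `x0`/`x1` keys are assumptions made for this example, not RAGFlow settings or APIs.

```python
def find_column_spans(boxes, min_gutter=20):
    """Merge the horizontal extents of text boxes into column spans.

    Hypothetical helper: a horizontal gap of at least `min_gutter`
    between merged extents is treated as a gutter between columns.
    """
    spans = []
    for x0, x1 in sorted((b["x0"], b["x1"]) for b in boxes):
        if spans and x0 - spans[-1][1] < min_gutter:
            # Overlaps or nearly touches the current span: extend it.
            spans[-1][1] = max(spans[-1][1], x1)
        else:
            # Far enough to the right: start a new column span.
            spans.append([x0, x1])
    return [tuple(s) for s in spans]


# Made-up coordinates for a page with three columns.
boxes = [
    {"x0": 50, "x1": 200}, {"x0": 52, "x1": 198},
    {"x0": 230, "x1": 380}, {"x0": 231, "x1": 379},
    {"x0": 410, "x1": 560},
]
spans = find_column_spans(boxes)
print(len(spans), spans)
```

A full-width title or table would bridge the gutters and collapse everything into one span, so a check like this would have to be run on body text only.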

dosubot[bot] avatar May 14 '25 08:05 dosubot[bot]

Hi, have you tested with other chunk methods? Do all the other methods have the same problem?

Yes, the same problem occurs with the other methods I tried:

  • paper

Image

  • manual

Image

flowerljl avatar May 14 '25 08:05 flowerljl

Could you share this file with us? [email protected]

KevinHuSh avatar May 15 '25 01:05 KevinHuSh

Sure. It is a publicly available document.

FYI @KevinHuSh

_融入式_境外企业档案工作策略分析——以万宝矿产有限公司缅甸项目档案工作为例.pdf

flowerljl avatar May 16 '25 07:05 flowerljl

@KevinHuSh Could this be caused by the function "sort_Y_firstly"? It does not take double-column or multi-column reading order into account. https://github.com/infiniflow/ragflow/blob/bf7f7c7027d8b3eca4d01c9c36347dffdefc4da6/deepdoc/vision/recognizer.py#L55
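To illustrate the suspicion, here is a minimal sketch of why a top-first sort interleaves columns. `sort_y_first` is a simplified stand-in written for this example (it is not copied from the linked file), and `sort_columns_first`, the `gap` parameter, and the box dictionaries with `x0`/`top` keys are likewise assumptions.

```python
from functools import cmp_to_key


def sort_y_first(boxes, threshold=5):
    """Simplified stand-in: order by top edge, break near-ties by x0."""
    def cmp(a, b):
        diff = a["top"] - b["top"]
        if abs(diff) < threshold:
            diff = a["x0"] - b["x0"]
        return diff
    return sorted(boxes, key=cmp_to_key(cmp))


def sort_columns_first(boxes, gap=50):
    """Hypothetical alternative: group boxes into columns by left edge,
    then read each column from top to bottom, left column first."""
    columns = []
    for box in sorted(boxes, key=lambda b: b["x0"]):
        if columns and box["x0"] - columns[-1][-1]["x0"] <= gap:
            columns[-1].append(box)   # left edge close to the previous box
        else:
            columns.append([box])     # large jump in x0: new column
    ordered = []
    for col in columns:
        ordered.extend(sorted(col, key=lambda b: b["top"]))
    return ordered


# Made-up three-column page: columns A, B, C with two lines each.
boxes = [
    {"text": "A1", "x0": 50, "top": 100},
    {"text": "B1", "x0": 250, "top": 102},
    {"text": "C1", "x0": 450, "top": 101},
    {"text": "A2", "x0": 50, "top": 130},
    {"text": "B2", "x0": 250, "top": 131},
    {"text": "C2", "x0": 450, "top": 129},
]
y_first = [b["text"] for b in sort_y_first(boxes)]
col_first = [b["text"] for b in sort_columns_first(boxes)]
print(y_first)    # lines from different columns alternate
print(col_first)  # each column is read in full before the next
```

The column-first version is only a sketch: indented lines, full-width headings, and figures that span columns would need extra handling before something like this could replace the existing ordering.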

Danee-wawawa avatar Jun 12 '25 09:06 Danee-wawawa

Hi @KevinHuSh, just checking whether there's any update. I'm running into the same thing here. Any pointers appreciated! 👍

Image

xujryan avatar Jul 18 '25 05:07 xujryan

I am also getting the same problem. I use DeepDoc with the paper option.

Image

ikoshos-gland avatar Aug 29 '25 17:08 ikoshos-gland