[Bug]: Parts of content missing for PDF

Open majia1984 opened this issue 1 year ago • 4 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch name

deepdoc

Commit ID

sdf234cdwsdfwer

Other environment information

No response

Actual behavior

Text extracted from the PDF is missing some content. File: 12c3b333c8986cc92fce5fe05f8d49a (整理)兰炭工程施工的难点识别及监理实施办法.(1).pdf

Expected behavior

Parsing failed

Steps to reproduce

dfasdfds

Additional information

No response

majia1984 avatar Apr 26 '24 11:04 majia1984

You could try the 'manual' or 'laws' chunking method.

KevinHuSh avatar Apr 28 '24 02:04 KevinHuSh

I tried it with 'General' method on the demo. It works fine.

KevinHuSh avatar Apr 28 '24 08:04 KevinHuSh

I'm using the General mode with a block size of 2048. When the file above is split into chunks, some content is lost. Please use the file above for testing.

majia1984 avatar Apr 30 '24 06:04 majia1984

Just remove garbage_layouts in layout_recognizer.py. The author filters out certain layout types there.

ywandy avatar Jun 20 '24 08:06 ywandy
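
To picture why removing garbage_layouts brings the missing content back, here is a minimal sketch of a layout filter of that kind. It is not the actual code from layout_recognizer.py: the function name `filter_regions`, the region dictionaries, and the labels in `GARBAGE_LAYOUTS` are all hypothetical and chosen only for illustration.

```python
# Hypothetical layout labels; the real list in layout_recognizer.py may differ.
GARBAGE_LAYOUTS = ("header", "footer", "reference")


def filter_regions(regions, garbage_layouts=GARBAGE_LAYOUTS):
    """Drop every region whose layout type is in garbage_layouts."""
    garbage = set(garbage_layouts)
    return [r for r in regions if r["type"] not in garbage]


# Toy page: three detected regions with made-up text.
regions = [
    {"type": "header", "text": "Running title"},
    {"type": "text", "text": "Section 1: construction difficulties"},
    {"type": "reference", "text": "Appendix list"},
]

# With the default filter, only the plain "text" region survives.
kept_default = filter_regions(regions)

# With an empty filter (the workaround), nothing is dropped.
kept_all = filter_regions(regions, garbage_layouts=())

print(len(kept_default), len(kept_all))
```

If a body paragraph is misclassified as one of the filtered layout types, it disappears from the parsed output, which would match the symptom reported here. Emptying the list keeps that text, at the cost of also keeping real headers and footers.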