
[Bug]: RAGFlowPdfParser string index out of range

Open JudeZzz1997 opened this issue 10 months ago • 1 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch name

main

Commit ID

d159753d

Other environment information

No response

Actual behavior

D:\Program\Anaconda\anaconda3\envs\deepdoc\python.exe "D:\Jude\study project\my_deepdoc\parser\pdf_parser.py"
WARNING:root:Miss outlines
dt_boxes num : 12, elapsed : 0.3172268867492676
dt_boxes num : 30, elapsed : 0.3868682384490967
dt_boxes num : 66, elapsed : 0.3832690715789795
dt_boxes num : 15, elapsed : 0.38872528076171875
dt_boxes num : 34, elapsed : 0.40897107124328613
dt_boxes num : 34, elapsed : 0.40706396102905273
dt_boxes num : 38, elapsed : 0.3497426509857178
dt_boxes num : 36, elapsed : 0.4208505153656006
dt_boxes num : 40, elapsed : 0.36147046089172363
dt_boxes num : 36, elapsed : 0.36443257331848145
preprocess
Traceback (most recent call last):
  File "D:\Jude\study project\my_deepdoc\parser\pdf_parser.py", line 1176, in <module>
    ans = parser("../dataset/扫描件.pdf")
  File "D:\Jude\study project\my_deepdoc\parser\pdf_parser.py", line 1032, in __call__
    self._concat_downward()
  File "D:\Jude\study project\my_deepdoc\parser\pdf_parser.py", line 502, in _concat_downward
    dfs(boxes[0], 1)
  File "D:\Jude\study project\my_deepdoc\parser\pdf_parser.py", line 493, in dfs
    fea = self._updown_concat_features(up, down)
  File "D:\Jude\study project\my_deepdoc\parser\pdf_parser.py", line 100, in _updown_concat_features
    up["text"][-1] + down["text"][0]) else "")
IndexError: string index out of range
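
The failing expression indexes `up["text"][-1]` and `down["text"][0]`, and either one raises `IndexError` when the corresponding string is empty, which can happen if OCR returns a box with no recognized text. The sketch below shows one possible guard. It is not the fix that was actually committed to ragflow; the helper name `boundary_chars` is hypothetical, and the only thing assumed about the real data is that each box is a dict with a `"text"` string, as the traceback suggests.

```python
def boundary_chars(up: dict, down: dict) -> str:
    """Return the last char of up["text"] joined with the first char of down["text"].

    Hypothetical helper, not part of ragflow. Returns "" when either text
    is empty instead of raising IndexError.
    """
    up_text = up.get("text", "")
    down_text = down.get("text", "")
    if not up_text or not down_text:
        # At least one side has no characters, so there is no boundary pair.
        return ""
    return up_text[-1] + down_text[0]


# Plain indexing fails on an empty string.
try:
    ""[-1]
    raised = False
except IndexError:
    raised = True

# Slicing never raises: it yields "" when the string is empty.
safe_last = ""[-1:]
safe_first = ""[:1]

pair_ok = boundary_chars({"text": "ab"}, {"text": "cd"})
pair_empty = boundary_chars({"text": ""}, {"text": "cd"})
```

Slicing (`text[-1:]`, `text[:1]`) is the shorter alternative when an empty result is acceptable downstream. The explicit check is clearer when the caller needs to tell "no boundary" apart from a real character pair.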

Expected behavior

No response

Steps to reproduce

After bug <https://github.com/infiniflow/ragflow/issues/446> was fixed, another IndexError appeared when I ran the HuParser class in deepdoc/parser/pdf_parser.py to parse a scanned PDF.

Additional information

No response

JudeZzz1997 • Apr 19 '24 07:04

fixed.

KevinHuSh • Apr 28 '24 06:04

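Parsing still fails for certain PDFs, but I have not found what distinguishes the PDFs that parse correctly from the ones that fail. It may be because the PDFs contain images.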

AMAG-AB • May 30 '24 10:05