ragflow
[Bug]: RAGFlowPdfParser string index out of range
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Branch name
main
Commit ID
d159753d
Other environment information
No response
Actual behavior
D:\Program\Anaconda\anaconda3\envs\deepdoc\python.exe "D:\Jude\study project\my_deepdoc\parser\pdf_parser.py"
WARNING:root:Miss outlines
dt_boxes num : 12, elapsed : 0.3172268867492676
dt_boxes num : 30, elapsed : 0.3868682384490967
dt_boxes num : 66, elapsed : 0.3832690715789795
dt_boxes num : 15, elapsed : 0.38872528076171875
dt_boxes num : 34, elapsed : 0.40897107124328613
dt_boxes num : 34, elapsed : 0.40706396102905273
dt_boxes num : 38, elapsed : 0.3497426509857178
dt_boxes num : 36, elapsed : 0.4208505153656006
dt_boxes num : 40, elapsed : 0.36147046089172363
dt_boxes num : 36, elapsed : 0.36443257331848145
preprocess
Traceback (most recent call last):
File "D:\Jude\study project\my_deepdoc\parser\pdf_parser.py", line 1176, in
IndexError: string index out of range
Expected behavior
No response
Steps to reproduce
After bug <https://github.com/infiniflow/ragflow/issues/446> was fixed, another IndexError appeared when I ran the HuParser class in deepdoc/parser/pdf_parser.py to parse a scanned PDF.
Additional information
No response
fixed.
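Some specific PDFs still fail to parse, but I have not found what distinguishes the PDFs that parse correctly from the ones that fail. Perhaps it is because those PDFs contain images.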
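For anyone debugging the same message: in Python, "string index out of range" is what you get when code indexes into a string that is shorter than expected, most commonly an empty one. The sketch below only illustrates that failure mode and one way to guard against it. It assumes the cause is an empty text string for a region with no recognized text (for example an image), which this thread does not confirm. The helpers `first_char` and `last_char` and the `boxes` data are hypothetical and are not taken from pdf_parser.py.

```python
def first_char(text: str, default: str = "") -> str:
    """Return the first character of text, or default when text is empty."""
    # text[0] raises IndexError("string index out of range") on an empty string,
    # so check for emptiness before indexing.
    return text[0] if text else default


def last_char(text: str, default: str = "") -> str:
    """Return the last character of text, or default when text is empty."""
    return text[-1] if text else default


# Hypothetical OCR output for one page. The empty and whitespace-only entries
# stand in for regions where no text was recognized (an assumption).
boxes = [{"text": "Chapter 1"}, {"text": ""}, {"text": "  "}]

for box in boxes:
    stripped = box["text"].strip()
    # Unguarded stripped[0] would fail on the second and third boxes.
    print(repr(first_char(stripped)), repr(last_char(stripped)))
```

If a failing PDF is available, printing the string that is being indexed at the line named in the traceback should show whether it is empty.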