Ignore page 1 due to making page error: list index out of range - font underline interferes with table parsing

Open ALawating-Rex opened this issue 3 years ago • 4 comments

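Hello author, I printed a PDF from a C++ program. When text in the exported PDF is both bold and underlined, the conversion fails with: Ignore page 1 due to making page error: list index out of range. Removing the underline makes the conversion work. Also, when the text is a single bold digit, for example a bold 1, the converted Word document shows two 1s, i.e. 11.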

ALawating-Rex avatar Sep 01 '22 04:09 ALawating-Rex

Uploading 1.pdf…

ALawating-Rex avatar Sep 01 '22 04:09 ALawating-Rex

The PDF can also be downloaded here: https://github.com/ALawating-Rex/filePutON/blob/main/pdf2docx/pdf/withUnderline.pdf

ALawating-Rex avatar Sep 01 '22 04:09 ALawating-Rex

Thanks for providing the test file.

I printed a PDF from a C++ program. When text in the exported PDF is both bold and underlined, the conversion fails with: Ignore page 1 due to making page error: list index out of range

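The two underlines, together with the surrounding borders, are mistakenly parsed as a table. No reasonable table structure can be derived from them, hence the error. Table parsing will get a systematic improvement in the future.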
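To make the failure mode concrete, here is a minimal sketch of one way underline strokes could be separated from real border strokes before any table detection runs. It is not pdf2docx's implementation: the function names, the thresholds `max_gap` and `tol`, and the bounding-box format `(x0, y0, x1, y1)` with y growing downward are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical bounding-box format: (x0, y0, x1, y1), y grows downward.
BBox = Tuple[float, float, float, float]


def looks_like_underline(stroke: BBox, text_boxes: List[BBox],
                         max_gap: float = 3.0, tol: float = 2.0) -> bool:
    """Heuristic: a horizontal stroke that sits at the bottom edge of a
    text box and spans roughly the same horizontal range is an underline."""
    sx0, sy0, sx1, sy1 = stroke
    if (sy1 - sy0) > (sx1 - sx0):
        return False  # taller than wide, so not a horizontal stroke
    for tx0, _ty0, tx1, ty1 in text_boxes:
        # Stroke top lies near the text bottom (slightly inside or just below).
        near_bottom = -tol <= (sy0 - ty1) <= max_gap
        # Stroke starts and ends roughly where the text does.
        same_span = abs(sx0 - tx0) <= tol and abs(sx1 - tx1) <= tol
        if near_bottom and same_span:
            return True
    return False


def split_strokes(strokes: List[BBox],
                  text_boxes: List[BBox]) -> Tuple[List[BBox], List[BBox]]:
    """Return (border_candidates, underlines)."""
    borders, underlines = [], []
    for stroke in strokes:
        if looks_like_underline(stroke, text_boxes):
            underlines.append(stroke)
        else:
            borders.append(stroke)
    return borders, underlines


if __name__ == "__main__":
    text = [(20.0, 20.0, 70.0, 32.0)]          # one bold, underlined word
    strokes = [
        (10.0, 10.0, 200.0, 10.5),             # top border
        (10.0, 40.0, 200.0, 40.5),             # bottom border
        (10.0, 10.0, 10.5, 40.5),              # left border (vertical)
        (20.0, 32.5, 70.0, 33.2),              # underline below the word
    ]
    borders, underlines = split_strokes(strokes, text)
    print(len(borders), len(underlines))
```

Only the strokes left in `borders` would then be offered to a table parser. A real document needs more care, for example underlines that cover only part of a line, so treat the thresholds as placeholders.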

Also, when the text is a single bold digit, for example a bold 1, the converted Word document shows two 1s, i.e. 11

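This PDF simulates bold text by drawing the same text twice, slightly overlapped. For example, copying Formula directly from the PDF and pasting it yields FFoorrmmuullaa. PyMuPDF, the upstream library that extracts text from the PDF, recognizes FFoorrmmuullaa as two words, Formula and Formula, that almost completely overlap, so pdf2docx can later drop one of them by detecting the overlap (see the conversion log below). Other words, including digits such as 2, behave the same way. The original 1, however, is returned by PyMuPDF directly as 11 rather than as two overlapping 1s, so pdf2docx cannot deduplicate it.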

[WARNING] Ignore Line "Formula" due to overlap
[WARNING] Ignore Line "Value" due to overlap
[WARNING] Ignore Line "2 " due to overlap
[WARNING] Ignore Line "3 " due to overlap
[WARNING] Ignore Line "Method" due to overlap
[WARNING] Ignore Line "GA" due to overlap
[WARNING] Ignore Line "Fetus A" due to overlap
[WARNING] Ignore Line "2D" due to overlap
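
The overlap check behind these warnings can be sketched roughly as follows. This is an illustration under assumptions, not pdf2docx's actual code: the `Line` class, the `overlap_ratio` helper, and the 0.9 threshold are hypothetical names and values. The sketch also shows why the 11 case slips through: there is only one line with the text 11, so there is no overlapping twin to drop.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical bounding-box format: (x0, y0, x1, y1).
BBox = Tuple[float, float, float, float]


@dataclass
class Line:
    text: str
    bbox: BBox


def overlap_ratio(a: BBox, b: BBox) -> float:
    """Intersection area divided by the area of the smaller box."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0  # boxes do not intersect
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    smaller = min(area_a, area_b)
    return (w * h) / smaller if smaller > 0 else 0.0


def drop_overlapping_duplicates(lines: List[Line],
                                threshold: float = 0.9) -> List[Line]:
    """Keep the first of any lines that share the same text and whose
    boxes overlap by at least `threshold`; drop the later copies."""
    kept: List[Line] = []
    for line in lines:
        is_duplicate = any(
            k.text == line.text
            and overlap_ratio(k.bbox, line.bbox) >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(line)
    return kept


if __name__ == "__main__":
    lines = [
        Line("Formula", (10.0, 10.0, 60.0, 22.0)),
        Line("Formula", (10.3, 10.0, 60.3, 22.0)),  # fake-bold copy, shifted
        Line("11", (70.0, 10.0, 82.0, 22.0)),       # already merged upstream
    ]
    print([l.text for l in drop_overlapping_duplicates(lines)])
    # prints ['Formula', '11']
```

Comparing text as well as position keeps the check from removing two different lines that happen to sit on top of each other.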

dothinking avatar Sep 25 '22 10:09 dothinking

Thanks for the reply, looking forward to the fix :)

ALawating-Rex avatar Sep 25 '22 10:09 ALawating-Rex