dothinking
Many thanks for providing a good case.

> Is this the problem of PDF data parsing or the problem of data backfilling during layout restoration when word is finally generated?...
Sorry to say no, for now. Header/footer support is in my feature backlog, but unfortunately it is not implemented yet, since I haven't had time for this library recently.
Do you mean math equations/formulas/expressions? If so, you're correct: these are not typical text, and `pdf2docx` doesn't parse equations for now. But it would be a good idea to include this...
Agreed, that's also what I have in mind. Hopefully it can be enhanced in two stages:
- in the short term, recognize the bbox of the equation and crop a picture accordingly as...
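@alexw994 The intent is to hide all text and keep only the images. For single-channel images, the colors of images extracted through `pymupdf` are wrong, so I switched to taking a screenshot of the region directly. To avoid capturing any potential text in that screenshot, all text is hidden beforehand. In the screenshot you uploaded, everything that remains should be an image, including the "text" you see. If it really is text that was not hidden, please provide the original PDF for testing if convenient. Thanks.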
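If the five-pointed star you mentioned is a character, the problem was probably caused by an earlier font-name error. Please upgrade to the latest version and try again.

```
pip install pdf2docx --upgrade
```

If it is a vector graphic, it may be a layout recognition problem, in which case it would be best to provide the PDF file for further analysis. Thanks.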
If you're still working on this file, please upgrade pdf2docx to the latest version and give it a try. If the issue persists, could you please provide a PDF file for...
Feel free to reopen it for further discussion.
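Many thanks for pointing out the problem and providing the test file. The released features of this library do not yet do any layout analysis. Instead, they apply rules directly to the raw information exported from the PDF, so slightly more complex layouts such as scientific papers easily end up with wrongly split paragraphs and sections. I am currently researching layout analysis on and off in my spare time, and I hope the next version can improve this. Thanks for your support.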
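Layout analysis is the missing piece of pdf2docx. The current work amounts to only a mechanical two-column split. I will study your algorithm and hope to extend and integrate it into this library. Many thanks.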
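To illustrate what a purely mechanical two-column split looks like, here is a minimal sketch. It is not the pdf2docx implementation: the function name `split_two_columns`, the `(x0, y0, x1, y1)` bbox tuples, and the rule of cutting at the page midline are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical bbox representation: (x0, y0, x1, y1) in page coordinates,
# with y increasing downwards.
BBox = Tuple[float, float, float, float]


def split_two_columns(blocks: List[BBox], page_width: float):
    """Assign each block to the left column, the right column, or 'spanning'.

    Assumption: the column gutter sits at the page midline. A block that
    crosses the midline is treated as full-width (e.g. a title).
    """
    mid = page_width / 2.0
    left, right, spanning = [], [], []
    for box in blocks:
        x0, _, x1, _ = box
        if x1 <= mid:
            left.append(box)
        elif x0 >= mid:
            right.append(box)
        else:
            spanning.append(box)

    # Reading order inside each group: top to bottom.
    def by_top(b: BBox) -> float:
        return b[1]

    return sorted(left, key=by_top), sorted(right, key=by_top), sorted(spanning, key=by_top)


blocks = [
    (320, 100, 560, 200),  # right column
    (50, 300, 290, 400),   # left column, lower block
    (50, 100, 290, 200),   # left column, upper block
    (50, 40, 560, 80),     # full-width title
]
left, right, spanning = split_two_columns(blocks, page_width=612)
print(left, right, spanning)
```

A fixed midline breaks down quickly on scientific papers with figures, footnotes, or uneven columns, which is why a real layout analysis step is needed.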