dothinking comments

Results: 56 comments by dothinking


Is there any way to improve the layout restoration?

Many thanks for providing a good case.

> Is this the problem of PDF data parsing or the problem of data backfilling during layout restoration when word is finally generated?...

Header Text

Sorry to say no, for now. Header/footer support has been in my feature backlog, but unfortunately it is not implemented yet, since I haven't had time to work on this library recently.

tex2pdf2word

Do you mean math equations/formulas/expressions? If so, you're correct: these are not typical text, and `pdf2docx` doesn't parse equations for now. But it would be a good idea to include this...

tex2pdf2word

Agreed, that's also what I have in mind. Hopefully it can be enhanced in two stages: - in the short term, recognize the bbox of the equation and crop a picture accordingly as...

The `_hide_page_text` function does not hide all of the text

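@alexw994 The intent is to hide all text and leave only the images. For single-channel images, the colors of images extracted through `pymupdf` are wrong, so I switched to taking a screenshot directly. To avoid capturing any potential text in that screenshot, all text is hidden beforehand. In the screenshot you uploaded, everything that remains should be images, including the "text" you see. If it really is text that was not hidden, please provide the original PDF for testing if convenient. Thanks.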

Some symbols cannot be recognized

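If the five-pointed star you mentioned is a character, the problem was probably caused by an earlier font-name error; please upgrade to the latest version and try again.

```
pip install pdf2docx --upgrade
```

If it is a vector graphic, it may be a layout recognition problem; it would be best to provide the PDF file for further analysis. Thanks.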

SystemError: <built-in function Page_get_cdrawings> returned a result with an error set

If you're still working on this file, please upgrade pdf2docx to the latest version and have a try. If the issue persists, could you please provide a PDF file for...

Handle index error in paragraphs

Feel free to reopen it for further discussion.

Paragraph segmentation has some problems

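Thank you very much for pointing out the problem and providing the test file. The features released in this library so far do not yet do any layout analysis; instead, they apply rules directly to the raw information exported from the PDF. As a result, paragraphs and sections are easily split incorrectly for somewhat more complex layouts such as scientific papers. I am currently using my spare time to do some on-and-off research on layout analysis, and I hope the next version can improve this. Thanks for your support.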

Some discussion on multi-column layout / layout analysis

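Layout analysis is the missing piece of pdf2docx; the current work amounts to no more than a mechanical two-column split. I will study your algorithm and hope to extend and integrate it into this library. Many thanks.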
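To make "mechanical two-column split" concrete, here is a minimal sketch of one way such a split could work. It is not pdf2docx's actual implementation: the function name `split_two_columns`, the `min_gap` parameter, and the `(x0, y0, x1, y1)` bbox tuples are all assumptions made for illustration. The idea is to look for the widest vertical strip that no block crosses and cut the page there.

```python
from typing import List, Tuple

# Hypothetical bbox format: (x0, y0, x1, y1) in page coordinates.
BBox = Tuple[float, float, float, float]


def split_two_columns(blocks: List[BBox], min_gap: float = 10.0):
    """Split blocks into (left, right) at the widest empty vertical strip.

    Illustrative only; `right` is empty when no strip of at least
    `min_gap` width separates the blocks.
    """
    if not blocks:
        return [], []

    # Sweep blocks by left edge, tracking the rightmost edge seen so far.
    ordered = sorted(blocks, key=lambda b: b[0])
    best_gap, split_x = 0.0, None
    right_edge = ordered[0][2]
    for b in ordered[1:]:
        gap = b[0] - right_edge
        if gap > best_gap:
            # Cut in the middle of the widest gap found so far.
            best_gap, split_x = gap, (right_edge + b[0]) / 2
        right_edge = max(right_edge, b[2])

    # No sufficiently wide gap: treat the page as a single column.
    if split_x is None or best_gap < min_gap:
        return sorted(blocks, key=lambda b: b[1]), []

    # Reading order inside each column is top to bottom.
    left = sorted((b for b in blocks if b[2] <= split_x), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= split_x), key=lambda b: b[1])
    return left, right
```

A split like this knows nothing about headings that span both columns, figures, or pages with more than two columns, which is the kind of case a real layout analysis step would have to handle.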