dothinking comments

Results: 56 comments of dothinking


Issue with graphs and bullets

Sorry for the very late reply. - I guess the bullets issue is caused by a wrong font name, so the bullets can't be rendered correctly. The font name issue was...

pdf2docx/common/share.py

Thanks for reporting this. The following should work as a workaround. It will be fixed in a new version.

```python
num = len(components or (1,1,1)) # white if components is None
```
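For readers unfamiliar with the idiom, here is a minimal sketch of what the fallback in the workaround does. The helper name `component_count` is hypothetical and not part of pdf2docx; it only isolates the `or` expression so it can be tried in isolation.

```python
def component_count(components):
    """Return the number of color components, treating None as white (1, 1, 1).

    Hypothetical helper for illustration only; not part of pdf2docx.
    """
    # `components or (1, 1, 1)` evaluates to the fallback tuple when
    # `components` is None (or empty), so len() never receives None.
    return len(components or (1, 1, 1))


print(component_count(None))          # falls back to the white tuple
print(component_count([0.5]))         # one component (gray)
print(component_count([0, 0, 0, 1]))  # four components (CMYK)
```

Note that the fallback only affects the length computation; any later code that indexes `components` directly would still need its own handling of `None`.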

Repeated parsing of float_image makes conversion very slow

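Thanks for reporting this. The current version does have this problem. The version under development refactors the parsing method to address several common problems, including this one and others such as paragraph splitting. Unfortunately my time has been limited recently, so it is still in progress. Regarding what you mentioned:

> I found that during parsing the program cuts one image into pieces and parses them as float_image

In fact, it was the program that generated the original PDF that cut the complete image into pieces. The current version of pdf2docx tries to detect this case and stitch the image back together, and the slowness is most likely caused by that algorithm.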
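To illustrate the kind of stitching described above, here is a rough sketch that merges image tiles whose bounding boxes touch. This is not pdf2docx's actual algorithm; the function names, the `(x0, y0, x1, y1)` box format, and the `tol` tolerance are all assumptions made for the example. The naive pairwise comparison also hints at why such a step can become slow when a page contains many tiles.

```python
def touches(a, b, tol=1.0):
    """True if boxes a and b (x0, y0, x1, y1) overlap or lie within tol of each other."""
    return (a[0] <= b[2] + tol and b[0] <= a[2] + tol and
            a[1] <= b[3] + tol and b[1] <= a[3] + tol)


def union(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_tiles(boxes, tol=1.0):
    """Repeatedly merge touching boxes until a full pass makes no change.

    Hypothetical helper: each pass compares every box against the groups
    collected so far, so the cost grows quickly with the number of tiles.
    """
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        result = []
        for box in merged:
            for i, other in enumerate(result):
                if touches(box, other, tol):
                    result[i] = union(box, other)
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return merged


tiles = [(0, 0, 10, 10), (10, 0, 20, 10), (0, 10, 20, 20), (100, 100, 110, 110)]
print(merge_tiles(tiles))
```

In this sample, the first three tiles share edges and collapse into one box, while the distant fourth tile stays separate.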

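When converting PDF to Word, the dotted leader lines on the table-of-contents page of the original PDF are lost, and so are the click-to-jump links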

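The `table of contents` feature is indeed not supported yet; it is simply treated as body text. I will look into supporting it in the future, but it may take some time.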

Error faced when running pdf2docx; please kindly assist!

> Awaiting for assistance which I hope comes soon HAHA

Sorry for the late reply, coming half a year later. If you're still working on this topic, a test...

Text missing when extracting from a three-column PDF

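Thanks for the issue and for providing the test file. The main problem is that layout analysis in `pdf2docx` is currently weak, while the layout of the test file is relatively complex. I don't have a good approach for this yet, so it may take some time. #258 mentions a related idea, for reference.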

Text missing when extracting from a three-column PDF

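> Have you considered adding a layout analysis model? That is my current approach. Thank you for your project, it has helped a lot.

I did consider it, but in my limited attempts I have not yet found a model that balances quality and generality, so the idea is shelved indefinitely. If you have a good recommendation, you are welcome to share it or contribute. Thanks.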

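Work on removing headers and footers

> During a previous internship I worked on PDF-to-txt conversion. The PDF-to-Word step used this library (pdf2docx), and the Word-to-txt step was hand-written. It also largely managed to remove headers and footers, but it only works when the output is txt (multi-column tables are not extracted). During my internship I converted more than 5 million PDFs to txt and deployed the tool as an internal service at the company. The intern who took over after I left optimized it further; I did not ask about the specific improvements. I would like to see how much demand there is for this. I could either create a new open-source library or open a PR against pdf2docx. If you need this, please leave a comment below.

Currently headers and footers are both treated as body text, and many people have said they would like header and footer detection. You are welcome to share ideas for detecting headers and footers, or to open a PR directly.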

pdf2docx is not able to join tables that span across two pages

Thanks for the feature request. Tables that span pages are not supported yet, but I will take time to look into it.
