pdf2docx issues

Conversion very irregular and out of format

2

Please see the attached .pdf file and the resulting .docx file. The format becomes split and very weird. [generated.docx](https://github.com/dothinking/pdf2docx/files/7277825/generated.docx) [out.pdf](https://github.com/dothinking/pdf2docx/files/7277826/out.pdf)

HassanRaza1313

Header Text

2

When converting a PDF to a docx file, this library is sometimes unable to turn the PDF's headers/footers into docx headers/footers. Does this library have a solution to my...

Tonystarq

feature

question

tex2pdf2word

3

pdf2docx cannot convert `\begin{cases}` and `\frac{}{}` correctly.

ghost

enhancement

feature

Is it possible to use OCR to support scanned PDFs?

7

Could an OCR or layout-analysis interface be provided to support scanned PDFs?

bikerr

feature

question

Ignore page 1 due to making page error: list index out of range - font underline interferes with table parsing

4

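Hello author, I generate PDFs from a C++ program. When the text in the exported PDF is bold and underlined, the program reports the error: `Ignore page 1 due to making page error: list index out of range`. Removing the underline makes it work. Also, when the text is a bold digit, for example a bold 1, the converted Word document shows two 1s, i.e.: ...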

ALawating-Rex

bug

upstream bug

bug issue

Thanks to the author for providing such a good tool! I found a small bug while using it: [pdf2docx/table/TableStructure.py#L400-L407](https://github.com/dothinking/pdf2docx/blob/313be6223798516a10ebf381de473b0be56953af/pdf2docx/table/TableStructure.py#L400-L407). If the `for` loop at line 404 is never entered, `cells` is `[[]]`, and the code that follows fails with 'list index out of range'.
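The failure mode described in this report can be reproduced in isolation. The following is a minimal sketch, not the code from `TableStructure.py`: the helper name `non_empty_rows` and the sample data are hypothetical, and the only thing taken from the report is the assumption that `cells` can end up as `[[]]` when the loop body never runs.

```python
def non_empty_rows(cells):
    """Return only the rows that contain at least one cell (hypothetical helper)."""
    return [row for row in cells if row]


# State described in the report: the loop never ran, so the only row is empty.
cells = [[]]

try:
    first = cells[0][0]  # indexing into the empty row
except IndexError as exc:
    error_message = str(exc)  # the same message the report quotes

# Guarded access: drop empty rows first, and treat "nothing left" as "no table".
rows = non_empty_rows(cells)
first_cell = rows[0][0] if rows else None
```

Whether the right fix in the library is to filter empty rows or to skip the table earlier is a design decision for the maintainers; the sketch only shows why `[[]]` is enough to trigger the error.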

Mr-wang2016

Some discussion on multi-column layout / layout analysis

5

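Hello developers, I am the author of [Umi-OCR](https://github.com/hiroi-sora/Umi-OCR). Umi-OCR is an open-source OCR application, and I am currently developing support for recognizing scanned PDFs. One difficulty is that the order of the text blocks returned by OCR often does not match the actual reading order, especially in multi-column documents. I need to use the document's layout to separate the columns correctly and sort the text blocks in reading order. pdf2docx also includes some rule-based layout parsing. I read through part of the code, which gave me some ideas. In the end I designed a new algorithm: [GapTree_Sort](https://github.com/hiroi-sora/GapTree_Sort_Algorithm) (gap tree sorting). It looks for the gaps between text blocks, cuts the page into vertical regions, and builds a layout tree. A pre-order traversal of the layout tree then yields a text order that matches human reading habits. Besides sorting text blocks, the layout tree can also be used to analyze other layout information. (However, it was not designed specifically for PDF, and it does not take into account information such as the tags attached to the block objects themselves.) The current rule matching in pdf2docx supports at most 2 columns, and the column widths cannot differ too much. GapTree_Sort supports more complex layouts, such as any number of columns (>2), unequal column widths, and blocks spanning multiple columns. In addition, for common layouts the algorithm's time complexity is only O(n), where n is the number of text blocks; a proof is in the repository. GapTree_Sort is a newly developed algorithm and may still have many rough edges; everyone is welcome to test it or offer suggestions. The repository contains sample code and a more detailed description of the algorithm. https://github.com/hiroi-sora/GapTree_Sort_Algorithm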
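To make the idea of "finding gaps between text blocks" concrete, here is a deliberately simplified sketch. It is not the GapTree_Sort algorithm and builds no layout tree: it only splits blocks into columns at vertical gaps that no block crosses, then reads each column top to bottom. The `Block` tuple layout and both function names are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed block format: (x0, y0, x1, y1), with y growing downward.
Block = Tuple[float, float, float, float]


def split_into_columns(blocks: List[Block]) -> List[List[Block]]:
    """Group blocks into columns separated by vertical gaps that no block crosses."""
    columns: List[List[Block]] = []
    right_edge = float("-inf")  # right-most x1 seen so far in the current column
    for block in sorted(blocks, key=lambda b: b[0]):
        if block[0] > right_edge:
            # Empty horizontal space lies before this block: start a new column.
            columns.append([])
        columns[-1].append(block)
        right_edge = max(right_edge, block[2])
    return columns


def reading_order(blocks: List[Block]) -> List[Block]:
    """Columns left to right, and top to bottom inside each column."""
    ordered: List[Block] = []
    for column in split_into_columns(blocks):
        ordered.extend(sorted(column, key=lambda b: b[1]))
    return ordered


# Three columns of unequal width, listed row by row as a line-based OCR pass might.
a, b = (0, 0, 100, 20), (0, 30, 100, 50)
c, d = (120, 0, 160, 20), (120, 30, 160, 50)
e = (180, 0, 400, 20)
result = reading_order([a, c, e, b, d])
```

This flat version breaks down as soon as a block spans several columns (a full-width heading would merge everything into one column), which is the kind of case the post says the layout tree in GapTree_Sort is meant to handle.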

hiroi-sora

feature

Are there plans to support formula conversion?

2

![image](https://github.com/ArtifexSoftware/pdf2docx/assets/37822176/9bbb1b57-5ac0-429c-b816-9246775c0297)

UchihaArk

Inconsistent Table formatting while conversion

2

Dear Developer, Thank you so much for developing such a cool and amazing library. It really is very helpful. I am using this functionality as a part of my bigger...

richa27gpt

The biggest problem with this project is its data structure design

6

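The overall data structure design makes it very hard for anyone other than the project's founder to fix issues and maintain the code. The recursive approach of make_docx is a very poor design in terms of both performance and maintainability...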

nunamia

enhancement

postponed
