pdf2docx issues

Running shows Ignore Line "<image>" due to overlap; after converting to docx, text in the PDF's tables is missing

4

Hello, when I run the PDF-to-docx conversion, I get the following output:
mupdf: expected object number
[INFO] Start to convert 581529124_40000_5783.pdf
[INFO] [1/4] Opening document...
[INFO] [2/4] Analyzing document...
[WARNING] Ignore hidden text checking due to UnicodeDecodeError in upstream library.
[WARNING]...

xxentropy

bug

pdf2docx fails on a PDF generated by LaTeX from a document with LaTeX documentclass=uspatent

Original source: [pdf2docx-test-doc.tex](https://people.freebsd.org/~yuri/pdf2docx-test-doc.tex)
PDF: [pdf2docx-test-doc.pdf](https://people.freebsd.org/~yuri/pdf2docx-test-doc.pdf)
DOCX: [pdf2docx-test-doc.docx](https://people.freebsd.org/~yuri/pdf2docx-test-doc.docx)
Problems in DOCX:
* The text is shown in 2 columns instead of 1 column on page#3
* Page#4 is left empty for...

yurivict

float_image is parsed repeatedly, which makes conversion very slow

3

Could you offer some pointers? If I am able to, I will submit a PR.

alexw994

Extra line breaks

Hi there! Thanks a lot, your tool is really awesome! There is one question: I am faced with the fact that not all but some of the paragraphs for some...

kazuser

A large amount of content is missing after parsing the PDF

3

Source PDF content ![image](https://user-images.githubusercontent.com/16497860/231669605-3dc1ee31-802b-4722-9a3e-e388865f3acb.png)
Content after parsing ![image](https://user-images.githubusercontent.com/16497860/231669660-a6e33b3d-3ec0-462f-b434-d213e446ef77.png)
Source PDF content ![image](https://user-images.githubusercontent.com/16497860/231669736-0218dff5-90c3-479e-8a65-15380df0d07e.png)
Content after parsing ![image](https://user-images.githubusercontent.com/16497860/231669817-15e57877-d1d9-41af-90a7-9255b20a0681.png)
[972fa50687484dd6.pdf](https://github.com/dothinking/pdf2docx/files/11218727/972fa50687484dd6.pdf)

ruoshuixuelabi

Table header data cannot be extracted

2

First of all, thank you for providing such a great tool! When using the extract table method, some of the extracted data is missing, as shown in the image: ![image](https://github.com/dothinking/pdf2docx/assets/10828528/5fc9174c-69e7-443a-9fe5-64ce8cf024cc) Test file: [test1.pdf](https://github.com/dothinking/pdf2docx/files/11493902/test1.pdf)

fefefefefefe

After extract_tables some values have <NEST TABLE>

7

Hello. After applying the function `extract_tables`, in some lists I get values `<NEST TABLE>`. Is it possible to extract data from `<NEST TABLE>`? If necessary, I can give an example pdf file,...

Rustemhak
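For anyone hitting the same thing, here is a minimal sketch for locating the affected cells. It assumes the extracted result is a list of tables, each a list of rows, each a list of cell values (strings or `None`); the helper name `find_nested_cells` and the sample data are hypothetical, not part of pdf2docx. It only reports where the `<NEST TABLE>` placeholder appears and does not recover the nested table's contents.

```python
# Placeholder string reported in this issue for cells that hold a nested table.
NEST_PLACEHOLDER = "<NEST TABLE>"


def find_nested_cells(tables):
    """Return (table_index, row_index, col_index) for each cell containing the placeholder.

    `tables` is assumed to be a list of tables, each a list of rows,
    each a list of cell values (str or None).
    """
    hits = []
    for t, table in enumerate(tables):
        for r, row in enumerate(table):
            for c, cell in enumerate(row):
                # Skip None or non-string cells; only string cells can carry the placeholder.
                if isinstance(cell, str) and NEST_PLACEHOLDER in cell:
                    hits.append((t, r, c))
    return hits


# Hypothetical sample shaped like the assumed output, not real data from a PDF.
sample = [
    [["Name", "Value"], ["a", "<NEST TABLE>"]],
    [["x", None]],
]
hits = find_nested_cells(sample)
```

Knowing the table, row, and column indices at least makes it possible to see which cells need separate handling.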

[WARNING] Ignore Line "<image>" due to overlap

1

[WARNING] Ignore Line "<image>" due to overlap. After this warning is shown, the conversion makes no further progress.

enjoyzed

information required

Underlined text is misinterpreted as table and following lines are impacted

Hi All, I have a PDF file that has lines with breaks/whitespace in between. The following lines are underlined and again followed by line breaks. While converting to DOCX, we see that...

vishyarjun

Content is lost when parsing tables

Thank you for providing such a great tool! I found a problem while using it: when parsing tables, some lines are lost. The original PDF file, the converted docx file, and the lost content (marked with red boxes) are attached. ![Lost content](https://user-images.githubusercontent.com/68527951/226574670-ea15510e-2bab-4654-8c84-287ce7090097.png) [page3.docx](https://github.com/dothinking/pdf2docx/files/11027018/page3.docx) [page3.pdf](https://github.com/dothinking/pdf2docx/files/11027020/page3.pdf)

Jason-Di
