pdf2docx issues

Paragraph segmentation is slightly off

1

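Hi, thanks to the author for such a great library! While using it recently, I had a file whose paragraphs I expected to be split like this: ![image](https://user-images.githubusercontent.com/1823364/184281868-91a26a63-acb8-48c9-a008-c3532852e6b0.png) However, the split seems to be off: in the second paragraph, because the spacing between words is wider, every word is treated as its own paragraph. ![image](https://user-images.githubusercontent.com/1823364/184281970-12b825af-650e-4410-9acd-5192be287a01.png) The original file is below; the problem is on page 1 of the file: [1.pdf](https://github.com/dothinking/pdf2docx/files/9313570/1.pdf)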

fruitbars

bug

enhancement

Tabulation Problem

Hello! I have a summary in a PDF that looks like this: ![image](https://user-images.githubusercontent.com/67562521/182177455-ed31a3e2-35dc-4a85-8427-483a5df8dce7.png) The problem is that when I convert the PDF to docx, I get: ![image](https://user-images.githubusercontent.com/67562521/182177795-0c14087f-3f12-4d6b-8692-14bbb7285159.png) The PDF file:...

Mathie01

<class 'UnicodeDecodeError'> returned a result with an error set

12

Environment: python 3.7, pip 22.1.2, pdf2docx 0.5.4, PyMuPDF 1.20.0, python-docx 0.8.11. Code used: ![image](https://user-images.githubusercontent.com/87413355/175495224-6c612050-9486-4af5-b5b0-922dd136eb54.png) Error output: ![image](https://user-images.githubusercontent.com/87413355/175495789-2d82cd3b-ab65-4356-9472-7c75a638b14f.png)

qq774724635

bug

upstream

index error: list index out of range

1

index error: list index out of range
File "D:\Anaconda\lib\site-packages\pdf2docx\text\Textspan.py", line 130
self.chars[0].origin, # the bottom left point of the first character
PS: when I modified the code, the program stopped...

sy108

input needed

Converted Chinese characters often come out as other Unicode characters, such as Kangxi radicals

1

After conversion, Chinese characters often come out as other Unicode characters, such as Kangxi radicals.

ZHangZHengEric

input needed
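One possible post-processing workaround for the Kangxi radical report above, offered only as a sketch: it assumes the unwanted characters fall in the Unicode Kangxi Radicals block (U+2F00 to U+2FD5) and maps them to their ordinary ideographs with the standard library's compatibility normalization. The function name `replace_kangxi_radicals` is made up for illustration and is not part of pdf2docx; the text would have to be extracted from the converted document first.

```python
import unicodedata

# Kangxi Radicals block; each code point has a compatibility mapping
# to the corresponding unified CJK ideograph.
KANGXI_START, KANGXI_END = 0x2F00, 0x2FD5


def replace_kangxi_radicals(text: str) -> str:
    """Replace Kangxi radical code points with regular CJK ideographs.

    Hypothetical helper: only characters inside the Kangxi block are
    normalized, so other compatibility characters (for example
    full-width letters) are left untouched.
    """
    out = []
    for ch in text:
        if KANGXI_START <= ord(ch) <= KANGXI_END:
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)
```

Restricting the normalization to the one block is a deliberate choice here: applying NFKC to the whole text would also rewrite full-width punctuation and similar characters, which may not be wanted.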

Vector graphics are incorrectly parsed as tables

5

![image](https://user-images.githubusercontent.com/30330903/139228227-64a52e2d-f874-43fb-850e-6560a43a904e.png) The problem is shown in the image. File link (246 KB): https://pan.baidu.com/s/1zYVu1UrAc2CyVpd6eT_LDg Extraction code: i481

mjTree

bug

information required

Analyzing document... cost 6 min due to lots of duplicated images

3

[Test pdf](https://github.com/dothinking/pdf2docx/files/8559585/5CE48DAAB7DB616A.pdf) [docx](https://github.com/dothinking/pdf2docx/files/8559613/5.docx)

convert log

```
[INFO] Start to convert g://pdf/5CE48DAAB7DB616A.pdf
[INFO] [1/4] Opening document...
[INFO] [2/4] Analyzing document...
[INFO] [3/4] Parsing pages...
[INFO] (1/29) Page 1
[INFO] (2/29) Page...
```

bikerr

bug

upstream
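For the duplicated-images report above, one general way to avoid repeating work is to key each image by a hash of its bytes and process each distinct image once. This is a sketch of that idea only: `dedupe_images` is a hypothetical helper, it assumes the images are available as raw `bytes`, and it does not reflect how pdf2docx or PyMuPDF handle images internally.

```python
import hashlib


def dedupe_images(images):
    """Collapse byte-identical images.

    Hypothetical helper. Returns (unique, index_map), where unique holds
    each distinct image once and index_map[i] is the position in unique
    of the i-th input image.
    """
    seen = {}        # digest -> position in unique
    unique = []
    index_map = []
    for blob in images:
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in seen:
            seen[digest] = len(unique)
            unique.append(blob)
        index_map.append(seen[digest])
    return unique, index_map
```

With this shape, any expensive per-image step runs over `unique` only, and `index_map` lets the results be placed back at every original position.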

Handle index error in paragraphs

1

Some documents can't be processed page by page due to an index error. As a result, pages are blank. This small fix handles the exception as pages are being extracted...

kcho-mirato
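The idea described in the entry above, catching the index error per page so that one bad page does not stop the rest, might look roughly like this. It is a generic sketch, not the code from the fix itself: `parse_pages_safely` and the `parse_page` callback are placeholder names, and nothing here uses the actual pdf2docx API.

```python
def parse_pages_safely(pages, parse_page):
    """Parse every page, recording failures instead of aborting.

    Hypothetical helper: parse_page is any callable that may raise
    IndexError. A failed page yields None in results and an entry in
    failed, so the caller can report which pages need attention.
    """
    results = []
    failed = []
    for index, page in enumerate(pages):
        try:
            results.append(parse_page(page))
        except IndexError as exc:
            results.append(None)
            failed.append((index, str(exc)))
    return results, failed
```

Keeping a `failed` list alongside the results means a skipped page is still visible to the user rather than silently coming out blank.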

There is no Hebrew support

4

Hey Author, it supports Hebrew and Arabic letters, but it writes them in inverted order. Where in the code does the conversion happen and where are the letters read? Can you give me...

Ravid-Levy

bug
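If "inverted" in the report above means that right-to-left text comes out in reversed character order, a crude post-processing check could look like the sketch below. It assumes each line is either entirely right-to-left or not, and simply reverses lines that contain strong RTL characters and no strong LTR ones. `visual_to_logical` is a hypothetical name, and this is not a substitute for the full Unicode bidirectional algorithm.

```python
import unicodedata

# Bidirectional classes of strong right-to-left characters
# ("R" covers Hebrew, "AL" covers Arabic letters).
RTL_CLASSES = {"R", "AL"}


def is_rtl_only(text: str) -> bool:
    """True if text has strong RTL characters and no strong LTR ones."""
    has_rtl = False
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            has_rtl = True
        elif bidi == "L":
            return False
    return has_rtl


def visual_to_logical(line: str) -> str:
    """Reverse a line stored in visual order (hypothetical helper)."""
    return line[::-1] if is_rtl_only(line) else line
```

Note the limitation: a line mixing Hebrew with digits would have its digits reversed too, so mixed content needs a proper bidi implementation.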

Is there any way to improve the layout restoration?

3

[1804.10371.pdf](https://github.com/dothinking/pdf2docx/files/8858068/1804.10371.pdf) [1804.10371.docx](https://github.com/dothinking/pdf2docx/files/8858069/1804.10371.docx)

liuxunfei

enhancement

question
