pdf2docx issues

2 tests fail

[log](https://freebsd.org/~yuri/pdf2docx-0.5.8-tests.txt) Version: 0.5.8 Python-3.9 FreeBSD-14.0

yurivict

Add PDF-to-text conversion with header and footer removal

lbboier

Linked text is lost entirely when converting PDF to docx

1

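Hello, I installed the latest version via pip on a Mac. When I used the command line to convert a PDF file to docx, all the content that carried links in the original PDF (both the text and the links) was lost. I don't know the cause, so I'm asking whether this is a bug or whether I'm doing something wrong.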

everydoc

pdf2docx 0.5.8: after converting the attached "深入浅出强化学习01.pdf" to docx, the first sentence of each paragraph is moved to the end

2

The following change in the `sort_in_line_order` function of common\Collection.py fixes it: replace `if not self.is_vertical_text:` with `if self.is_vertical_text:`

ericshenjs

input needed

Fix the pix.tobytes failure and the wrong rotation angle when a JPEG carrying rotation information is inserted into the docx

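The first problem was worked through in https://github.com/ArtifexSoftware/pdf2docx/issues/198. tobytes can now handle most cases, at the cost of a dependency on the PIL image library. For the second problem, I found that for some JPEG samples fitz cannot retrieve the EXIF data once the image has been inserted into a PDF, so the rotation can only be inferred from the matrix returned by get_image_rects. So far I have tested 7 EXIF cases for JPEG; if any are missing, they can be added. The specific sample: [7931 7937胶TBT3139-2021 1.pdf](https://github.com/ArtifexSoftware/pdf2docx/files/15386876/7931.7937.TBT3139-2021.1.pdf) ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/37769332/9d6e6f86-072b-4624-867c-53e91cdb0561) After the fix: ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/37769332/5b949dda-0f85-46f3-bbfa-3a6b4a18bb9c)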
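As an illustration of inferring orientation from a placement matrix, here is a minimal sketch. It is not the code from this pull request. It assumes the linear part of the matrix is available as four numbers `a, b, c, d` (the layout PyMuPDF's `Matrix` uses), measures the angle with the plain math convention `atan2(b, a)`, and treats a negative determinant as a mirror. The name `describe_transform` is hypothetical, and the sign of the angle may need flipping depending on the page coordinate system.

```python
import math


def describe_transform(a, b, c, d):
    """Return (rotation_degrees, mirrored) for the linear part of a 2D affine matrix.

    rotation_degrees is snapped to 0, 90, 180 or 270.
    mirrored is True when the matrix flips orientation (negative determinant).
    """
    det = a * d - b * c
    mirrored = det < 0
    if mirrored:
        # Assumption: model the mirror as a horizontal flip applied first,
        # so undo it on the first basis vector before measuring the angle.
        a, b = -a, -b
    angle = math.degrees(math.atan2(b, a))
    rotation = int(round(angle / 90.0)) % 4 * 90
    return rotation, mirrored
```

Four rotations combined with an optional mirror give eight orientations, which matches the eight EXIF orientation values (one default plus seven others).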

heweisheng

A question about the line-height allocation logic

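I have recently been restoring scanned documents with OCR (PaddlePaddle layout recognition plus reportlab to generate the restored PDF). Laying out a PDF is currently the easier route, so my plan is to produce a PDF first and then run pdf2docx on it. (Rather than spending time writing my own OCR-based layout, I feel this project could be extended directly, but I don't have time for that yet.) Looking at the PDF parsing, a paragraph may consist of multiple lines, and in that case the line height should be divided evenly among the lines. A concrete case where the problem shows up: [test_7.pdf](https://github.com/ArtifexSoftware/pdf2docx/files/15348474/test_7.pdf) ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/37769332/c1471671-bbc5-42c7-9e72-28373dbfd9de) ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/37769332/1c6a1d20-a2d3-4799-80a3-f1ed064faeb3) Converted using this logic: ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/37769332/5684d2bf-80b6-4d14-9aaa-83ea78762f22) With the line height divided evenly: ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/37769332/82d2688e-5e8f-4563-b710-7c1d4c32816f) Also, would it be possible to insert blank lines in between so that the layout stays as close to the original as possible?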
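The "divide evenly" idea can be sketched as follows. This is only an illustration of the arithmetic, not pdf2docx's actual spacing code; the function name `split_line_bands` and the top/bottom coordinates are hypothetical.

```python
def split_line_bands(block_top, block_bottom, line_count):
    """Split a paragraph's vertical extent into equal bands, one per line.

    Returns a list of (top, bottom) pairs from top to bottom.
    """
    if line_count <= 0:
        raise ValueError("line_count must be positive")
    line_height = (block_bottom - block_top) / line_count
    return [
        (block_top + i * line_height, block_top + (i + 1) * line_height)
        for i in range(line_count)
    ]
```

For a paragraph spanning y = 100 to y = 160 with three lines, each line gets a band 20 units high.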

heweisheng

Table is broken when the table is displayed on 2 pages

1

Hello, I am generating a PDF using wkhtml2pdf. In the PDF, the table is displayed on two pages because it is too big. When I convert this PDF into docx,...

pulse-mind

pdf2docx.Converter: a subprocess hangs when converting certain special PDFs to Word

3

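[1784766219303317505.pdf](https://github.com/ArtifexSoftware/pdf2docx/files/15233767/1784766219303317505.pdf) This is a PDF converted from a PPT; every page is an image plus a few text boxes. In a resource-limited pod, when pdf2docx.Converter is called many times in a row to convert it to Word, one of the conversions gets stuck at page 7 and does nothing further. Every time this happens, it is at page 7. ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/38556796/0121a7f9-e702-4286-870c-f97292bf101a)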
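Until the cause is found, one possible workaround is to run each conversion in its own interpreter with a time limit, so a stuck conversion does not block the caller. The sketch below is an assumption about how a caller might do that, not a fix for the hang itself. The helper names and the 300-second default are hypothetical; the child script uses only `Converter(...)`, `convert(...)`, and `close()` from pdf2docx.

```python
import subprocess
import sys

# Child script: converts argv[1] (PDF path) to argv[2] (docx path) with pdf2docx.
CONVERT_SCRIPT = """
import sys
from pdf2docx import Converter
cv = Converter(sys.argv[1])
try:
    cv.convert(sys.argv[2])
finally:
    cv.close()
"""


def run_python_with_timeout(script, args=(), timeout_s=300):
    """Run `script` in a fresh interpreter.

    Returns True if it exits with status 0, False if it exits non-zero
    or is killed for exceeding timeout_s seconds.
    """
    cmd = [sys.executable, "-c", script, *args]
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this.
        return False
    return result.returncode == 0


def convert_with_timeout(pdf_path, docx_path, timeout_s=300):
    """Convert one PDF in a child process; False means failure or timeout."""
    return run_python_with_timeout(CONVERT_SCRIPT, [pdf_path, docx_path], timeout_s)
```

Note that only the direct child is killed on timeout; any worker processes the child itself started may need separate cleanup.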

starlxx

enhancement

Fix negative `ref_dif` causing incorrect paragraph splitting

Fixes #282 Changes Made: - Modified line 452 in `common_vertical_spacing` of `Blocks.py` to ensure `ref_dif` is non-negative.

tehwenyi

PDFs containing XFA form fields cannot be converted to Word

Converting a PDF that contains XFA form fields to Word produces: Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document....
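A caller who wants to detect such files before converting could use a crude check like the one below. It is a heuristic sketch, not part of pdf2docx: it only looks for the `/XFA` key in the raw bytes, so it can miss files where the form dictionary sits inside a compressed object stream. The function names are hypothetical.

```python
def looks_like_xfa(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes contain an /XFA key."""
    return b"/XFA" in pdf_bytes


def file_looks_like_xfa(path: str) -> bool:
    """Read the whole file and apply the same heuristic."""
    with open(path, "rb") as f:
        return looks_like_xfa(f.read())
```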

HotSun6
