pdf2docx

Open source Python library for converting PDF to DOCX.

Results 122 pdf2docx issues

[log](https://freebsd.org/~yuri/pdf2docx-0.5.8-tests.txt) Version: 0.5.8 Python-3.9 FreeBSD-14.0

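Hello, I installed the latest version via pip on a Mac. When I converted a PDF file to docx from the command line, all the linked content in the original PDF (the text and the links) was lost. I don't understand why, so I'd like to ask: is this a bug, or am I doing something wrong?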

The following change in the sort_in_line_order function of common\Collection.py fixes it: comment out the original condition and use the inverted one. # if not self.is_vertical_text: if self.is_vertical_text:

input needed

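The first issue was tried out in https://github.com/ArtifexSoftware/pdf2docx/issues/198. At present tobytes can basically handle it, at the cost of adding a dependency on the PIL image library. For the second issue, I found that for some JPEG samples inserted into a PDF, fitz cannot retrieve the EXIF data, so the rotation can only be inferred from the matrix returned by get_image_rects. So far I have tested 7 EXIF cases for JPG; if any are missing, they can be added. Sample: [7931 7937胶TBT3139-2021 1.pdf](https://github.com/ArtifexSoftware/pdf2docx/files/15386876/7931.7937.TBT3139-2021.1.pdf) ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/37769332/9d6e6f86-072b-4624-867c-53e91cdb0561) After the fix: ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/37769332/5b949dda-0f85-46f3-bbfa-3a6b4a18bb9c)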
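
To illustrate the idea of inferring rotation from a transformation matrix, here is a minimal sketch. It is not the code from the fix. It assumes the linear part `(a, b, c, d)` of a 2D matrix is already at hand (in PyMuPDF such a matrix can be requested from `get_image_rects` with `transform=True`), and the function name `infer_orientation` is hypothetical. The sign convention for the angle depends on the coordinate system, so treat the returned rotation direction as an assumption to verify against real samples.

```python
import math


def infer_orientation(a, b, c, d):
    """Guess rotation and mirroring from the linear part of a 2D matrix.

    Hypothetical helper: (a, b, c, d) are the first four entries of a
    transformation matrix. Returns (rotation, mirrored), where rotation
    is snapped to the nearest multiple of 90 degrees in [0, 360).
    """
    # A negative determinant means the transform includes a reflection.
    mirrored = (a * d - b * c) < 0
    # The image of the x-axis unit vector is (a, b); its angle is the rotation.
    angle = math.degrees(math.atan2(b, a))
    rotation = int(round(angle / 90.0)) * 90 % 360
    return rotation, mirrored


# Unrotated image scaled to 100 x 50.
print(infer_orientation(100, 0, 0, 50))
# Quarter turn without mirroring.
print(infer_orientation(0, 100, -50, 0))
# Half turn combined with a reflection.
print(infer_orientation(-100, 0, 0, 50))
```

Rotation plus the mirror flag together distinguish eight orientations, which matches the number of values the EXIF orientation tag can take.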

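Recently I have been restoring scanned documents with OCR (PaddlePaddle layout recognition plus reportlab to generate the restored PDF). Laying out a PDF is currently the easier route, so I plan to produce a PDF first and then run pdf2docx on it. (Rather than spending time writing a layout engine driven by OCR output, I feel this project could be extended directly, but I don't have time for that yet.) Looking at the PDF parsing, several lines can end up in one paragraph, and in that case the line height should be divided evenly among the lines. A concrete case where the problem occurs: [test_7.pdf](https://github.com/ArtifexSoftware/pdf2docx/files/15348474/test_7.pdf) ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/37769332/c1471671-bbc5-42c7-9e72-28373dbfd9de) ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/37769332/1c6a1d20-a2d3-4799-80a3-f1ed064faeb3) Converted with this logic: ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/37769332/5684d2bf-80b6-4d14-9aaa-83ea78762f22) With the line height divided evenly: ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/37769332/82d2688e-5e8f-4563-b710-7c1d4c32816f) Also, would it be possible to insert blank lines in between so that the layout stays as close to the original as possible?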
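
The "divide the line height evenly" suggestion can be sketched as follows. This is only an illustration of the arithmetic, not pdf2docx code: the function name `split_line_heights` and its parameters are hypothetical, and it assumes a paragraph is described by its top and bottom coordinates and a known line count.

```python
def split_line_heights(block_top, block_bottom, n_lines):
    """Divide a paragraph's vertical extent evenly among its lines.

    Hypothetical helper: returns a list of (top, bottom) pairs, one per
    line, each with the same height.
    """
    if n_lines <= 0:
        raise ValueError("n_lines must be positive")
    line_height = (block_bottom - block_top) / n_lines
    return [
        (block_top + i * line_height, block_top + (i + 1) * line_height)
        for i in range(n_lines)
    ]


# A 30-unit-tall paragraph holding three lines gives 10 units per line.
for top, bottom in split_line_heights(0, 30, 3):
    print(top, bottom)
```

The point of the sketch is that each line receives `height / n_lines` rather than the full paragraph height.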

Hello, I am generating a PDF using wkhtml2pdf. In the PDF, the table is displayed on two pages because it is too big. When I convert this PDF into docx,...

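[1784766219303317505.pdf](https://github.com/ArtifexSoftware/pdf2docx/files/15233767/1784766219303317505.pdf) This is a PDF converted from a PPT; every page is an image plus a few text boxes. In a pod with resource limits, when pdf2docx.Converter is called many times in a row to convert it to Word, one of the conversions eventually hangs at page 7 and does nothing further. Every time this happens, it is at page 7. ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/38556796/0121a7f9-e702-4286-870c-f97292bf101a)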

enhancement

Fixes #282 Changes Made: - Modified line 452 in `common_vertical_spacing` of `Blocks.py` to ensure `ref_dif` is non-negative.

Converting a PDF that contains XFA form fields to Word produces the following result: Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document....