pdf2docx issues

Error during conversion when a font name is Chinese (e.g. “宋体”)

2

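As the title says, conversion fails when a font name is Chinese (e.g. “宋体”), with the error: bytes must be in range[0 to 255]. The failing line is at https://github.com/ArtifexSoftware/pdf2docx/blame/master/pdf2docx/common/share.py#L128. When the font name is Chinese, ord(c) is greater than 255, so converting to bytes raises an error.

```python
def decode(s:str):
    '''Try to decode a unicode string.'''
    b = bytes(ord(c) for c in s)  ### the error occurs here
    for...
```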
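The failure described in this report can be reproduced without pdf2docx: `bytes()` accepts only integers from 0 to 255, so any character with a code point above 255 raises `ValueError`. The sketch below shows one possible guard. `safe_decode` and its `encodings` parameter are hypothetical names, not part of pdf2docx, and the choice to return already-decoded names unchanged is an assumption about the intended behaviour.

```python
def safe_decode(s: str, encodings=("utf-8", "gbk")) -> str:
    """Hypothetical helper: re-decode a font name only if it is byte-like.

    Assumption: a name containing code points above 255 is already proper
    Unicode (e.g. "宋体") and should be returned unchanged.
    """
    # bytes() only accepts values in 0..255, so skip names that cannot
    # be a byte string disguised as text.
    if any(ord(c) > 255 for c in s):
        return s

    raw = bytes(ord(c) for c in s)
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # None of the candidate encodings matched: keep the original name.
    return s


# A name that was UTF-8 encoded but read one byte per character.
mangled = "宋体".encode("utf-8").decode("latin-1")
print(safe_decode(mangled))
print(safe_decode("宋体"))
```

Both calls should return the readable name, and the second no longer reaches the `bytes(...)` call that raised the error.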

hlhtddx

input needed

Any support for ANDROID?

Is there any support for Android? If so, how can I import this library? Any suggestions or documentation would be appreciated. Thanks guys!

rrsaikat

ValueError: unsupported colorspace for 'png'

```
File "/usr/local/lib/python3.7/site-packages/pdf2docx/page/RawPage.py", line 67, in restore
    raw_dict = self.extract_raw_dict(**settings)
File "/usr/local/lib/python3.7/site-packages/pdf2docx/page/RawPageFitz.py", line 33, in extract_raw_dict
    image_blocks = self._preprocess_images(**settings)
File "/usr/local/lib/python3.7/site-packages/pdf2docx/page/RawPageFitz.py", line 118, in _preprocess_images
    return ImagesExtractor(self.page_engine).extract_images(settings['clip_image_res_ratio'])
File "/usr/local/lib/python3.7/site-packages/pdf2docx/image/ImagesExtractor.py", line...
```
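The traceback is cut off inside image extraction, so the exact cause is not confirmed here. In PyMuPDF, this error commonly appears when a pixmap with more than three colour components (for example CMYK) is written as PNG. A minimal sketch of the usual workaround, assuming that is what happens in this case: `needs_rgb_conversion` and `to_png_bytes` are hypothetical helper names, while `fitz.Pixmap(fitz.csRGB, pix)` and `Pixmap.tobytes` are PyMuPDF calls.

```python
def needs_rgb_conversion(n: int, alpha: int) -> bool:
    """Return True when a pixmap has more colour components than PNG allows.

    n is the number of components per pixel including alpha, and alpha is
    1 if an alpha channel is present, else 0. Gray (1) and RGB (3) are
    fine; 4 or more colour components (e.g. CMYK) are not.
    """
    return n - alpha >= 4


def to_png_bytes(pix):
    """Hypothetical wrapper: convert to RGB if needed, then encode as PNG."""
    import fitz  # PyMuPDF, imported lazily so the helper above stays standalone

    if needs_rgb_conversion(pix.n, pix.alpha):
        pix = fitz.Pixmap(fitz.csRGB, pix)  # colorspace conversion to RGB
    return pix.tobytes("png")
```

Whether this check belongs inside pdf2docx's image extractor or in a preprocessing step on the PDF is a separate design question that the truncated report does not settle.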

bikerr

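Conversion to Word is too slow. How can I convert only part of the content? For example, convert only the tables in the PDF to Word, without headers, footers, or paragraphs; restricting the content this way might be faster.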

HotSun6

enhancement

postponed

How to preserve highlighting in a table after converting PDF to DOCX

4

I color the lines in the PDF across the entire width of the sheet, and everything is colored in the PDF. I need to convert it to DOCX format, but when converting, the...

Herrifly

enhancement

Content overflows the page after conversion

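I tried several PDFs, and all of them have content overflowing the page. For example: ![Original layout](https://github.com/ArtifexSoftware/pdf2docx/assets/154854888/a812d62c-6e61-428a-b8a9-54ae39bfcc50) ![Page overflow](https://github.com/ArtifexSoftware/pdf2docx/assets/154854888/dd3ac2dd-084f-4a17-b230-0cf0138684ca) How can this be solved? Would it be possible to add a check so that, if the text exceeds its bbox, the font size is reduced? The text would resize automatically to fit the box, sacrificing the text style in order to preserve the overall layout. This is just an idea of mine; I don't know whether it is feasible.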
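The shrink-to-fit idea in this report can be sketched as a small calculation. This is only an illustration of the proposal, not pdf2docx behaviour: `fitted_font_size`, its parameters, and the `min_size` floor are hypothetical, and it assumes a single line of text whose width scales linearly with font size.

```python
def fitted_font_size(font_size: float, text_width: float,
                     box_width: float, min_size: float = 6.0) -> float:
    """Hypothetical helper: shrink a font so one line of text fits its box.

    Assumption: rendered width is proportional to font size, so scaling
    the size by box_width / text_width makes the line exactly fit.
    """
    if text_width <= box_width:
        return font_size  # already fits, keep the original style
    scaled = font_size * box_width / text_width
    # Do not shrink below a readable minimum (assumed value).
    return max(scaled, min_size)


print(fitted_font_size(12.0, 200.0, 100.0))  # text twice as wide as the box
```

The floor means very long text can still overflow, which reflects the trade-off the report describes between keeping the text style and keeping the layout.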

cyxxg

bug

Negative ref_dif in Blocks.py causing paragraph splitting

I've encountered an issue with paragraph splitting in some documents, where certain pages separate sentences in the same paragraph into different text blocks while others do not. Upon investigation, I...

tehwenyi

Table borders in the PDF file cannot be restored

2

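I found two problems when processing a PDF:
1. A partially visible line of a hidden table in the PDF cannot be restored in the docx file. For example, the red line in the sample is one border of a table.
2. First-line indentation of text paragraphs cannot be reproduced.

The sample is shown below: ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/35327931/48b6e97c-2d70-4c6f-a211-6bbd904418cc) [zf1.pdf](https://github.com/ArtifexSoftware/pdf2docx/files/14913074/zf1.pdf)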

ericosmic

wontfix

upstream bug

[WARNING] Ignore Line "<image>" due to overlap

I'm getting output that isn't accurate. Some images aren't in the same position as in the original PDF. Here is a sample: Before: ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/89592598/ad05390c-3142-4046-9328-24d2b4955646) After: ![image](https://github.com/ArtifexSoftware/pdf2docx/assets/89592598/88675742-ebef-496a-88ad-5404d6ac7d4b) There image multiplied by...

GohanHango

Conversion error: unsupported colorspace for '{output}'

1

The log is as below:
[INFO] [1/4] Opening document...
[INFO] [2/4] Analyzing document...
unsupported colorspace for '{output}'

plainee
