pdf2docx
Open source Python library for converting PDF to DOCX.
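As the title says, when the conversion encounters a font whose name is in Chinese (for example “宋体”), it fails with the error bytes must be in range[0 to 255]. The failure occurs at https://github.com/ArtifexSoftware/pdf2docx/blame/master/pdf2docx/common/share.py#L128 When the font name is Chinese, ord(c) is greater than 255, so converting to bytes raises an error. ```python def decode(s:str): '''Try to decode a unicode string.''' b = bytes(ord(c) for c in s) ### the error occurs here for...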
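The failure described above can be reproduced and guarded against with a small sketch. This is not the library's actual fix: `safe_decode` is a hypothetical name, and the list of candidate encodings is an assumption made for illustration. The idea is simply to skip the byte round-trip when any character is above U+00FF, since such a string is already proper Unicode.

```python
def safe_decode(s: str) -> str:
    """Hypothetical helper: repair a mis-decoded name, or return it unchanged."""
    # bytes() only accepts integers in range(0, 256). A name such as "宋体"
    # contains code points far above 255, so it cannot be packed into single
    # bytes. It is already valid Unicode, so return it as-is.
    if any(ord(c) > 255 for c in s):
        return s

    raw = bytes(ord(c) for c in s)
    # Assumed candidate encodings, tried in order (illustrative only).
    for encoding in ("utf-8", "gbk"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing matched: fall back to the original string.
    return s
```

With this guard, a Chinese font name passes through untouched, while a name whose UTF-8 bytes were read as single-byte characters is still repaired.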
Is there any support for Android? Or, how can I import this library? I'd appreciate any suggestions or documentation, if there are any. Thanks, guys!
```
  File "/usr/local/lib/python3.7/site-packages/pdf2docx/page/RawPage.py", line 67, in restore
    raw_dict = self.extract_raw_dict(**settings)
  File "/usr/local/lib/python3.7/site-packages/pdf2docx/page/RawPageFitz.py", line 33, in extract_raw_dict
    image_blocks = self._preprocess_images(**settings)
  File "/usr/local/lib/python3.7/site-packages/pdf2docx/page/RawPageFitz.py", line 118, in _preprocess_images
    return ImagesExtractor(self.page_engine).extract_images(settings['clip_image_res_ratio'])
  File "/usr/local/lib/python3.7/site-packages/pdf2docx/image/ImagesExtractor.py", line...
```
I color the lines in the PDF across the entire sheet, so everything is colored in the PDF. I need to convert it to DOCX format, but when converting, the...
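I tried several PDFs, but in all of them content overflows the page. For example:   How can this be solved? Could a check be added so that, if the text exceeds its bbox, the font size is reduced? The text would resize automatically to fit the box, sacrificing the text style to preserve the overall layout. This is just an idea of mine, and I don't know whether it is feasible.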
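The shrink-to-fit idea proposed above could be sketched roughly as follows. This is only an illustration of the proposal, not something pdf2docx is known to do: `fit_font_size` and its parameters are hypothetical, and it assumes the rendered text width scales linearly with the font size.

```python
def fit_font_size(text_width: float, box_width: float,
                  font_size: float, min_size: float = 6.0) -> float:
    """Hypothetical helper: shrink font_size so the text fits box_width."""
    # Text already fits inside the bbox: keep the original size.
    if text_width <= box_width:
        return font_size
    # Assumption: width is proportional to font size, so scale by the ratio.
    scaled = font_size * box_width / text_width
    # Never go below a readable minimum (an assumed default of 6 pt).
    return max(scaled, min_size)
```

For instance, text measured at 120 pt wide in a 100 pt box at size 12 would be reduced to size 10, trading the original styling for a layout that stays inside the box.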
I've encountered an issue with paragraph splitting in some documents, where certain pages separate sentences in the same paragraph into different text blocks while others do not. Upon investigation, I...
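While recognizing a PDF I found two problems: 1. The DOCX file cannot reproduce the partially displayed border segments of a hidden table in the PDF; for example, the red line in the sample is one border line of a table. 2. Text paragraphs cannot get a first-line indent. The sample is shown below:  [zf1.pdf](https://github.com/ArtifexSoftware/pdf2docx/files/14913074/zf1.pdf)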
I'm getting an output that isn't accurate. Some images aren't in the same position as in the original PDF. Here is a sample: Before:  After:  There image multiplied by...
The log is as below:
[INFO] [1/4] Opening document...
[INFO] [2/4] Analyzing document...
unsupported colorspace for '{output}'