dothinking comments

Results: 56 comments by dothinking

<class 'UnicodeDecodeError'> returned a result with an error set

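> At the moment this looks like a PDF format problem. My PDF is in ANSI format; converting it to utf-8 works, but after the conversion there is no content left.

What do you mean by the PDF being in ANSI format, and how did you convert it to utf-8?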

<class 'UnicodeDecodeError'> returned a result with an error set

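> I ran into a similar problem with a PDF file created by MS Office 2007.

Would it be possible to upload your PDF, or send it to my email, so that I can look for the cause? Thanks.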

<class 'UnicodeDecodeError'> returned a result with an error set

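@ranger2001 Thanks for providing the test file. I can confirm that the problem is in `get_texttrace()`, a method provided by `PyMuPDF`, the upstream library used to process PDFs, which in turn wraps the corresponding function of `MuPDF` one level further upstream. A complete fix has to wait for an official upstream fix, which will take a fairly long time. As a temporary workaround, the error can be ignored with `try-except`. For example, find line 70 of `RawPageFitz.py` (...\site-packages\pdf2docx\page\RawPageFitz.py):

```python
try:
    spans = self.page_engine.get_texttrace()
except SystemError:
    # logging.warning('Ignore hidden text checking due to UnicodeDecodeError in upstream library.')
    spans =...
```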

<class 'UnicodeDecodeError'> returned a result with an error set

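`pdf2docx` uses `get_texttrace()` to detect hidden text and then decides, as needed, whether to write it to the converted docx. For example, some scanned PDF books, especially older documents, hide an OCR text layer behind the scanned image layer so that the text can be copied and searched. In that case, the parameter `ocr=0` or `ocr=1` selects whether only the image or only the text is written to the docx (avoiding overlap between image and text); see #132. Overall, hidden text does not need to be considered in the vast majority of cases, and `get_texttrace()` may only have problems with certain Chinese fonts, so the temporary fix above is suitable for the vast majority of cases.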
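The fallback pattern described above can be tried in isolation. The following is only a minimal sketch: `FakePageEngine` and `hidden_text_spans` are hypothetical names used for illustration, not part of `pdf2docx` or `PyMuPDF`, and falling back to an empty list is an assumption, since the original snippet is truncated at that point.

```python
import logging


class FakePageEngine:
    """Hypothetical stand-in for the page object; not a real PyMuPDF API."""

    def get_texttrace(self):
        # Simulate the upstream failure reported in this thread.
        raise SystemError(
            "<class 'UnicodeDecodeError'> returned a result with an error set"
        )


def hidden_text_spans(page_engine):
    """Return the text-trace spans, or an empty list if the upstream call fails."""
    try:
        return page_engine.get_texttrace()
    except SystemError:
        logging.warning(
            "Ignore hidden text checking due to UnicodeDecodeError in upstream library."
        )
        # Assumption: an empty list is treated as "no hidden text detected".
        return []


spans = hidden_text_spans(FakePageEngine())
```

With this stand-in, the conversion would carry on as if the page had no hidden text, which matches the intent of skipping the hidden-text check.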

Handle index error in paragraphs

Hi @kcho-mirato , many thanks for your pull request. I don't quite understand why `doc.paragraphs[-2]` might have index issue, because, at this moment, we have at least 2 sections, where...

Handle index error in paragraphs

@kcho-mirato no problem. Thanks!

There is no hebrew support

Hi Ravid, sorry for the late reply. This is not the first time I have received this issue report, but I'm not able to resolve it due to knowing nothing about the...

There is no hebrew support

> where the code do the convert and get the letter?

Extract text with `PyMuPDF`, which seems to recognize the RTL language correctly. Then, write the text to docx with `python-docx`. Based...

There is no hebrew support

By the way, the relation between `Line` and `TextSpan`: `Line` consists of a list of `TextSpan`-s, while the letters are contained in each `TextSpan`. For example, a line "a brown...
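The containment relation described here can be pictured with a simplified model. This is a sketch under assumptions: the dataclasses below are hypothetical and only mirror the names `Line` and `TextSpan`; they are not the actual `pdf2docx` class definitions, and the `chars` and `font` fields and the sample text are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextSpan:
    """Hypothetical span: a run of letters sharing one style."""
    chars: str
    font: str = "unknown"


@dataclass
class Line:
    """Hypothetical line: an ordered list of spans."""
    spans: List[TextSpan] = field(default_factory=list)

    @property
    def text(self) -> str:
        # The letters live in the spans; the line only concatenates them.
        return "".join(span.chars for span in self.spans)


line = Line([TextSpan("hello ", font="Regular"), TextSpan("world", font="Bold")])
```

Here the line itself stores no letters; any per-letter handling, such as reordering for an RTL language, would have to happen at the span level.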

Is there any way to improve the layout restoration?

Hi liuxunfei, it seems that no PDF or docx file was uploaded.