dothinking
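> At this point it looks like a problem with the PDF format. My PDF is in ANSI format; converting it to UTF-8 works, but after the conversion there is no content left.

What do you mean by the PDF being in ANSI format, and how did you convert it to UTF-8?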
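> I ran into a similar problem with a PDF file created by MS Office 2007.

Would it be possible to upload your PDF, or send it to my email, so I can look for the cause? Thanks.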
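@ranger2001 Thanks for providing the test file. I can confirm that the problem is in `get_texttrace()`, a method provided by `PyMuPDF`, the upstream library used to process PDFs, which in turn wraps the corresponding function of `MuPDF` one level further upstream. A complete fix has to wait for an official fix from them, which will probably take a while. As a temporary workaround, you can ignore the error with `try-except`. For example, go to line 70 of `RawPageFitz.py` (...\site-packages\pdf2docx\page\RawPageFitz.py):

```python
try:
    spans = self.page_engine.get_texttrace()
except SystemError:
    # logging.warning('Ignore hidden text checking due to UnicodeDecodeError in upstream library.')
    spans =...
```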
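`pdf2docx` uses `get_texttrace()` to detect hidden text and then decides, as configured, whether to write it to the converted docx. For example, some scanned PDF books, especially older documents, have an OCR text layer hidden behind the scanned image layer so that the text can be copied and searched. In that case, you can set the parameter `ocr=0` or `ocr=1` to output only the image or only the text to the docx (avoiding overlap between image and text); see #132. Overall, hidden text does not need to be considered in the vast majority of cases, and `get_texttrace()` only seems to have problems with certain Chinese fonts, so the temporary fix above is suitable for the vast majority of cases.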
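To make the workaround pattern concrete, here is a minimal, self-contained sketch of the same idea outside of `pdf2docx`. The helper name `safe_texttrace` and the two stand-in page classes are hypothetical and exist only for illustration. The snippet quoted above is cut off before the fallback value, so returning an empty list (meaning "no hidden-text information for this page") is an assumption, not necessarily what the actual patch does.

```python
import logging
from typing import Any, List


def safe_texttrace(page: Any) -> List[dict]:
    """Return page.get_texttrace(), or an empty list if the upstream call fails."""
    try:
        return page.get_texttrace()
    except SystemError:
        # Assumption: hidden-text detection can be skipped for this page.
        logging.warning("get_texttrace() failed; skipping hidden text detection.")
        return []


class BrokenPage:
    """Hypothetical stand-in for a page whose text trace cannot be produced."""

    def get_texttrace(self) -> List[dict]:
        raise SystemError("simulated upstream failure")


class GoodPage:
    """Hypothetical stand-in for a page that works; the dict is a placeholder."""

    def get_texttrace(self) -> List[dict]:
        return [{"placeholder": True}]


broken_result = safe_texttrace(BrokenPage())  # falls back to []
good_result = safe_texttrace(GoodPage())      # passes the spans through
```

With a real `PyMuPDF` page object the call would be the same, since the helper only relies on the object having a `get_texttrace()` method.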
Hi @kcho-mirato, many thanks for your pull request. I don't quite understand why `doc.paragraphs[-2]` might cause an index issue, because, at this moment, we have at least 2 sections, where...
@kcho-mirato no problem. Thanks!
Hi Ravid, sorry for the late reply. This is not the first time I have received this issue report, but I have not been able to resolve it due to knowing nothing about the...
> where the code do the convert and get the letter?

The text is extracted with `PyMuPDF`, which seems to recognize the RTL language correctly. Then the text is written to docx with `python-docx`. Based...
By the way, on the relation between `Line` and `TextSpan`: a `Line` consists of a list of `TextSpan` objects, while the letters are contained in each `TextSpan`. For example, a line "a brown...
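As a rough illustration of that containment relation, here is a minimal sketch using simplified stand-in classes. These are not the actual `pdf2docx` classes: the `font` attribute, the `text` property, and the sample string are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextSpan:
    """Simplified stand-in: a run of letters sharing one style."""
    text: str
    font: str = "Arial"  # hypothetical style attribute, for illustration only


@dataclass
class Line:
    """Simplified stand-in: an ordered list of spans."""
    spans: List[TextSpan] = field(default_factory=list)

    @property
    def text(self) -> str:
        # The letters live in the spans; the line only concatenates them.
        return "".join(span.text for span in self.spans)


line = Line([
    TextSpan("Hello, "),
    TextSpan("world", font="Arial-Bold"),  # a differently styled span
    TextSpan("!"),
])
print(line.text)  # prints "Hello, world!"
```

The point of the split is that a change of style within a line starts a new span, so per-letter processing has to iterate over the spans rather than over the line itself.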
Hi liuxunfei, it seems that no PDF or docx was uploaded.