Garbled text when extracting a Chinese PDF

Open java668 opened this issue 1 year ago • 1 comments

Description of the bug

Pythonܔزၥᦶ໛ຝҁ෫႕҂

ܔزၥᦶ༷ᬿ

Pythonၥᦶ໛ຝ

՗᫫կຝ຅ጱ᥯ଶ๶᧔҅ၥᦶ๋᯿ᥝጱྍṈฎࣁ᫫կ୏ݎጱ෸ײኴفྲ᫾অ҅ಅզࣁ෱๗ၥᦶጱኴف҅

՗᫫կᕪၧ਍ጱ᥯ଶӤ๶᧔҅ݎሿጱᳯ᷌ᥴ٬౮๜֗҅ಭفጱᩒრྲ᫾੝̶ࢩྌ҅੒Ӟӻၥᦶጱᔮᕹ҅

୏ত๋֯ጱၥᦶ੪ฎრդᎱᕆڦጱၥᦶ҅Ԟ੪ฎܔزၥᦶᴤྦྷ҅ᬯӻᬦᑕԞᤩ౮ԅጮፋၥᦶ̶ܔزၥᦶ

ฎ๋च๜Ԟฎ๋ବ੶ጱၥᦶᔄࣳ҅ܔزၥᦶଫአԭ๋च๜ጱ᫫կդᎱ҅ইᔄ҅ڍහ̶ොဩᒵ҅ܔزၥᦶ

᭗ᬦݢಗᤈጱෙ᥺༄ັᤩၥܔزጱᬌڊฎވჿ᪃ᶼ๗ᕮຎ̶ࣁၥᦶᰂਁरጱቘᦞӤ๶᧔҅᩼ஃӥጱၥᦶ

ಭفᩒრ᩼ṛ҅஑کጱࢧಸሲ᩼य़҅ᥠၥᦶᰂਁरཛྷࣳғ

ಲ୏᫫կຝ຅ጱ੶ᶎ҅ࣁᛔۖ۸ၥᦶጱ֛ᔮӾ҅ܔزၥᦶ໛ຝզ݊ܔزၥᦶጱᎣᦩ֛ᔮฎ஠ᶳᥝഩൎጱ

ದᚆԏӞ҅ܔزၥᦶጱᎣᦩ֛ᔮฎᛔۖ۸ၥᦶૡᑕ૵զ݊ၥᦶ୏ݎૡᑕ૵ጱᎣᦩ֛ᔮԏӞ҅ᘒӬฎ஠ᶳ

ٍ॓ጱᎣᦩԏӞ̶ࣁPython᧍᥺Ӿଫአ๋ଠာጱܔزၥᦶ໛ຝฎunittest޾pytest,unittestંԭຽٵପ҅

ݝᥝਞᤰԧPythonᥴ᯽࢏ݸ੪ݢզፗള੕فֵአԧ,pytestฎᒫӣොጱପ҅ᵱᥝܔᇿጱਞᤰ̶ܔزၥᦶ໛

ຝጱᎣᦩ֛ᔮ੪ࢱᕰunittest޾pytest๶ᦖᥴ̶

ጮፋၥᦶܻቘ PDF file: Python单元测试框架.pdf

How to reproduce the bug

Parsing the PDF file produces garbled text.

PyMuPDF version

1.23.x or earlier

Operating system

Linux

Python version

3.11

java668 avatar May 31 '24 10:05 java668

Please describe in English!

JorjMcKie avatar May 31 '24 10:05 JorjMcKie

Using this tool to parse Chinese PDF documents results in garbled characters. Could you please take a look? Thank you very much. PDF document: Python单元测试框架.pdf

java668 avatar May 31 '24 10:05 java668

This PDF is full of errors. See the following log produced when opening it:

import pymupdf
doc = pymupdf.open("Python (1).pdf")
print(pymupdf.TOOLS.mupdf_warnings())
format error: cannot recognize xref format
trying to repair broken xref
repairing PDF document
Bad or missing parent pointer in outline tree, repairing
... repeated 4 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 3 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 3 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing

When the document is then saved with only its first page, no PDF viewer or extraction tool can extract meaningful text from it.

doc.select([0])
doc.ez_save("page1.pdf")
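A log this long is easier to read once it is condensed. The following is a minimal sketch, not part of PyMuPDF: `summarize_warnings` is a hypothetical helper that counts the distinct message lines in the text returned by pymupdf.TOOLS.mupdf_warnings(). It skips the "... repeated N times..." marker lines rather than interpreting them, so the counts should be read as lower bounds.

```python
from collections import Counter


def summarize_warnings(log_text: str) -> Counter:
    """Count how often each distinct message line appears in a MuPDF warning log."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "... repeated N times..." markers.
        # Because the markers are ignored, the counts are lower bounds.
        if not line or line.startswith("... repeated"):
            continue
        counts[line] += 1
    return counts


# A short excerpt of the log above, used as sample input.
sample = """format error: cannot recognize xref format
trying to repair broken xref
repairing PDF document
Bad or missing parent pointer in outline tree, repairing
... repeated 4 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing"""

for message, n in summarize_warnings(sample).most_common():
    print(n, message)
```

For a real file, the input would be the string returned by pymupdf.TOOLS.mupdf_warnings() right after opening the document.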

JorjMcKie avatar May 31 '24 11:05 JorjMcKie

pypdfium2 (https://github.com/pypdfium2-team/pypdfium2) can extract the text. Could you please take a look? Thank you very much.

java668 avatar Jun 01 '24 07:06 java668

Sorry - as I wrote: this file has severe defects. Whether or not some tools may still be able to extract things despite this is a matter outside the scope we can deal with.

JorjMcKie avatar Jun 01 '24 08:06 JorjMcKie

OK, thank you very much.

java668 avatar Jun 02 '24 04:06 java668

How can I determine whether this PDF has errors? Is there a corresponding API? Thank you very much

java668 avatar Jun 03 '24 02:06 java668

Some errors are already detected when the PDF is opened, as in this case, where the central cross-reference (xref) table is broken. MuPDF then tries to repair things by generating a new xref table from walking through the full file. This is usually accompanied by error and warning messages. Some of those are written to the console; the full set of messages is also stored and can be retrieved with pymupdf.TOOLS.mupdf_warnings(), as shown.

Whether a repair was attempted can be determined by checking doc.is_repaired.

Not all errors can be detected at open time, though. Some only surface when certain information, such as text, is extracted, or when the pages' visual appearance is rendered.
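The checks described above could be combined roughly as follows. This is only a sketch under stated assumptions: `classify` and `check_pdf` are hypothetical helper names, the status labels are made up for illustration, and the code assumes a PyMuPDF version that can be imported as `pymupdf`. It relies on doc.is_repaired, pymupdf.TOOLS.mupdf_warnings() and pymupdf.TOOLS.reset_mupdf_warnings(), and on reading the warning store also emptying it by default.

```python
def classify(is_repaired: bool, warnings_text: str) -> str:
    """Hypothetical helper: map open-time signals to a coarse status label."""
    if is_repaired:
        return "repaired"
    if warnings_text.strip():
        return "warnings"
    return "clean"


def check_pdf(path: str) -> dict:
    """Open a PDF and collect open-time and extraction-time signals."""
    import pymupdf  # imported lazily so classify() works without PyMuPDF

    pymupdf.TOOLS.reset_mupdf_warnings()  # start from an empty warning store
    doc = pymupdf.open(path)
    # Messages produced while opening (assumed to empty the store when read).
    open_warnings = pymupdf.TOOLS.mupdf_warnings()

    failed_pages = []
    for page in doc:
        try:
            page.get_text()  # some defects only show up during extraction
        except Exception:
            failed_pages.append(page.number)

    return {
        "status": classify(doc.is_repaired, open_warnings),
        "open_warnings": open_warnings,
        "failed_pages": failed_pages,
        # Messages accumulated while extracting text from the pages.
        "extraction_warnings": pymupdf.TOOLS.mupdf_warnings(),
    }
```

Calling check_pdf("Python (1).pdf") on the file from this issue would be expected to report the "repaired" status, given the log shown earlier. A page that extracts without raising can still yield garbled text, so this only flags structural problems.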

JorjMcKie avatar Jun 03 '24 06:06 JorjMcKie

OK, thank you very much!

java668 avatar Jun 03 '24 08:06 java668