PyPDF4
strange issue with resolvedObjects
Unfortunately, I cannot share the source documents that are causing this problem, so what I'm instead looking for is some hints as to where I may look to find what could be causing this (I took a look at the source for PdfFileReader
and nothing is jumping out at me) so that I could create a workaround at a minimum.
Using IPython, this is what I get:
In [1]: import PyPDF4
In [2]: pdf = PyPDF4.PdfFileReader('some_file.pdf')
In [3]: for key, val in pdf.resolvedObjects.items():
   ...:     print(key, val)
   ...:
(0, 634) {'/DecodeParms': {'/Columns': 3, '/Predictor': 12}, '/Filter': '/FlateDecode', '/ID': [b'<edited out>', b'<edited out>'], '/Index': [614, 22], '/Info': IndirectObject(613, 0), '/Prev': 1820112, '/Root': IndirectObject(615, 0), '/Size': 636, '/Type': '/XRef', '/W': [1, 2, 0]}
(0, 611) {'/DecodeParms': {'/Columns': 4, '/Predictor': 12}, '/Filter': '/FlateDecode', '/ID': [b'<edited out>', b'<edited out>'], '/Info': IndirectObject(613, 0), '/Root': IndirectObject(615, 0), '/Size': 614, '/Type': '/XRef', '/W': [1, 3, 0]}
In [4]: pdf.resolvedObjects
Out[4]: ---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
700 type_pprinters=self.type_printers,
701 deferred_pprinters=self.deferred_printers)
--> 702 printer.pretty(obj)
703 printer.flush()
704 return stream.getvalue()
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
381 if cls in self.type_pprinters:
382 # printer registered in self.type_pprinters
--> 383 return self.type_pprinters[cls](obj, self, cycle)
384 else:
385 # deferred printer
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/lib/pretty.py in inner(obj, p, cycle)
610 and not (p.max_seq_length and len(obj) >= p.max_seq_length):
611 keys = _sorted_for_pprint(keys)
--> 612 for idx, key in p._enumerate(keys):
613 if idx:
614 p.text(',')
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/lib/pretty.py in _enumerate(self, seq)
284 def _enumerate(self, seq):
285 """like enumerate, but with an upper limit on the number of items"""
--> 286 for idx, x in enumerate(seq):
287 if self.max_seq_length and idx >= self.max_seq_length:
288 self.text(',')
RuntimeError: dictionary changed size during iteration
In [5]: for key, val in pdf.resolvedObjects.items():
   ...:     print(key, val)
   ...:
(0, 634) {'/DecodeParms': {'/Columns': 3, '/Predictor': 12}, '/Filter': '/FlateDecode', '/ID': [b'<edited out>', b'<edited out>'], '/Index': [614, 22], '/Info': IndirectObject(613, 0), '/Prev': 1820112, '/Root': IndirectObject(615, 0), '/Size': 636, '/Type': '/XRef', '/W': [1, 2, 0]}
(0, 611) {'/DecodeParms': {'/Columns': 4, '/Predictor': 12}, '/Filter': '/FlateDecode', '/ID': [b'<edited out>', b'<edited out>'], '/Info': IndirectObject(613, 0), '/Root': IndirectObject(615, 0), '/Size': 614, '/Type': '/XRef', '/W': [1, 3, 0]}
(0, 610) {'/Filter': '/FlateDecode', '/First': 6, '/N': 1, '/Type': '/ObjStm'}
(0, 613) {'/CreationDate': "D:20150730143930+10'00'", '/Creator': '28C-1', '/ModDate': "D:20150803093650+10'00'", '/Producer': 'Develop ineo+ 280'}
(0, 615) {'/Metadata': IndirectObject(608, 0), '/OpenAction': [IndirectObject(616, 0), '/Fit'], '/Pages': IndirectObject(612, 0), '/Type': '/Catalog'}
(0, 608) {'/Subtype': '/XML', '/Type': '/Metadata'}
(0, 609) {'/Filter': '/FlateDecode', '/First': 6, '/N': 1, '/Type': '/ObjStm'}
(0, 612) {'/Count': 26, '/Kids': [IndirectObject(616, 0), IndirectObject(1, 0), IndirectObject(23, 0), IndirectObject(43, 0), IndirectObject(53, 0), IndirectObject(68, 0), IndirectObject(109, 0), IndirectObject(127, 0), IndirectObject(163, 0), IndirectObject(217, 0), IndirectObject(275, 0), IndirectObject(305, 0), IndirectObject(334, 0), IndirectObject(389, 0), IndirectObject(414, 0), IndirectObject(426, 0), IndirectObject(435, 0), IndirectObject(460, 0), IndirectObject(468, 0), IndirectObject(478, 0), IndirectObject(489, 0), IndirectObject(508, 0), IndirectObject(518, 0), IndirectObject(540, 0), IndirectObject(554, 0), IndirectObject(575, 0)], '/Type': '/Pages'}
so pdf.resolvedObjects
is clearly changing somehow, in that the __init__
method seems to be giving something incomplete. I can make a workable-ish workaround via:
In [15]: pdf = PyPDF4.PdfFileReader('some_file.pdf')
In [16]: pdf._flatten()
In [17]: for key, val in pdf.resolvedObjects.items():
    ...:     print(key, val)
    ...:
(0, 634) {'/DecodeParms': {'/Columns': 3, '/Predictor': 12}, '/Filter': '/FlateDecode', '/ID': [b'<edited out>', b'<edited out>'], '/Index': [614, 22], '/Info': IndirectObject(613, 0), '/Prev': 1820112, '/Root': IndirectObject(615, 0), '/Size': 636, '/Type': '/XRef', '/W': [1, 2, 0]}
(0, 611) {'/DecodeParms': {'/Columns': 4, '/Predictor': 12}, '/Filter': '/FlateDecode', '/ID': [b'<edited out>', b'<edited out>'], '/Info': IndirectObject(613, 0), '/Root': IndirectObject(615, 0), '/Size': 614, '/Type': '/XRef', '/W': [1, 3, 0]}
(0, 615) {'/Metadata': IndirectObject(608, 0), '/OpenAction': [IndirectObject(616, 0), '/Fit'], '/Pages': IndirectObject(612, 0), '/Type': '/Catalog'}
(0, 609) {'/Filter': '/FlateDecode', '/First': 6, '/N': 1, '/Type': '/ObjStm'}
(0, 612) {'/Count': 26, '/Kids': [IndirectObject(616, 0), IndirectObject(1, 0), IndirectObject(23, 0), IndirectObject(43, 0), IndirectObject(53, 0), IndirectObject(68, 0), IndirectObject(109, 0), IndirectObject(127, 0), IndirectObject(163, 0), IndirectObject(217, 0), IndirectObject(275, 0), IndirectObject(305, 0), IndirectObject(334, 0), IndirectObject(389, 0), IndirectObject(414, 0), IndirectObject(426, 0), IndirectObject(435, 0), IndirectObject(460, 0), IndirectObject(468, 0), IndirectObject(478, 0), IndirectObject(489, 0), IndirectObject(508, 0), IndirectObject(518, 0), IndirectObject(540, 0), IndirectObject(554, 0), IndirectObject(575, 0)], '/Type': '/Pages'}
.
.
.
so I can at least do the tasks I need to do with the document, but making a call to pdf.resolvedObjects
still raises an exception the first time I try to use it.
Any idea what may be causing this? Would be more than happy to help with a fix if I can get some help tracking down the source of the problem.
It appears you were using Python 3.6 in that example, am I right? What do you get with Python 2 instead?
I am still fairly new to the codebase and haven't delved deeply into this specific issue yet, but of the PDF samples in the PDF_Samples/
dir, none seems to have its list of objects read. I wouldn't rule out a wider problem than the one reported here.
from os import listdir
from PyPDF4 import PdfFileReader
DIR = "PDF_Samples/"
for f in listdir(DIR):
    if f.endswith(".pdf"):
        r = PdfFileReader(DIR + f)
        print("len(r.resolvedObjects) = %d" % len(r.resolvedObjects))
$ python3 ./resolved_objects.py
len(r.resolvedObjects) = 0
len(r.resolvedObjects) = 0
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1799]
len(r.resolvedObjects) = 0
len(r.resolvedObjects) = 0
len(r.resolvedObjects) = 0
len(r.resolvedObjects) = 0
The very same output is produced by Python 2.
The method responsible for populating PdfFileReader.resolvedObjects
is cacheIndirectObject()
. Its stack trace would ideally be:
__init__() > read() > cacheIndirectObject()
but some internal machinery prevents cacheIndirectObject()
from ever being reached (I'm sure of this because I placed a quick'n'dirty print()
statement at the beginning of the method).
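One way to check whether a method is ever reached without editing the library source is to wrap it at runtime and count the calls. The sketch below uses a stand-in class (`FakeReader` is hypothetical, not PyPDF4's `PdfFileReader`) whose `read()` never gets as far as `cacheIndirectObject()`; the `count_calls` helper is also an assumption made for illustration, though the same idea could in principle be applied to the real method.

```python
import functools


def count_calls(cls, method_name):
    """Wrap cls.method_name so every call is counted; return the counter."""
    original = getattr(cls, method_name)
    counter = {"calls": 0}

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        counter["calls"] += 1
        return original(*args, **kwargs)

    setattr(cls, method_name, wrapper)
    return counter


class FakeReader:
    """Hypothetical stand-in for a reader whose cache is filled by read()."""

    def __init__(self, populate):
        self.resolvedObjects = {}
        self.read(populate)

    def read(self, populate):
        # Mimics internal machinery that may skip the caching call entirely.
        if populate:
            self.cacheIndirectObject(0, 1, {"/Type": "/Catalog"})

    def cacheIndirectObject(self, generation, idnum, obj):
        self.resolvedObjects[(generation, idnum)] = obj
        return obj


counter = count_calls(FakeReader, "cacheIndirectObject")
reader = FakeReader(populate=False)
print(counter["calls"], len(reader.resolvedObjects))
```

A counter that stays at zero after construction would confirm the same thing the quick print() did, without touching the installed package.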
Ah yes, I think you've found the culprit--seems to work as it should in 2.7.
Alright then, that means that the problem is a 2 vs. 3 thing buried somewhere in the PdfFileReader.read
method (called on line 1148 in __init__
) which uses cacheIndirectObject
(among other things) to populate self.resolvedObjects
.
Time to put on my detective hat I guess, that is a particularly gruesome block of code, but at least I have a hint of what's going on now.
EDIT: Sorry, I see you already said that. It's very early and my brain is still waking up it would seem.
OK, I set debug=True
in PdfFileReader.read
and added a really informative message after line 1866 (if x.isdigit():
) just to see if the code is even making it to the caching call, and this is what I now get:
In [1]: import PyPDF4
In [2]: pdf = PyPDF4.PdfFileReader('some_file.pdf')
>>read <_io.BytesIO object at 0x11019c938>
line: b''
line: b'%%EOF'
****** I am here ******
read idx_pairs=[(614, 22)]
XREF Uncompressed: 614 0
XREF Uncompressed: 615 0
.
.
.
****** I am here ******
read idx_pairs=[(0, 614)]
XREF Uncompressed: 1 0
.
.
.
XREF Compressed: 612 609 0
XREF Compressed: 613 610 0
So the reader appears to be sort of doing its thing, in that it is at least finding the items to put into the document list, and the two appearances of my very informative print statement correspond to the two objects that appear in the .resolvedObjects
attribute. It's just that nothing else is making it in. But this is alright; the search for the culprit is narrowing rapidly.
EDIT:
Alright, this is going to be much trickier than I thought, as the code is really, really convoluted. However, what seems to be happening is that __init__
calls self.read(stream)
, which does a lot of hard-to-follow things, but the main hangup seems to be that somewhere in the process there are calls to self.getObject(), only they are made via things like:
...
fields = tree["/Fields"]
for f in fields:
    field = f.getObject()
...
but this doesn't make sense at this point, since f
is not an attribute of self
at the onset, so how can it have a method .getObject()
to call? So it seems like __init__
is then leaving .resolvedObjects
as a pointer to stuff, so that when I try to call it within IPython, which forces everything to be evaluated, we end up with the error that I got in the first place.
I may, of course, be going the totally wrong way with this, but it seems a plausible culprit at this point; any thoughts?
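For what it's worth, `f` does not need to be an attribute of `self` for `f.getObject()` to work: the traceback further down this thread shows `IndirectObject.getObject()` delegating to `self.pdf.getObject(self)`. A minimal sketch of that delegation pattern follows, using simplified stand-ins (`MiniReader` and `MiniRef` are hypothetical names, not PyPDF4's classes):

```python
class MiniRef:
    """Stand-in for an indirect reference: an id plus a back-pointer."""

    def __init__(self, idnum, generation, pdf):
        self.idnum = idnum
        self.generation = generation
        self.pdf = pdf  # the reader that knows how to resolve this reference

    def getObject(self):
        # Delegate to the owning reader; the reference itself holds no data.
        return self.pdf.getObject(self)


class MiniReader:
    """Stand-in reader with a lazily filled cache."""

    def __init__(self, raw_objects):
        self._raw = raw_objects      # pretend file contents, keyed by id
        self.resolvedObjects = {}    # cache, filled only on demand

    def getObject(self, ref):
        key = (ref.generation, ref.idnum)
        if key not in self.resolvedObjects:
            self.resolvedObjects[key] = self._raw[ref.idnum]
        return self.resolvedObjects[key]


reader = MiniReader({1: {"/T": "name"}, 2: {"/T": "date"}})
fields = [MiniRef(1, 0, reader), MiniRef(2, 0, reader)]

print(len(reader.resolvedObjects))   # nothing cached yet
for f in fields:
    field = f.getObject()            # each call populates the cache
print(len(reader.resolvedObjects))
```

Under this reading, the cache is only as complete as the set of references that something has already resolved.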
First off, what revision have you checked out while testing this code, @DeliciousHair? I'm going to pinpoint the problem across the latest commits and see if it has been introduced there. (Your HEAD tip I assume to be claird:master
anyway.)
I'm assuming that you're asking about this?
$ git describe --tags
v1.27.0-9-g2ca3e19
FWIW, I'm in the process of tracing out how PdfFileReader.read()
actually is functioning, as I think the problem is down a bunch of convoluted back-and-forth that should likely be streamlined anyway.
Getting closer. Maybe.
In .read()
there is the line:
newTrailer = readObject(stream, self)
actually there are a few calls to readObject(stream, self)
, but following this through piece by piece we see around line 1664 (I've made some changes, hence the approximation) there is:
self.stream.seek(start, 0)
which is clearly a problem when this is being called from within __init__
, since self.stream
is only defined after self.read(stream)
has been run. Simply switching the order of these lines in __init__
doesn't fix things, unfortunately.
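The ordering hazard itself can be reproduced in isolation. This is only a sketch with a hypothetical `OrderDemo` class, not PyPDF4's code: a method called from `__init__` touches `self.stream` before the attribute has been assigned.

```python
import io


class OrderDemo:
    def __init__(self, stream, assign_first):
        if assign_first:
            self.stream = stream
            self.read(stream)
        else:
            self.read(stream)      # read() runs before self.stream exists
            self.stream = stream

    def read(self, stream):
        # Uses the attribute rather than the argument, like the seek() call above.
        self.stream.seek(0, 0)
        self.header = self.stream.read(5)


data = io.BytesIO(b"%PDF-1.4\n")

try:
    OrderDemo(data, assign_first=False)
    failed = False
except AttributeError:
    failed = True

ok = OrderDemo(data, assign_first=True)
print(failed, ok.header)
```

In this toy case reordering is enough; in the real reader other state is evidently involved as well.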
I do not maintain this repository, but I'll leave #11 to you alone rather than have us both work on the same thing simultaneously. If you (think you) have solved the problem, I invite you to submit a PR; if not, you can pass the issue to me and share what you have found while working on it.
Besides this issue, I suspect that resolvedObjects
should on average contain more object references than it currently does. Much of PyPDF4 is untested and, if it turns out that it doesn't meet the specifications, it would do it some good to be refactored/recoded (where relevant) and to have some unit tests deployed.
No problem, I'll keep plugging away at this as I could really stand to have it working but I should add that I am not really much of a coder, so my repairs could end up being as much of a mess as the current thing.
To try and meet said specifications, where can they be found? In particular, I have completely abandoned 2.7 some time ago now and a lot of cleaning-up could be accomplished immediately by dropping a bunch of the wrappers for binary data present that are essentially pointless for 3.x.
To try and meet said specifications, where can they be found?
Eh eh, good question. I've been contributing to this project lately assuming that there were some, while indeed I have added an incomplete bare-minimum of mine in README.md
and many others are still lacking.
I think that the project owner should be concerned with setting up some simple but effective contribution guidelines for allowing casual contributors such as me and you to stay in line with the rules. (Hey @claird, I'm open for that position, i.e. drafting simple but useful contribution rules!)
In particular, I have completely abandoned 2.7
Good thing. Quoting the requests library, you chose Python 3+ and you're a person of taste. FYI, it seems PyPDF4 will support both 2.7 and 3, with version 3 definitely recommended.
I am not really much of a coder
If that's so, I'll act as an intermediary between this issue and a possible future pull request that will solve it. You point me to the alleged mistake in the code and I'll take care of fixing it through a PR. Mail me if you want to keep in contact ;-).
Hi again @DeliciousHair. I cannot yet be 100% sure about this, but it seems that resolvedObjects
is intended for internal use only (and should, indeed, be renamed to _resolvedObjects
, like many other alleged public methods).
So it looks like you were using the library incorrectly. What I would suggest, instead, is to rely on the PdfFileReader.getObject()
method, which I have documented in one of my latest revisions, and whose use is demonstrated in part below:
from os.path import join
from PyPDF4 import PdfFileReader
from PyPDF4.generic import IndirectObject
DIR = "PDF_Samples/"
file = "AutoCad_Diagram.pdf"
r = PdfFileReader(join(DIR, file), debug=False)
o = filter(
    lambda e: e is not None,
    [r.getObject(IndirectObject(idnum, 0, r)) for idnum in range(1, 19)]
)
o = list(o)
print("len(o) == %d\n" % len(o))
$ python3 ./resolved_objects.py
len(o) == 18
Now -- I know, I know. How can you tell which object references are there? This is something I am trying to work out myself. getObject()
requires an IndirectObject
instance with the generation and identifier numbers, but you have to know these. From what I've seen so far, no public methods/properties store the list of allowable (gen. num, identifier)
entries; resolvedObjects
seems to act as a cache dictionary, which is why you were surprised to see it changing so unfathomably.
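If `resolvedObjects` really is a lazily filled cache, the `RuntimeError` from the original report has a simple mechanical explanation: something resolves a reference while the dictionary is being iterated, so the cache grows mid-loop. A minimal sketch of that effect, using hypothetical stand-ins (`LazyRef`, `walk_unsafe`, `walk_safe`) rather than PyPDF4's classes:

```python
class LazyRef:
    """Stand-in for a reference that caches its target when resolved."""

    def __init__(self, key, cache):
        self.key = key
        self.cache = cache

    def resolve(self):
        # Resolving has a side effect: the shared cache may gain an entry.
        return self.cache.setdefault(self.key, {"resolved": self.key})


def walk_unsafe(cache):
    for key, val in cache.items():
        if isinstance(val, LazyRef):
            val.resolve()          # mutates `cache` during iteration


def walk_safe(cache):
    for key, val in list(cache.items()):   # iterate over a snapshot
        if isinstance(val, LazyRef):
            val.resolve()


cache = {}
cache[(0, 1)] = LazyRef((0, 2), cache)

try:
    walk_unsafe(cache)
    error = None
except RuntimeError as exc:
    error = str(exc)

print(error)
print(sorted(cache))
```

Snapshotting with `list(...)` lets the loop finish, but entries added during the walk only show up on a later pass.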
That said, if my analysis is correct, a feature to extract the list of indirect objects in the File Body indexed by the Cross-Reference Table (this is all ISO 32000/PDF jargon) could definitely be added :+1:.
@newnone:
Wow, that's a lot of stuff done! Apologies for leaving you hanging, but I seem to have missed the notifications for the previous two responses. I've had to put this effort aside myself due to work commitments, but I will definitely be checking out your modifications very shortly.
Great work! :-)
(in the meantime I've discovered something even stranger, but I raised that in #21 instead.)
I have tried this PR on a number of files and overall it works much better when it works, but it also falls over critically in a number of instances where master
is able to trundle along, albeit in a convoluted manner. I cannot share the sample documents I'm using, but leave this with me and I'll share the logs at least. Just not today due to time constraints unfortunately.
Splendid, I'll work toward reducing those failure cases and improving PyPDF even further.
It seemed to me that you were using resolvedObjects
(now renamed to _cachedObjects
) to access the indirect objects of your PDF files. Whether that was its use or not, can we consider this issue closed now that #14 has been merged?
I remind you that if you wish to access all the indirect objects from a PDF file, you should resort to PdfFileReader.objects()
.
My only contribution to this thread is to applaud the progress you've made.
I understand that PyPDF4 might appear to have regressed for a few specific documents. I'm sure that's part of a larger move to a richer testing suite.
Cameron Laird, vice president We make computers work for people.
I think I've found a bit of a pattern to the hard-fails. I've got a large volume of documents that are TIFF scans that are placed into a PDF container, and a large minority of them have been further modified using, presumably, something like MS paint and then exported to PDF via ghostscript. A convoluted process for sure, but I have no control over the source material.
Regardless, with the previous version (i.e., pre-#14 merge) the behaviour with these documents was problematic, but the new merge makes them completely inaccessible:
In [1]: import pypdf
In [2]: pdf = pypdf.PdfFileReader('failing_sample.pdf')
---------------------------------------------------------------------------
PdfReadError Traceback (most recent call last)
<ipython-input-2-8ac78e782e5f> in <module>()
----> 1 pdf = pypdf.PdfFileReader('failing_sample.pdf')
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pypdf/pdf.py in __init__(self, stream, strict, warndest, overwriteWarnings, debug)
1311
1312 self.stream = stream
-> 1313 self._parsePdfFile(stream)
1314
1315 def __repr__(self):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pypdf/pdf.py in _parsePdfFile(self, stream)
2369 elif self.strict:
2370 raise PdfReadError(
-> 2371 "Unknown xref type: %s" % xrefType
2372 )
2373
PdfReadError: Unknown xref type: 255
In [3]: pdf = pypdf.PdfFileReader('failing_sample.pdf', strict=False)
In [4]: pdf.getPage(0)
PdfReadWarning: Object 1 0 not defined. [pdf.py:2076]
---------------------------------------------------------------------------
PdfReadError Traceback (most recent call last)
<ipython-input-4-b34ec9cc413a> in <module>()
----> 1 pdf.getPage(0)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pypdf/pdf.py in getPage(self, pageNumber)
1461 # Ensure that we're not trying to access an encrypted PDF
1462 if self._flattenedPages is None:
-> 1463 self._flatten()
1464
1465 return self._flattenedPages[pageNumber]
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pypdf/pdf.py in _flatten(self, pages, inherit, indirectRef)
1814 if pages is None:
1815 self._flattenedPages = []
-> 1816 catalog = self._trailer["/Root"].getObject()
1817 pages = catalog["/Pages"].getObject()
1818
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pypdf/generic.py in __getitem__(self, key)
570
571 def __getitem__(self, key):
--> 572 return dict.__getitem__(self, key).getObject()
573
574 def getXmpMetadata(self):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pypdf/generic.py in getObject(self)
198
199 def getObject(self):
--> 200 return self.pdf.getObject(self).getObject()
201
202 def __repr__(self):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pypdf/pdf.py in getObject(self, ref)
2077 )
2078 raise PdfReadError(
-> 2079 "Could not find object (%d, %d)" % (ref.idnum, ref.generation)
2080 )
2081
PdfReadError: Could not find object (1, 0)
vs. with the pre-#14 version:
In [1]: import PyPDF4 as pypdf
In [2]: pdf = pypdf.PdfFileReader('failing_sample.pdf')
In [3]: pdf.getPage(0).keys()
Out[3]: dict_keys(['/Resources', '/MediaBox', '/Type', '/Parent', '/Contents', '/Rotate'])
If you like, I may be able to share some prototype examples with you directly; I suspect that this is now a fairly small fix to the monumental amount of work you've already done. Please get in touch with me directly if you're interested.
In the meantime, I will start going through what you have done for #14 and see if I can figure out the failure point as well, as I really want to migrate to the newer version--when it works, it works sooo good! Excellent job!
If you like, I may be able to share some prototype examples with you directly
Please do send me the sample files, without hesitation. If they need to stay private, send them to [email protected].
"... I have no control over the source material ...": I assume that all of us with any degree of expertise in PDF recognize that these workflows we program are largely mistakes that can't be better rationalized because of some external constraint. PDF work always seems to be that way.
My summary: we understand that you're working with imperfect materials. I personally very much appreciate your efforts, DeliciousHair, to improve PyPDF4's still so-primitive testing.
Cameron Laird, vice president We make computers work for people.
Initially I suspected that some of the filters in filters.py
might be causing the problem, but that was probably a rather far-fetched hypothesis. It took only a few seconds of inspecting the stack trace to notice an "xref" type (very poor nomenclature, to be changed) equal to 255
. Neither PyPDF nor any other PDF software can do anything about it AFAIK, judging from the 2008 ISO 32000 standard.
My suggestion: set strict=False
in PdfFileReader
__init__()
and see what happens. The nature of this "problem" stands out very clearly to me.
Paragraph 7.5.8.3
has a relevant excerpt from the standard. We do not interpret unrecognized Cross-Reference Stream types as references to the null value, but report them.
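For reference, here is a sketch of how a cross-reference stream entry could be decoded and classified along the lines of that paragraph. It assumes the three-field layout driven by `/W` (for example `[1, 2, 0]`, as in the trailer printed at the top of this thread), treats types 0, 1 and 2 as free, uncompressed and compressed entries, and maps any other type to a null reference instead of raising. `decode_xref_entry` is a hypothetical helper, not part of PyPDF4, and the sample rows are made up.

```python
def decode_xref_entry(row, widths):
    """Split one xref-stream row into (kind, field2, field3) using /W widths."""
    fields = []
    pos = 0
    for width in widths:
        chunk = row[pos:pos + width]
        pos += width
        # A zero-width field is absent; None marks "use the default".
        fields.append(int.from_bytes(chunk, "big") if width else None)

    entry_type, field2, field3 = fields
    if entry_type is None:
        entry_type = 1            # default type when the first width is 0
    field2 = field2 or 0
    field3 = field3 or 0

    if entry_type == 0:
        return ("free", field2, field3)          # next free object, generation
    if entry_type == 1:
        return ("uncompressed", field2, field3)  # byte offset, generation
    if entry_type == 2:
        return ("compressed", field2, field3)    # object stream number, index
    return ("null", None, None)                  # unknown type: treat as null


print(decode_xref_entry(b"\x01\x00\x10", [1, 2, 0]))
print(decode_xref_entry(b"\x02\x02\x61", [1, 2, 0]))
print(decode_xref_entry(b"\xff\x00\x00", [1, 2, 0]))
```

Whether a reader should silently map an unknown type such as 255 to null or report it, as the strict mode here does, is a design choice; the sketch only shows the lenient branch.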
Yup, that is correct. Notice, however, that I did try using strict=False
in the __init__
method which led to the error of being unable to flatten the PDF document.
Side rant, I find it very frustrating that Adobe has made their product so robust that tools that create totally non-compliant documents still manage to render as expected; makes tasks like this needlessly difficult! :-)
EDIT: also note that I am able to brute-force access to the document via the pre-#14 version of PyPDF
Note that the changes suggested here:
https://stackoverflow.com/questions/45978113/pypdf2-write-doesnt-work-on-some-pdf-files-python-3-5-1/52687771#52687771
fixed the problem for me. The line numbers have changed, but the changes still "fit".
I proposed the suggested changes myself; let me know if a PR is needed, although it might break some other, as yet unknown functionality.