pdfrw
PdfReader fails in decryption
Hello
I am using pdfrw to read an encrypted file. The file does not need a password to open, and I can view it in Adobe Reader. When I open it with PdfReader, I get an exception:
$ python
Python 2.7.10 (default, Jan 30 2019, 03:22:04)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-23)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pdfrw
>>> pdfrw.PdfReader('Encrypted.pdf', decrypt=True, decompress=True)
[WARNING] tokens.py:221 Did not find PDF object (197, 0) (line=2076, col=1, token='startxref')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/prometheus/pdfrw/lib/python2.7/site-packages/pdfrw/pdfreader.py", line 645, in __init__
    self._parse_encrypt_info(source, password, trailer)
  File "/home/prometheus/pdfrw/lib/python2.7/site-packages/pdfrw/pdfreader.py", line 499, in _parse_encrypt_info
    key = crypt.create_key(password, trailer)
  File "/home/prometheus/pdfrw/lib/python2.7/site-packages/pdfrw/crypt.py", line 31, in create_key
    key_size = int(doc.Encrypt.Length or 40) // 8
AttributeError: 'NoneType' object has no attribute 'Length'
The issue seems to be caused by pdfrw failing to find object (197, 0), even though it is present in the PDF file. Object (197, 0) contains the encryption details, so when it is not found doc.Encrypt is None and the lookup of Length fails.
Any help in solving this issue is greatly appreciated. Thanks.
(Edit: Sample pdf can be downloaded from https://www.proofpoint.com/us/resources/white-papers/who-moved-my-data)
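To double-check that object (197, 0) really is physically present in the file, a raw byte scan that does not depend on pdfrw can help. The following is a minimal sketch, not part of pdfrw: find_object_headers is a hypothetical helper name, and it assumes the object is stored as an ordinary top-level "197 0 obj" header rather than inside a compressed object stream.

```python
import re


def find_object_headers(data, objnum, gen=0):
    """Return the byte offsets of every 'objnum gen obj' header in raw PDF bytes.

    Hypothetical diagnostic helper (not a pdfrw API). It only sees objects
    written as plain top-level headers, not ones packed in object streams.
    """
    pattern = re.compile(
        # The lookbehind stops '197 0 obj' from matching inside '1197 0 obj'.
        br"(?<![0-9])"
        + str(objnum).encode("ascii")
        + br"\s+"
        + str(gen).encode("ascii")
        + br"\s+obj\b"
    )
    return [m.start() for m in pattern.finditer(data)]


# Usage against the file from this report (the file name is taken from the
# session above):
#
#     with open('Encrypted.pdf', 'rb') as f:
#         print(find_object_headers(f.read(), 197, 0))
```

A non-empty result means the header exists somewhere in the file, which would point at the xref handling rather than at a damaged PDF. The scan can also report false positives if the same byte sequence happens to occur inside stream data, so treat the offsets as hints only.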
I have written a fix for this issue. Please check whether it is correct. Thanks.
Note: I could not run the unit tests successfully even without this change.
$ git diff
diff --git a/pdfrw/pdfreader.py b/pdfrw/pdfreader.py
index c2ae030..621fff4 100644
--- a/pdfrw/pdfreader.py
+++ b/pdfrw/pdfreader.py
@@ -614,8 +614,8 @@ class PdfReader(PdfDict):
             # Find all the xref tables/streams, and
             # then deal with them backwards.
             xref_list = []
+            source.obj_offsets = {}
             while 1:
-                source.obj_offsets = {}
                 trailer, is_stream = self.parsexref(source)
                 prev = trailer.Prev
                 if prev is None: