Menextract2pdf zlib.error: Error -3 while decompressing data: incorrect header check

Hi, I am a long-time Mendeley user on Mac, but I have become extremely fed up with the various bugs and limitations of Mendeley, so I have decided to try to switch to Zotero. The problem is that I have 10 years of annotated (and highlighted) PDFs I cannot lose in the conversion process. I have tried running the .sh from my macOS Sierra terminal but it does not work. The only command that starts some sort of process is:

python3 menextract2pdf.py mydatabase.sqlite mypdffolder/ --overwrite

The overwriting of PDFs works for a while, and about a third of my 2800 files get modified with the highlighting as they should. But then the process stops and I get the following error message:

Traceback (most recent call last):
  File "menextract2pdf.py", line 193, in <module>
    mendeley2pdf(fn, dir_pdf)
  File "menextract2pdf.py", line 177, in mendeley2pdf
    processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)
  File "menextract2pdf.py", line 156, in processpdf
    inpdf._flatten()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1506, in _flatten
    pages = catalog["/Pages"].getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1593, in getObject
    retval = self._getObjectFromStream(indirectReference)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1543, in _getObjectFromStream
    streamData = BytesIO(b_(objStm.getData()))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/generic.py", line 841, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/filters.py", line 346, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/filters.py", line 111, in decode
    data = decompress(data)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/filters.py", line 49, in decompress
    return zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check

Thanks in advance for your help! Max

Oct 04 '18 16:10 ammoniac1984

Hi Max, this looks like either the PDF file is somehow corrupt or it is a bug in PyPDF2. To figure out which file is causing the issue, try putting

print(fn)

before the processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons) on line 177 in menextract2pdf.py. If you could share the file I can try to debug. Cheers Jochen
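A minimal sketch of where that line would sit, assuming the loop around the call looks roughly like the one below. Only the processpdf(...) call itself is taken from the traceback; the function name process_all, the dictionary of annotations, and the stub processpdf are hypothetical stand-ins so the snippet runs on its own. The flush=True argument is an addition here, meant to make sure the filename reaches the terminal before any traceback does.

```python
import os


def processpdf(src, dst, annotations):
    """Hypothetical stand-in for processpdf() in menextract2pdf.py."""
    if "broken" in src:
        raise ValueError("simulated failure while reading " + src)


def process_all(annotations_by_file, dir_pdf):
    """Assumed shape of the loop around the processpdf call."""
    for fn, annons in annotations_by_file.items():
        # Print first, so the last filename shown is the one that crashed.
        print(fn, flush=True)
        processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)
```

With this in place, the last path printed before the traceback is the file to look at.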

Oct 08 '18 07:10 cycomanic

Hi Jochen

Thank you so much for addressing my problem! I appreciate it very much. I added "print(fn)" before line 177, ran the .py file again, and got the same error message. I do not see a difference in the printed output in my terminal window.

Here is a link to a fairly recent backup of my database:

I had upgraded to the newest version of Mendeley which encrypted the database so I had to look for older backups from the spring before the update. This database file is the copy I had on my office computer. Not the same one with which I worked last week when I posted this query on GitHub, but running menextract2pdf on this database produces the same error as the other version I have at home. The only difference is that the script does not seem to process the bibliographic entries in the same order, so it appears as if it does not stop on the same entry (but that might not be the case and just me not understanding how the script works).

Thanks again!

Cheers!

Maxime

Oct 08 '18 14:10 ammoniac1984

Hi Maxime,

The print statement should give us the filename of the offending PDF file, not fix the error. Can you copy and paste the full error? I suspect the filename simply got lost in all the output. Unfortunately the database does not help, as the error is related to one of the PDF files.

Oct 09 '18 06:10 cycomanic

Hi Jochen,

Thank you for your reply. Here is the complete printout copy/pasted from my terminal window. Can you see something in there?

Thanks!

Maxime

...

Oct 09 '18 09:10 ammoniac1984

So it prints out the filenames (the /Users/maxime/Library ... bits); however, what you pasted is cut off before the file in question. Could you just paste the error message at the end and the last filename that appears before it?

Oct 09 '18 10:10 cycomanic

Hi, This?

/Users/maxime/Library/Application Support/Mendeley Desktop/Downloaded/Gingras - 2010 - Naming without Necessity On the Genealogy and Uses of the Label ‘Historical Epistemology’.pdf
Traceback (most recent call last):
  File "/Users/maxime/Downloads/Menextract2pdf-master/menextract2pdf.py", line 185, in <module>
    mendeley2pdf(fn, dir_pdf)
  File "/Users/maxime/Downloads/Menextract2pdf-master/menextract2pdf.py", line 169, in mendeley2pdf
    processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)
  File "/Users/maxime/Downloads/Menextract2pdf-master/menextract2pdf.py", line 147, in processpdf
    inpdf._flatten()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1506, in _flatten
    pages = catalog["/Pages"].getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1593, in getObject
    retval = self._getObjectFromStream(indirectReference)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1543, in _getObjectFromStream
    streamData = BytesIO(b_(objStm.getData()))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/generic.py", line 841, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/filters.py", line 346, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/filters.py", line 111, in decode
    data = decompress(data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/PyPDF2/filters.py", line 49, in decompress
    return zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check

Oct 09 '18 16:10 ammoniac1984

Could you share the file "Gingras - 2010 - Naming without Necessity On the Genealogy and Uses of the Label ‘Historical Epistemology’.pdf"? That seems to be the one causing the issue.

Oct 14 '18 14:10 cycomanic

Hi, Yes, here it is: https://www.dropbox.com/s/3zsa41sxd360hjf/Gingras%20-%202010%20-%20Naming%20without%20Necessity%20On%20the%20Genealogy%20and%20Uses%20of%20the%20Label%20%E2%80%98Historical%20Epistemology%E2%80%99.pdf?dl=0

Just an observation: when I ran the script on a different version of the database (at work), it stopped at another file. I have not been able to work out in which order the script deals with the files.

Thanks again!!

Maxime

Oct 14 '18 14:10 ammoniac1984

I am getting the same error on a PDF. I can open it in Acrobat and in Mendeley just fine. I can also manually export this file with its annotations through Mendeley.

Is it possible to somehow just keep going through all the files, so that I can then manually export the ones that fail? How do I get it to skip this file instead of crashing on this error?
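One way to get that behaviour is to wrap the per-file call in a try/except and collect the failures. The sketch below is not part of menextract2pdf.py: process_skipping_failures and fake_process are hypothetical names, and fake_process only simulates the zlib error from this thread. In the real script, the call to wrap would presumably be the processpdf(...) call shown in the traceback.

```python
import zlib


def process_skipping_failures(filenames, process_one):
    """Run process_one on every filename; record failures instead of stopping."""
    failed = []
    for fn in filenames:
        try:
            process_one(fn)
        except Exception as exc:  # broad on purpose: any per-file error is recorded
            print("skipped {}: {!r}".format(fn, exc))
            failed.append((fn, exc))
    return failed


def fake_process(fn):
    """Hypothetical stand-in that fails on one file, like the PDF in this thread."""
    if fn.endswith("bad.pdf"):
        raise zlib.error("Error -3 while decompressing data: incorrect header check")


failed = process_skipping_failures(["ok.pdf", "bad.pdf", "ok2.pdf"], fake_process)
```

The returned list holds the files that still need to be exported by hand.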

Thanks!

Oct 08 '19 14:10 folofjc

Also, it looks like for both me and @ammoniac1984 it is happening when PyPDF2 thinks the file is encrypted. There is a comment in menextract2pdf.py that says that overriding the encryption worked in the one case you saw. Maybe that is not working for us?
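If that reading is right, the error message itself would fit: according to the traceback, PyPDF2 hands the stream bytes straight to zlib.decompress, and bytes that are not a valid zlib stream fail its header check. The sketch below only demonstrates that zlib behaviour; the assumption that still-encrypted stream data is what reaches zlib in these PDFs is not confirmed anywhere in this thread, and the byte strings are made up for illustration.

```python
import zlib

payload = b"annotated page content " * 4
compressed = zlib.compress(payload)

# A proper zlib stream round-trips without complaint.
roundtrip = zlib.decompress(compressed)

# Bytes that are not a zlib stream (standing in for data that was never decrypted).
not_zlib = b"not a zlib stream"
message = None
try:
    zlib.decompress(not_zlib)
except zlib.error as exc:
    # Same error class and wording as at the bottom of the traceback.
    message = str(exc)
```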

Oct 08 '19 16:10 folofjc

So the pdf that it is hiccuping on for me opens in Adobe Acrobat and Evince just fine. However, when I tried to open it with pdftk, it said that it had a password protection and would not open it. Here is what the security details look like in Adobe

So it says that it is encrypted, but opens it just fine. My way around it was to simply make a LaTeX file that simply includes this file and then writes it out. This file is not encrypted. Here is what the file made from LaTeX looks like in Adobe:

I then screwed up by trying to add this to Mendeley and delete the other file, but that deleted all my annotations from the database. I guess the annotations are tied to the specific file?

Luckily I had a backup. Unfortunately, it sync'd to Mendeley's servers first. So I had to disconnect from the internet, copy over my backup of the database, open Mendeley, make a backup, then close Mendeley, reconnect to the internet, and open it (at which point it sync'd and re-deleted my annotations). Then I restored (which deleted the database both locally and on the servers), which brought back my annotations (and the "encrypted" file). So then I closed Mendeley before it could sync the new files. Then I replaced the PDF with the unencrypted one, started it again, and it appears to be okay. Then it sync'd the backup (but with the unencrypted PDF) back to their servers. But I think I am okay now.

Oct 08 '19 18:10 folofjc

I think this issue can be marked as closed, as the workaround suggested by @folofjc works: replacing the file that has "Password Security" with one that has "No Security". What I did (on macOS) was to print the file as a PDF to the desktop (it then had "None" listed as security in the file properties in Finder). Then I overwrote the old file with this new file, ran the script again, and it worked.
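To find the files that need this treatment before running the script, a rough check is to look for an /Encrypt key in each PDF, since an encrypted PDF carries that entry in its trailer. The sketch below is a heuristic using only the standard library, not something menextract2pdf.py does; looks_encrypted and find_encrypted are hypothetical names, and a file that merely contains the bytes "/Encrypt" somewhere else would be flagged as well.

```python
import os


def looks_encrypted(path):
    """Heuristic: True if the raw bytes of the file contain an /Encrypt key."""
    with open(path, "rb") as fh:
        return b"/Encrypt" in fh.read()


def find_encrypted(folder):
    """Return the sorted paths of PDFs under folder that look encrypted."""
    hits = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            full = os.path.join(root, name)
            if name.lower().endswith(".pdf") and looks_encrypted(full):
                hits.append(full)
    return sorted(hits)
```

Calling find_encrypted on the PDF folder gives a list of candidates to re-save without security before running the conversion.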

Jul 21 '20 12:07 dchakro

I don't know how it works on macOS, but on Windows when you print to PDF it turns the page into an image, so you would lose any "text as text." The nice thing about going through LaTeX is that if it is text, it stays text.

Jul 21 '20 12:07 folofjc
