Menextract2pdf
zlib.error: Error -3 while decompressing data: incorrect header check
Hi, I am a long-time Mac Mendeley user, but I have become extremely fed up with the various bugs and limitations of Mendeley, so I have decided to try to switch to Zotero. The problem is that I have 10 years of annotated (and highlighted) PDFs that I cannot lose in the conversion process. I have tried running the .sh from my macOS Sierra terminal, but it does not work. The only command that starts some sort of process is:
python3 menextract2pdf.py mydatabase.sqlite mypdffolder/ --overwrite
The overwriting of PDFs works for a while, and about a third of my 2800 files get modified with the highlighting as they should. But then the process stops and I get the following error message:
Traceback (most recent call last):
File "menextract2pdf.py", line 193, in
Thanks in advance for your help! Max
Hi Max, this looks like either the PDF file is somehow corrupt or it is a bug in PyPDF2. To figure out which file is causing the issue, try putting
print(fn)
before the processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)
on line 177 in menextract2pdf.py. If you could share the file I can try to debug.
Cheers
Jochen
Hi Jochen
Thank you so much for addressing my problem! I appreciate it very much. I added "print(fn)" before line 177 and ran the .py file again, and I get the same error message. I do not see a difference in the printed output in my terminal window.
Here is a link to a fairly recent backup of my database:
I had upgraded to the newest version of Mendeley which encrypted the database so I had to look for older backups from the spring before the update. This database file is the copy I had on my office computer. Not the same one with which I worked last week when I posted this query on GitHub, but running menextract2pdf on this database produces the same error as the other version I have at home. The only difference is that the script does not seem to process the bibliographic entries in the same order, so it appears as if it does not stop on the same entry (but that might not be the case and just me not understanding how the script works).
Thanks again!
Cheers!
Maxime
Hi Maxime,
The print statement should give us the filename of the offending PDF file, not fix the error. Can you copy and paste the full error? I suspect the filename simply got lost in all the output. Unfortunately, the database does not help, as the error is related to one of the PDF files.
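One way to keep the filename from getting lost in the output is to report it only when the call fails, right before the traceback. The following is a minimal sketch, not part of menextract2pdf.py; `call_and_report` is a hypothetical helper name, and it assumes the failing call is the `processpdf(...)` call quoted above.

```python
import sys


def call_and_report(process, fn, *args):
    """Call process(fn, *args); if it raises, name the file on stderr and re-raise.

    Hypothetical helper, not part of menextract2pdf.py.
    """
    try:
        return process(fn, *args)
    except Exception:
        # Written immediately before the traceback, so it is easy to find
        # even when thousands of filenames were printed earlier.
        print("FAILED while processing: " + fn, file=sys.stderr, flush=True)
        raise
```

In the script, the existing call could then be written as `call_and_report(processpdf, fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)`. The error still stops the run, but the last line before the traceback names the file.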
Hi Jochen,
Thank you for your reply. Here is the complete printout copy/pasted from my terminal window. Can you see something in there?
Thanks!
Maxime
...
So it prints out the filenames (the /Users/maxime/Library ... bits); however, what you pasted is cut off before the file in question. Could you just paste the error message at the end and the last filename that appears before it?
Hi, This?
/Users/maxime/Library/Application Support/Mendeley Desktop/Downloaded/Gingras - 2010 - Naming without Necessity On the Genealogy and Uses of the Label ‘Historical Epistemology’.pdf
Traceback (most recent call last):
File "/Users/maxime/Downloads/Menextract2pdf-master/menextract2pdf.py", line 185, in
Could you share the file: Gingras - 2010 - Naming without Necessity On the Genealogy and Uses of the Label ‘Historical Epistemology’.pdf? That seems to be the one causing the issue.
Hi, Yes, here it is: https://www.dropbox.com/s/3zsa41sxd360hjf/Gingras%20-%202010%20-%20Naming%20without%20Necessity%20On%20the%20Genealogy%20and%20Uses%20of%20the%20Label%20%E2%80%98Historical%20Epistemology%E2%80%99.pdf?dl=0
Just an observation: when I ran the script on a different version of the database (at work), it stopped at another file. I have not been able to work out the order in which the script processes the files.
Thanks again!!
Maxime
I am getting the same error on a PDF. I can open it in Acrobat and in Mendeley just fine. I can also manually export this file with its annotations through Mendeley.
Is it possible to somehow keep going through all the files, so that I can then manually export the ones that fail? How do I get it to skip this file instead of crashing on this error?
Thanks!
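A skip-and-continue loop along these lines could be sketched as follows. This is an illustration under assumptions, not the script's actual code: `process_all` and `process_one` are hypothetical names, and `process_one` stands for whatever wraps the script's per-file `processpdf` call.

```python
def process_all(filenames, process_one):
    """Run process_one(fn) on every file, skipping the ones that raise.

    Returns (succeeded, failed); failed holds (filename, error text) pairs
    so the failing files can be exported by hand afterwards.
    """
    succeeded, failed = [], []
    for fn in filenames:
        try:
            process_one(fn)
        except Exception as exc:  # zlib.error is a subclass of Exception
            failed.append((fn, "{}: {}".format(type(exc).__name__, exc)))
            continue
        succeeded.append(fn)
    return succeeded, failed
```

Printing the `failed` list at the end of the run gives the set of PDFs to export manually from Mendeley. Catching `Exception` broadly is a deliberate choice for a one-off migration; it also hides unrelated bugs, so the recorded error text is worth reading.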
Also, it looks like for both me and @ammoniac1984, it is happening when PyPDF2 thinks the file is encrypted. There is a comment in menextract2pdf.py that says that overriding the encryption worked in the one case you saw. Maybe that is not working for us?
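To find candidate files ahead of time, one could scan the PDF folder for files that declare encryption. The sketch below is a rough heuristic using only the standard library, not what PyPDF2 or menextract2pdf.py does: it assumes that an encrypted PDF carries an `/Encrypt` key in plain bytes, so it can produce false positives if that string happens to appear elsewhere in a file. The function names are hypothetical.

```python
from pathlib import Path


def looks_encrypted(path):
    """Heuristic: True if the raw bytes of the file contain an /Encrypt key."""
    return b"/Encrypt" in Path(path).read_bytes()


def find_flagged(folder):
    """Return the sorted paths of all *.pdf files in folder that look encrypted."""
    return sorted(str(p) for p in Path(folder).glob("*.pdf") if looks_encrypted(p))
```

Calling `find_flagged("mypdffolder")` would list the files worth checking in a PDF viewer's security properties before running the script.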
So the PDF that it is hiccuping on for me opens in Adobe Acrobat and Evince just fine. However, when I tried to open it with pdftk, it said that it had password protection and would not open it. Here is what the security details look like in Adobe:
[Screenshot: Annotation 2019-10-08 202043]
So it says that it is encrypted, but opens it just fine. My way around it was to make a LaTeX file that simply includes this file and then writes it out. The resulting file is not encrypted. Here is what the file made from LaTeX looks like in Adobe:
[Screenshot: Annotation 2019-10-08 202736]
I then screwed up by trying to add this to Mendeley and delete the other file, but that deleted all my annotations from the database. I guess the annotations are tied to the specific file?
Luckily I had a backup. Unfortunately, the deletion had already sync'd to Mendeley's servers. So I had to disconnect from the internet, copy over my backup of the database, open Mendeley, make a backup, then close Mendeley, reconnect to the internet, and open it again (at which point it sync'd and re-deleted my annotations). Then I restored the backup (which replaced the database both locally and on the servers), which brought back my annotations (and the "encrypted" file). I then closed Mendeley before it could sync the new files, replaced the PDF with the unencrypted one, and started it again, and it appears to be okay. It then sync'd the backup (but with the unencrypted PDF) back to their servers. I think I am okay now.
I think this issue can be marked as closed, as the workaround suggested by @folofjc works: replacing the file that has "Password Security" with a copy that has "No Security". What I did (on macOS) was to print the file as a PDF to the desktop (it then had "None" listed as security in the file properties in Finder). Then I overwrote the old file with this new file, ran the script again, and it worked.
I don't know how it works on macOS, but on Windows, when you print to PDF, it turns the pages into images, so you would lose any "text as text." The nice thing about going through LaTeX is that if it is text, it keeps it as text.
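The thread does not show the LaTeX file itself, so the following is only a guess at what such a wrapper could look like, written as a small Python helper that generates it. It assumes the `pdfpages` package is used; `write_wrapper` and the file names are hypothetical. The `fitpaper` option is included so the output pages keep the size of the input pages.

```python
from pathlib import Path

# Minimal LaTeX wrapper: include every page of one PDF and write it back out.
TEMPLATE = r"""\documentclass{article}
\usepackage{pdfpages}
\begin{document}
\includepdf[pages=-,fitpaper]{%s}
\end{document}
"""


def write_wrapper(pdf_name, tex_path):
    """Write a .tex file that re-emits pdf_name via pdfpages; returns tex_path."""
    Path(tex_path).write_text(TEMPLATE % pdf_name, encoding="utf-8")
    return tex_path
```

After `write_wrapper("input.pdf", "wrapper.tex")`, running `pdflatex wrapper.tex` in the same folder should produce `wrapper.pdf`. Copying the original to a simple name such as `input.pdf` first avoids trouble with spaces and quotation marks in the file name.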