paperless-ngx
[BUG] File altered at DOCUMENT_WORKING_PATH by PAPERLESS_PRE_CONSUME_SCRIPT is not consumed properly
Description
I am decrypting PDFs with PAPERLESS_PRE_CONSUME_SCRIPT. This worked before 2.1.1 (I am not sure exactly when it stopped). Currently it only partially behaves as expected.
The decrypted file is shown correctly in the thumbnail.
But the same file incorrectly remains encrypted in the viewer.
My guess is that the thumbnail is created after decryption, but the file itself is archived before decryption.
Steps to reproduce
More info @ https://github.com/lukasz-lobocki/pdf_decrypt_retrieve_attachments
import pikepdf

# pass_file_path (the path to the password list) is defined elsewhere in the script.
def unlock_pdf(s_file_path: str, w_file_path: str) -> None:
    passw = None
    print("Reading passwords from {pf}".format(pf=pass_file_path))
    with open(pass_file_path, "r") as f:
        passwords = f.readlines()
    # Try each password in turn against the working copy.
    for p in passwords:
        passw = p.strip()
        try:
            with pikepdf.open(w_file_path, password=passw, allow_overwriting_input=True) as pdf:
                print("Unlocked succesfully with password {f}***{l}".format(f=passw[0], l=passw[-1]))
                # Overwrite the working copy with the decrypted document.
                pdf.save(w_file_path)
                #pdf.save(s_file_path)
                print("Unlocked working file replaced {}".format(w_file_path))
                break
        except pikepdf.PasswordError:
            print("Password {f}***{l} is not working".format(f=passw[0], l=passw[-1]))
            continue
    if passw is None:
        print("Empty password file {pf}".format(pf=pass_file_path))
Webserver logs
[2023-12-30 18:57:48,263] [INFO] [paperless.consumer] Decrypting file /tmp/paperless/paperless-ngxlys5ck3z/2019-02-28 Drive100.pdf
[2023-12-30 18:57:48,264] [INFO] [paperless.consumer] Reading passwords from /usr/src/paperless/scripts/passwords.txt
[2023-12-30 18:57:48,265] [INFO] [paperless.consumer] Unlocked succesfully with password D***!
[2023-12-30 18:57:48,265] [INFO] [paperless.consumer] Unlocked working file replaced /tmp/paperless/paperless-ngxlys5ck3z/2019-02-28 Drive100.pdf
Browser logs
No response
Paperless-ngx version
2.2.1
Host OS
Debian 6.1.67-1 (2023-12-12) x86_64
Installation method
Docker - official image
Browser
Firefox
Configuration changes
Added to volumes in docker-compose.yml:
- /home/la_lukasz/paperless-ngx/scripts:/usr/src/paperless/scripts
Added to docker-compose.env:
PAPERLESS_PRE_CONSUME_SCRIPT=/usr/src/paperless/scripts/pre-consumption.py
Other
No response
Please confirm the following
- [X] I believe this issue is a bug that affects all users of Paperless-ngx, not something specific to my installation.
- [X] I have already searched for relevant existing issues and discussions before opening this report.
- [X] I have updated the title field above with a concise description.
I don't think anything changed recently with respect to using a pre-consume script, but that's not so much my end of things.
Your screenshot above shows the popup preview. One thing that changed there is that it now uses the "built-in" PDF viewer, whereas before the popup always used the browser preview. Indeed, the lock icon indicates a password-protected PDF (for the sake of usability, the popup doesn't let you enter a password at all). So I just want to make sure: is the preview area of the "edit" page (/documents/ID/) also prompting you for a password?
Again, I'm not the best person to comment on the script part; I just want to get an idea of whether the frontend is at all relevant here.
If anything, we fixed an issue here. We copy the original file into paperless, but parse from the working copy. The intent is always to archive the original, so that is what has been done here.
This change was made in #4781: the file at the original path is now stored, not a possibly altered version of the original. I don't see this as a bug; rather, the earlier behavior meant storing a possibly altered file.
The workaround is to decrypt before consumption, either manually or by watching another directory and then moving the decrypted files to the consumption directory.
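The "watch another directory, then move" workaround could look roughly like the sketch below. It is only an illustration under assumptions: stage_into_consume, staging_dir, consume_dir and transform are hypothetical names, and the actual decryption step is passed in as a callable rather than hard-coded.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List


def stage_into_consume(
    staging_dir: Path,
    consume_dir: Path,
    transform: Callable[[Path, Path], None],
) -> List[Path]:
    """Transform each PDF found in staging_dir, then move the result into consume_dir.

    transform(src, dst) is a hypothetical hook that must write the processed
    (for example decrypted) document to dst.
    """
    moved = []
    for src in sorted(staging_dir.glob("*.pdf")):
        # Build the processed copy in a scratch directory first, so the
        # consumption directory only ever receives a finished file.
        with tempfile.TemporaryDirectory() as scratch:
            tmp = Path(scratch) / src.name
            transform(src, tmp)
            dest = consume_dir / src.name
            shutil.move(str(tmp), str(dest))
        # Assumed policy for this sketch: drop the staged input once moved.
        src.unlink()
        moved.append(dest)
    return moved
```

With pikepdf, transform might open src with pikepdf.open(src, password=...) and call pdf.save(dst). Because the file that reaches the consumption directory is already decrypted, the stored original is the decrypted one.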
Thank you.
So this part of https://docs.paperless-ngx.com/advanced_usage/#pre-consume-script is misleading. Modifying the document is no longer an option, contrary to what the paragraph below suggests.
Pre-consume scripts which modify the document should only change the DOCUMENT_WORKING_PATH file or a second consume task may be triggered, leading to failures as two tasks work on the same document path
Another idea, based on your suggestion, @stumpylog, to decrypt before consumption: is it allowed to delete the DOCUMENT_SOURCE_PATH file or the DOCUMENT_WORKING_PATH file from within PAPERLESS_PRE_CONSUME_SCRIPT, in order to "sneakily" drop the original file and substitute the new decrypted one?
PS. Yes @shamoon, the preview area of the "edit" page (/documents/ID/) is also prompting for a password.
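For reference, the documentation paragraph quoted above amounts to a script that touches only the working copy. A minimal sketch, assuming DOCUMENT_WORKING_PATH is handed to the script as an environment variable; rewrite_working_copy and transform are hypothetical names, not part of Paperless-ngx:

```python
import os
from pathlib import Path
from typing import Callable


def rewrite_working_copy(transform: Callable[[bytes], bytes]) -> Path:
    """Apply transform to the working copy only; the source file is never touched."""
    # Assumption: the consumer exports DOCUMENT_WORKING_PATH to the script.
    working = Path(os.environ["DOCUMENT_WORKING_PATH"])
    data = working.read_bytes()
    working.write_bytes(transform(data))
    return working
```

As discussed in this thread, a change made this way is visible to parsing in 2.2.1, but the file that ends up stored is still the untouched original.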
There really isn't a good way for a pre or post script to modify the file. Changing the original triggers a new consume, which then fails in interesting ways when the original consume completes. Changing a document in post means the checksum no longer matches, which produces warnings. Changing the working copy allows parsing to happen, but the working copy is ultimately discarded.
But we want to store the original document, since this is ultimately trying to be an archive solution. I'll keep this open and think some more about a solution.
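The checksum problem mentioned above can be illustrated with a small sketch. The thread does not name the hash algorithm, so SHA-256 is an assumption here, and file_digest and matches_stored_checksum are hypothetical helpers rather than Paperless-ngx functions:

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return a hex digest of the file contents (SHA-256 is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def matches_stored_checksum(path: Path, stored: str) -> bool:
    """True if the file on disk still matches the digest recorded at consume time."""
    return file_digest(path) == stored
```

If a post-consume script rewrites the stored file, a check like matches_stored_checksum returns False for the digest recorded earlier, which is the kind of mismatch that produces the warnings.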
@stumpylog I had a use case where my PAPERLESS_PRE_CONSUME_SCRIPT trimmed white space from a JPEG to make the PDF smaller and more pleasant to look at. I do not use this at the moment, but I would like to know that I could use it again in the future. Just as before, PAPERLESS_PRE_CONSUME_SCRIPT should be able to alter a copy of the original file, which is then archived. If my understanding is correct, I would therefore second @lukasz-lobocki: he described an actual bug, not a feature. Also, the previous functionality is still mentioned in the documentation.