
[BUG] File altered at DOCUMENT_WORKING_PATH by PAPERLESS_PRE_CONSUME_SCRIPT is not consumed properly

Open lukasz-lobocki opened this issue 1 year ago • 6 comments

Description

I am decrypting PDFs with PAPERLESS_PRE_CONSUME_SCRIPT. This worked before 2.1.1 (I am not sure exactly when it stopped). Currently, it only partially behaves as expected.

The decrypted file is shown correctly in the thumbnail:

Screenshot from 2023-12-30 19-09-17

But the same file incorrectly remains encrypted in the viewer.

Screenshot from 2023-12-30 19-06-56

I am guessing that the thumbnail is created after decryption, but the file itself is archived before decryption.

Steps to reproduce

More info @ https://github.com/lukasz-lobocki/pdf_decrypt_retrieve_attachments

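import pikepdf

# pass_file_path is not defined in this excerpt; it is assumed to be set at
# module level in the full script (see the repository linked above).

def unlock_pdf(s_file_path: str, w_file_path: str) -> None:
    # Try each password in turn and overwrite the working copy with the decrypted PDF.
    passw = None
    print("Reading passwords from {pf}".format(pf=pass_file_path))
    with open(pass_file_path, "r") as f:
        passwords = f.readlines()
    for p in passwords:
        if not p.strip():
            continue  # skip blank lines; passw[0] below would raise IndexError
        passw = p.strip()
        try:
            with pikepdf.open(w_file_path, password=passw, allow_overwriting_input=True) as pdf:
                print("Unlocked succesfully with password {f}***{l}".format(f=passw[0], l=passw[-1]))
                # Saving writes the PDF back without encryption.
                pdf.save(w_file_path)
                # pdf.save(s_file_path)  # saving over the source path is left disabled
                print("Unlocked working file replaced {}".format(w_file_path))
                break
        except pikepdf.PasswordError:
            print("Password {f}***{l} is not working".format(f=passw[0], l=passw[-1]))
            continue
    if passw is None:
        print("Empty password file {pf}".format(pf=pass_file_path))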

Webserver logs

[2023-12-30 18:57:48,263] [INFO] [paperless.consumer] Decrypting file /tmp/paperless/paperless-ngxlys5ck3z/2019-02-28 Drive100.pdf

[2023-12-30 18:57:48,264] [INFO] [paperless.consumer] Reading passwords from /usr/src/paperless/scripts/passwords.txt

[2023-12-30 18:57:48,265] [INFO] [paperless.consumer] Unlocked succesfully with password D***!

[2023-12-30 18:57:48,265] [INFO] [paperless.consumer] Unlocked working file replaced /tmp/paperless/paperless-ngxlys5ck3z/2019-02-28 Drive100.pdf

Browser logs

No response

Paperless-ngx version

2.2.1

Host OS

Debian 6.1.67-1 (2023-12-12) x86_64

Installation method

Docker - official image

Browser

Firefox

Configuration changes

In docker-compose.yml, added the volume:
- /home/la_lukasz/paperless-ngx/scripts:/usr/src/paperless/scripts

In docker-compose.env, added:
PAPERLESS_PRE_CONSUME_SCRIPT=/usr/src/paperless/scripts/pre-consumption.py

Other

No response

Please confirm the following

  • [X] I believe this issue is a bug that affects all users of Paperless-ngx, not something specific to my installation.
  • [X] I have already searched for relevant existing issues and discussions before opening this report.
  • [X] I have updated the title field above with a concise description.

lukasz-lobocki, Dec 30 '23 18:12

I don't think anything changed recently with respect to using a pre-consume script, but that's not so much my end of things.

Your screenshot above shows the popup preview. One thing that changed there is that it now uses the "built-in" PDF viewer, whereas before the popup always used the browser preview. The lock icon does indicate a password-protected PDF (for the sake of usability, the popup doesn't let you enter a password at all). So I just want to make sure: is the preview area of the "edit" page (/documents/ID/) also prompting you for a password?

Again, I'm not the best person to comment on the script part; I just want to get an idea of whether the frontend is at all relevant here.

shamoon, Dec 30 '23 19:12

If anything, we fixed an issue here. We copy the original file into paperless, but parse from the working copy. The intent is always to archive the original, so that is what has been done here.

stumpylog, Dec 30 '23 19:12

This change was made in #4781: the file at the original path is now stored, not a possibly altered version of it. I don't see this as a bug; if anything, storing a possibly altered file was the problem.

The workaround is to decrypt before consumption, either manually or by watching another directory and then moving the decrypted files to the consumption directory.
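
A minimal sketch of the "watch another directory, then move" workaround, under stated assumptions: the function names `decrypt_with_pikepdf` and `process_staging`, the `.tmp` suffix, and the example paths are hypothetical and not part of Paperless-ngx. It only relies on `pikepdf.open`, `Pdf.save`, and `pikepdf.PasswordError`, which the script in the report already uses.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Sequence


def decrypt_with_pikepdf(src: Path, dst: Path, passwords: Sequence[str]) -> bool:
    """Try each password and save a decrypted copy to dst. True on success."""
    import pikepdf  # third-party; imported lazily so the mover works without it

    for password in passwords:
        try:
            with pikepdf.open(src, password=password) as pdf:
                pdf.save(dst)  # saved without encryption by default
            return True
        except pikepdf.PasswordError:
            continue
    return False


def process_staging(
    staging: Path,
    consume: Path,
    decrypt: Callable[[Path, Path], bool],
) -> List[Path]:
    """Decrypt every PDF in staging, then move the result into consume."""
    moved = []
    for src in sorted(staging.glob("*.pdf")):
        # Decrypt next to the source first, so the consumer never sees
        # a file that is still being written.
        tmp = staging / (src.name + ".tmp")
        if decrypt(src, tmp):
            target = consume / src.name
            shutil.move(str(tmp), str(target))
            src.unlink()  # the encrypted original is no longer needed here
            moved.append(target)
        else:
            tmp.unlink(missing_ok=True)  # leave the locked file in staging
    return moved


# Hypothetical usage, e.g. from cron or a small polling loop:
# process_staging(
#     Path("/data/staging"),
#     Path("/data/consume"),
#     lambda s, d: decrypt_with_pikepdf(s, d, ["secret1", "secret2"]),
# )
```

Keeping the staging directory on the same filesystem as the consumption directory should make the final move a plain rename, which reduces the chance of a partially copied file being picked up.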

stumpylog, Dec 30 '23 19:12

Thank you.

So this part of https://docs.paperless-ngx.com/advanced_usage/#pre-consume-script is misleading. Modifying the file is no longer an option, contrary to what the paragraph below suggests:

Pre-consume scripts which modify the document should only change the DOCUMENT_WORKING_PATH file or a second consume task may be triggered, leading to failures as two tasks work on the same document path
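
For reference, a rough sketch of what "only change the DOCUMENT_WORKING_PATH file" could look like in a pre-consume script. The helper name `rewrite_working_copy` and the `transform` callback are hypothetical; only the environment variable names come from the documentation quoted above. As discussed in this thread, in the reported version such a change affects what is parsed, while the stored file is still the original.

```python
import os
from pathlib import Path
from typing import Callable, Mapping


def rewrite_working_copy(
    environ: Mapping[str, str],
    transform: Callable[[bytes], bytes],
) -> Path:
    """Apply transform to the working copy only; never touch the source file."""
    working = Path(environ["DOCUMENT_WORKING_PATH"])
    source = environ.get("DOCUMENT_SOURCE_PATH")
    # Guard against a misconfiguration where both variables name the same file.
    if source is not None and Path(source) == working:
        raise ValueError("working path and source path must differ")
    working.write_bytes(transform(working.read_bytes()))
    return working


# Hypothetical usage inside the script configured as PAPERLESS_PRE_CONSUME_SCRIPT:
# rewrite_working_copy(os.environ, my_transform)
```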

Another idea, based on your suggestion, @stumpylog, to decrypt before consumption: is it allowed to delete the DOCUMENT_SOURCE_PATH file or the DOCUMENT_WORKING_PATH file from within PAPERLESS_PRE_CONSUME_SCRIPT? The aim would be to "sneakily" drop the original file and substitute the new decrypted one.

PS. Yes @shamoon, the preview area of the "edit" page (/documents/ID/) is also prompting for a password.

lukasz-lobocki, Dec 30 '23 20:12

There really isn't a good way for a pre- or post-consume script to modify the file. Changing the original triggers a new consume, which then fails in interesting ways when the original consume completes. Changing a document in post means the checksum no longer matches, which produces warnings. Changing the working copy allows parsing to happen, but it is ultimately discarded.
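
To illustrate the checksum point only: any edit to the stored bytes changes a content hash, so a digest recorded at consume time can no longer match. This is a generic sketch; SHA-256 is used as a stand-in, since the thread does not say which hash Paperless-ngx uses, and `file_digest` is a hypothetical helper.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Hex digest of the document bytes (SHA-256 as an illustrative choice)."""
    return hashlib.sha256(data).hexdigest()


stored = file_digest(b"original bytes")  # recorded when the file was consumed
current = file_digest(b"original bytes, edited by a post-consume script")

print(stored == current)  # False: the edit invalidates the recorded checksum
```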

But we want to store the original document, since this is ultimately trying to be an archive solution. I'll keep this open and think some more about a solution.

stumpylog, Dec 30 '23 21:12

@stumpylog I had a use case where my PAPERLESS_PRE_CONSUME_SCRIPT cut white space from a JPEG to make the PDF smaller and more pleasant to look at. I do not use this at the moment, but I would like to know that I could use it again in the future. Just as before, PAPERLESS_PRE_CONSUME_SCRIPT should be able to alter a copy of the original file, which is then archived. If my understanding is correct, I would therefore second @lukasz-lobocki: he described an actual bug, not a feature. Also, the previous functionality is still mentioned in the documentation.

szaiser, Jan 04 '24 13:01

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

github-actions[bot], Feb 06 '24 03:02