workflow_ocr

OCR Overwrites digitally signed files

Open farhills opened this issue 1 year ago • 10 comments

Describe the bug

Files with a digital signature are being overwritten, deleting the digital seal (leaving just the image of the signature)

System

  • App version: 1.27.1
  • Nextcloud version: 27.0.1
  • PHP version: 8.2.8
  • Environment: Linuxserver docker container on unraid
  • ocrmypdf version: 14.1.0

How to reproduce

Steps to reproduce the behavior:

  1. create a pdf and apply a digital signature
  2. allow cron to run
  3. signature is deleted from document

Screenshots

image

Additional context

I've deleted the OCR rule for 'file modified', but in my typical workflow I print to PDF and immediately sign, so the files are captured in the queue and often don't get processed until after they've been signed.

It would be great if we could detect if a file is signed and skip it.

I've also commented on ocrmypdf #1040 as I recognize this issue may be more appropriately directed toward that project.

https://github.com/ocrmypdf/OCRmyPDF/issues/1040

farhills avatar Aug 11 '23 19:08 farhills

Hi @farhills and thanks for reporting this. I'm afraid you're right: this issue seems to be related to ocrmypdf rather than to this app. The app itself doesn't touch the contents of the converted files; it only creates a new file version in NC with the result of the ocrmypdf conversion.

As far as I understand, the tool technically cannot preserve a valid digital PDF signature, since it changes the document's content, which invalidates any signature.

One way would be to tell ocrmypdf to again sign the document after the process (which is currently not possible AFAIK). If it's possible to check if a pdf is signed or not, we could also add an option "Skip signed pdf" to the app itself.
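A rough sketch of what such a "is this PDF signed?" check could look like, written in Python for illustration only (the app itself is PHP). It is a byte-level heuristic, not a real PDF parser: it assumes a signed file contains a `/ByteRange [` entry together with a `/Type /Sig` or `/FT /Sig` marker in uncompressed form. The function names `looks_signed` and `file_looks_signed` are hypothetical, and a robust implementation would want a proper PDF library instead.

```python
import re
from pathlib import Path

# Heuristic markers (assumption): an applied signature carries a /ByteRange
# array, and the signature dictionary or its form field is typed /Sig.
_BYTE_RANGE = re.compile(rb"/ByteRange\s*\[")
_SIG_TYPE = re.compile(rb"/Type\s*/Sig\b")
_SIG_FIELD = re.compile(rb"/FT\s*/Sig\b")


def looks_signed(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes appear to contain an applied signature."""
    has_byte_range = _BYTE_RANGE.search(pdf_bytes) is not None
    has_sig_marker = (
        _SIG_TYPE.search(pdf_bytes) is not None
        or _SIG_FIELD.search(pdf_bytes) is not None
    )
    # An empty signature field has /FT /Sig but no /ByteRange, so both
    # conditions are required before the file is treated as signed.
    return has_byte_range and has_sig_marker


def file_looks_signed(path: str) -> bool:
    """Convenience wrapper that reads the whole file into memory first."""
    return looks_signed(Path(path).read_bytes())
```

A "Skip signed pdf" option could call something like `file_looks_signed(path)` before queueing the file. Because this is only a heuristic, false negatives are possible, so it should not be the only safeguard.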

If you're able to sign your documents via CLI, you could also try to chain the OCR workflow with the external command workflow.

R0Wi avatar Aug 11 '23 21:08 R0Wi

Thanks. As I wrote the issue, I realized it would be the underlying library that has to deal with this. My professional organization has teamed up with a very closed-source certificate authority, and there's no CLI option for signing. The process is heavily locked down.

I'll mark the issue as closed. If ocrmypdf adds a new switch '--skip-signed' or similar I'll open a new feature request here to tap into that functionality. Thanks!

farhills avatar Aug 11 '23 23:08 farhills

And just like that it's been fixed! OCRmyPDF v14.4.0 and later refuses to modify signed PDFs by default, so digital signatures are preserved. Earlier versions clobber the signature without warning.

OCRmyPDF cannot preserve digital signatures in PDFs and also add OCR to them.
By default, it will refuse to modify a signed PDF regardless of other settings. You can
override this behavior with ``--invalidate-digital-signatures``; as the name suggests,
any digital signatures will be invalidated.

OCRmyPDF cannot open documents that are encrypted with a digital certificate.

Versions of OCRmyPDF prior to 14.4.0 would invalidate existing digital signatures
without warning.

https://github.com/ocrmypdf/OCRmyPDF/commit/a371655052a488c59b82ae659642bc76f57c1399

farhills avatar Aug 14 '23 17:08 farhills

Thanks for letting us know! Sounds like we might want to introduce an additional switch for the digital signature behaviour.

R0Wi avatar Aug 14 '23 19:08 R0Wi

In my use case, digitally signed documents should never be changed, even if the document OCR is imperfect or incomplete. These files represent final outputs, and need to be retained unmodified.

When OCR is complete, a new file is saved, so the digital signature is lost (as opposed to editing a signed file, where the signature is retained but made invalid by the edit).

I would, at most, add the --invalidate-digital-signatures flag only for the 'Force OCR' option. Safer for the user, but a bit more work for you, would be an opt-in UI checkbox 'include digitally signed files'. Either way, there needs to be a warning to the user that the signed file will be replaced by the OCR output, and the signature will be permanently lost.

farhills avatar Aug 14 '23 20:08 farhills

Some additional feedback - the app notifications need to be updated to catch and handle the no-output condition when processing a digitally signed file. IMO this can be done silently. Currently it throws an error in the browser and desktop client.

image

CLI output for the same file:

root@5ea6340167e7:/data/xxxxxxxxxxxxxx/files/Misc-JD/OCR-Testing# ocrmypdf 'Digital Signature Sample.pdf' sigoutput.pdf
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
DigitalSignatureError: Input PDF has a digital signature. OCR would alter the document,         _sync.py:432
invalidating the signature.

farhills avatar Aug 16 '23 00:08 farhills

Good catch, thanks for the hint. I think we need to properly recognize this situation and, instead of throwing an error, log an informational message, for example.
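As a minimal sketch of that idea (in Python for illustration; the app itself is PHP): the wrapper below treats a failed run whose output contains the string `DigitalSignatureError`, as seen in the CLI output above, as a skip rather than an error. Matching on that string is an assumption, since the thread does not say which exit code ocrmypdf uses in this case. `OcrOutcome`, `classify` and `run_ocr` are hypothetical names.

```python
import subprocess
from dataclasses import dataclass

# Marker taken from the CLI output quoted in this thread.
SIGNATURE_MARKER = "DigitalSignatureError"


@dataclass
class OcrOutcome:
    status: str  # "ok", "skipped_signed", or "failed"
    detail: str


def classify(returncode: int, output: str) -> OcrOutcome:
    """Map an ocrmypdf exit code plus its captured output to an outcome."""
    if returncode == 0:
        return OcrOutcome("ok", "")
    if SIGNATURE_MARKER in output:
        # Not a real failure: the signed input was deliberately left untouched.
        return OcrOutcome("skipped_signed", "input PDF is digitally signed")
    return OcrOutcome("failed", output.strip())


def run_ocr(input_pdf: str, output_pdf: str) -> OcrOutcome:
    """Run ocrmypdf and classify the result (requires ocrmypdf on PATH)."""
    proc = subprocess.run(
        ["ocrmypdf", input_pdf, output_pdf],
        capture_output=True,
        text=True,
    )
    # Check both streams, since where the message is printed is not assumed.
    return classify(proc.returncode, proc.stdout + proc.stderr)
```

A caller could then log `skipped_signed` at info level and reserve user-facing notifications for `failed`.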

R0Wi avatar Aug 16 '23 04:08 R0Wi

Hello,

In my use case, most of the time I would not care about the original digital signature, but I do care about proper OCR. I understand that an altered file cannot retain its original signature, and I want OCR nonetheless.

But I would not use force OCR, because I do not want to destroy the original (and probably best) OCR.

It would be great if this were an option like the Remove background option, because it makes perfect sense to accept the possible loss of the digital signature in modes like skip text.

image

yeupou avatar Oct 28 '23 14:10 yeupou

Current implementation plan would be like the following:

  • Add a new switch "Invalidate Digital Signatures" to the per-workflow settings with appropriate help text
  • If ocrmypdf version is < 14.4.0
    • and switch is not set: do not add any CLI argument, but log a warning (as currently implemented) if a signed file gets overwritten
    • and switch is set: same as above, but try not to log a warning (if possible... we need to check whether there is a way to detect this error reliably)
  • If version is >= 14.4.0: add the CLI argument if the switch is set, otherwise don't add it. Log any errors/warnings.
  • We'd need a parser for the output of ocrmypdf --version
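The version check and argument selection in this plan could be sketched roughly as follows (Python for illustration; the app itself is PHP). It assumes that the output of `ocrmypdf --version` contains a dotted version number such as `14.1.0`. `parse_version` and `signature_args` are hypothetical helper names. An unparseable version is treated like an old one, so the flag is never passed to a version that might not know it.

```python
import re
from typing import List, Optional, Tuple

MIN_VERSION = (14, 4, 0)  # first release with the signature check, per this thread
FLAG = "--invalidate-digital-signatures"

Version = Tuple[int, int, int]


def parse_version(output: str) -> Optional[Version]:
    """Extract (major, minor, patch) from `ocrmypdf --version` output."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def signature_args(version: Optional[Version], invalidate: bool) -> List[str]:
    """Return the extra CLI arguments for the per-workflow switch."""
    if invalidate and version is not None and version >= MIN_VERSION:
        return [FLAG]
    # Older or unknown versions do not get the flag; for those, the plan
    # relies on logging a warning when a signed file gets overwritten.
    return []
```

For example, `signature_args(parse_version("14.1.0"), True)` yields an empty list, while the same call with `"14.4.0"` yields the flag.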

R0Wi avatar Oct 29 '23 21:10 R0Wi

Please see my comments here: https://github.com/ocrmypdf/OCRmyPDF/issues/1003#issuecomment-2216803297 If the process encounters a digitally signed PDF, it could simply make a copy and process the copy, marking the file with a meaningful tag such as "OCR-no-signature". I don't think I need to emphasize how important it is to OCR digitally signed documents for search purposes.

ferdiga avatar Jul 09 '24 07:07 ferdiga