
PDF with text layer errors if OCR_MODE_SKIP_FILE is set

Open XueSheng-GIT opened this issue 9 months ago • 6 comments

Describe the bug

Since https://github.com/R0Wi-DEV/workflow_ocr/pull/288, OCR_MODE_SKIP_FILE results in an error instead of a warning if the PDF already contains a text layer (no OCR required). I would not expect an error in this scenario.

Warning LOG (Before above pull):

OCRmyPDF succeeded with warning(s): PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr

Error LOG (After above pull):

OCRmyPDF exited abnormally with exit-code 6 for file /admin/files/FileDrop/EasyOCR-TEST.pdf. Message: PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr

System

  • App version: 1.31
  • Nextcloud version: 30.0.5
  • PHP version: 8.3
  • Environment: native Apache
  • ocrmypdf version: 15.2.0+dfsg1

How to reproduce

Steps to reproduce the behavior:

  1. Go to the workflow settings
  2. Create an OCR flow with the "skip file" setting (see screenshot below)
  3. Upload a PDF file that already has a text layer
  4. An error message is logged

Screenshots

Skip file setting:

Image

Server log

{"reqId":"NC6PrjvDvA6A6k68Wv8q","level":3,"time":"2025-02-05T21:49:04+01:00","remoteAddr":"","user":"--","app":"workflow_ocr","method":"","url":"--","message":"OCRmyPDF exited abnormally with exit-code 6 for file /admin/files/FileDrop/EasyOCR-TEST.pdf. Message: PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr","userAgent":"--","version":"30.0.5.1","exception":{"Exception":"OCA\\WorkflowOcr\\Exception\\OcrNotPossibleException","Message":"OCRmyPDF exited abnormally with exit-code 6 for file /admin/files/FileDrop/EasyOCR-TEST.pdf. Message: PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr","Code":0,"Trace":[{"file":"/var/www/nextcloud/apps/workflow_ocr/lib/Service/OcrService.php","line":152,"function":"ocrFile","class":"OCA\\WorkflowOcr\\OcrProcessors\\OcrMyPdfBasedProcessor","type":"->"},{"file":"/var/www/nextcloud/apps/workflow_ocr/lib/Service/OcrService.php","line":128,"function":"runOcrProcess","class":"OCA\\WorkflowOcr\\Service\\OcrService","type":"->"},{"file":"/var/www/nextcloud/apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php","line":57,"function":"runOcrProcessWithJobArgument","class":"OCA\\WorkflowOcr\\Service\\OcrService","type":"->"},{"file":"/var/www/nextcloud/lib/public/BackgroundJob/Job.php","line":61,"function":"run","class":"OCA\\WorkflowOcr\\BackgroundJobs\\ProcessFileJob","type":"->"},{"file":"/var/www/nextcloud/lib/public/BackgroundJob/QueuedJob.php","line":43,"function":"start","class":"OCP\\BackgroundJob\\Job","type":"->"},{"file":"/var/www/nextcloud/lib/public/BackgroundJob/QueuedJob.php","line":29,"function":"start","class":"OCP\\BackgroundJob\\QueuedJob","type":"->"},{"file":"/var/www/nextcloud/core/Command/Background/Job.php","line":71,"function":"execute","class":"OCP\\BackgroundJob\\QueuedJob","type":"->"},{"file":"/var/www/nextcloud/3rdparty/symfony/console/Command/Command.php","line":326,"function":"execute","class":"OC\\Core\\Command\\Background\\Job","type":"->"},{"file":"/var/www/nextcloud/3rdparty/symfony/console/Application.php","line":1078,"function":"run","class":"Symfony\\Component\\Console\\Command\\Command","type":"->"},{"file":"/var/www/nextcloud/3rdparty/symfony/console/Application.php","line":324,"function":"doRunCommand","class":"Symfony\\Component\\Console\\Application","type":"->"},{"file":"/var/www/nextcloud/3rdparty/symfony/console/Application.php","line":175,"function":"doRun","class":"Symfony\\Component\\Console\\Application","type":"->"},{"file":"/var/www/nextcloud/lib/private/Console/Application.php","line":183,"function":"run","class":"Symfony\\Component\\Console\\Application","type":"->"},{"file":"/var/www/nextcloud/console.php","line":87,"function":"run","class":"OC\\Console\\Application","type":"->"},{"file":"/var/www/nextcloud/occ","line":11,"args":["/var/www/nextcloud/console.php"],"function":"require_once"}],"File":"/var/www/nextcloud/apps/workflow_ocr/lib/OcrProcessors/OcrMyPdfBasedProcessor.php","Line":76,"message":"OCRmyPDF exited abnormally with exit-code 6 for file /admin/files/FileDrop/EasyOCR-TEST.pdf. Message: PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr","exception":[],"CustomMessage":"OCRmyPDF exited abnormally with exit-code 6 for file /admin/files/FileDrop/EasyOCR-TEST.pdf. Message: PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr"},"id":"67a3d2209a6bc"}

Browser log

n/a

Additional context

n/a

Feb 05 '25 21:02 XueSheng-GIT

Thanks for reporting this. Now that the exit code is captured properly, we should handle the different exit codes defined here: https://ocrmypdf.readthedocs.io/en/latest/advanced.html#return-code-policy
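To illustrate the idea of treating exit codes differently, here is a minimal sketch that sorts a raw exit code into "ok", "warning" or "error". The numeric values follow the OCRmyPDF return code policy linked above (only a subset is listed). The names `OcrMyPdfExit`, `WARNING_CODES` and `severity_for` are hypothetical, and Python is used purely for illustration since the app itself is written in PHP.

```python
from enum import IntEnum


class OcrMyPdfExit(IntEnum):
    """Subset of OCRmyPDF return codes (values per its return code policy)."""
    OK = 0
    BAD_ARGS = 1
    INPUT_FILE = 2
    ALREADY_DONE_OCR = 6  # PriorOcrFoundError: page already has text
    ENCRYPTED_PDF = 8
    OTHER_ERROR = 15


# Assumption: only "file already has text" is downgraded to a warning.
WARNING_CODES = frozenset({OcrMyPdfExit.ALREADY_DONE_OCR})


def severity_for(exit_code: int) -> str:
    """Return 'ok', 'warning' or 'error' for a raw process exit code."""
    if exit_code == OcrMyPdfExit.OK:
        return "ok"
    if exit_code in WARNING_CODES:
        return "warning"
    # Anything else, including unknown codes, is treated as a real failure.
    return "error"
```

With this split, the exit code 6 from the report above would map to "warning", while unknown or unlisted codes still surface as errors.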

Feb 05 '25 21:02 R0Wi

I just ran some initial tests with Nextcloud 31 and workflow_ocr_backend, and this issue also occurs when the backend is in use.

Mar 18 '25 10:03 XueSheng-GIT

Hi @XueSheng-GIT, thanks again for your feedback. I can see that you already experimented with the code. We would love to see a new PR to fix this 👍

I would check this code here: https://github.com/R0Wi-DEV/workflow_ocr/blob/3fa3ad121e711b04c1faa17dac5f2d34364b9d00/lib/OcrProcessors/Local/OcrMyPdfBasedProcessor.php#L66

If the exit code is 6 (see exit code table of ocrmypdf mentioned above), we could just throw a new OcrResultEmptyException (instead of OcrNotPossibleException) with some meaningful message. This would log a warning and wouldn't touch the existing file.
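A minimal sketch of that suggestion, written in Python for illustration only (the real processor is PHP). The two exception classes are stand-ins named after the app's `OcrResultEmptyException` and `OcrNotPossibleException`; the helpers `raise_for_exit_code` and `process`, and the wording of the warning message, are hypothetical.

```python
import logging

logger = logging.getLogger("workflow_ocr_sketch")

ALREADY_DONE_OCR = 6  # exit code shown in the error log of this issue


class OcrNotPossibleException(Exception):
    """Stand-in for the app's exception of the same name (logged as error)."""


class OcrResultEmptyException(Exception):
    """Stand-in for the app's exception of the same name (logged as warning)."""


def raise_for_exit_code(exit_code: int, message: str, path: str) -> None:
    """Hypothetical helper: translate a non-zero exit code into an exception."""
    if exit_code == 0:
        return
    if exit_code == ALREADY_DONE_OCR:
        # Nothing to do for this file, so signal an "empty" result.
        raise OcrResultEmptyException(
            f"OCRmyPDF skipped {path}: file already has a text layer"
        )
    raise OcrNotPossibleException(
        f"OCRmyPDF exited abnormally with exit-code {exit_code} "
        f"for file {path}. Message: {message}"
    )


def process(exit_code: int, message: str, path: str) -> str:
    """Caller side: warn and leave the file untouched for the empty case."""
    try:
        raise_for_exit_code(exit_code, message, path)
    except OcrResultEmptyException as exc:
        logger.warning("%s", exc)
        return "skipped"
    return "processed"
```

Only the "already has text" case is caught here; every other non-zero exit code still propagates as `OcrNotPossibleException`, so genuine failures stay visible as errors.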

If you need any help, please let me know 💯

Apr 06 '25 20:04 R0Wi

@R0Wi If I remember correctly, I wasn't able to find any handling of error codes for the 'remote' case. So I thought too much was missing to create a uniform fix using the defined error codes. Any chance of getting the error code from the remote backend?

Apr 07 '25 04:04 XueSheng-GIT

@XueSheng-GIT you're absolutely right 😄 Just after I wrote the comment, I also checked the remote implementation and realized that the ErrorResult currently just contains an error message and no information about the exit code. I could offer to add this information to the API response of the Workflow OCR Backend App and let you know as soon as this is done. If you like you could then take over from there and do the adjustments in the workflow_ocr (frontend) app here? Of course I'm happy to support you wherever needed 👍

Apr 07 '25 04:04 R0Wi

@XueSheng-GIT I added the exit code to both the backend and the client implementation. The latter is implemented in https://github.com/R0Wi-DEV/workflow_ocr/pull/302. You should now be able to read the (optional) exit code via getOcrMyPdfExitCode when the backend returns an error. Let me know if you need further assistance. Looking forward to reviewing your pull request 👍
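Because the exit code is optional in the remote case, a consumer has to cope with it being absent. A rough sketch, again in Python for illustration only: `ErrorResult` here is a stand-in whose field names are assumptions, the accessor mirrors `getOcrMyPdfExitCode`, and `is_already_ocred` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorResult:
    """Stand-in for the backend's error response; field names are assumed."""
    message: str
    ocr_my_pdf_exit_code: Optional[int] = None

    def get_ocr_my_pdf_exit_code(self) -> Optional[int]:
        # Python-style counterpart of getOcrMyPdfExitCode; None means the
        # backend did not report an exit code.
        return self.ocr_my_pdf_exit_code


def is_already_ocred(result: ErrorResult) -> bool:
    """True only if the backend explicitly reported exit code 6."""
    return result.get_ocr_my_pdf_exit_code() == 6
```

A missing exit code is deliberately not treated as "already OCRed", so errors from a backend that does not report a code would keep being handled as errors.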

Apr 09 '25 19:04 R0Wi

@R0Wi Thanks for adding the exit code for remote. I just updated my setup and was able to receive the exit code accordingly. I'll come back with a pull request soon.

Apr 26 '25 05:04 XueSheng-GIT