workflow_ocr
PDF with text layer errors if OCR_MODE_SKIP_FILE is set
Describe the bug
Since https://github.com/R0Wi-DEV/workflow_ocr/pull/288, OCR_MODE_SKIP_FILE results in an error instead of a warning if the PDF contains a text layer (no OCR required). Generally, I would not expect an error in this scenario.
Warning LOG (Before above pull):
OCRmyPDF succeeded with warning(s): PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr
Error LOG (After above pull):
OCRmyPDF exited abnormally with exit-code 6 for file /admin/files/FileDrop/EasyOCR-TEST.pdf. Message: PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr
System
- App version: 1.31
- Nextcloud version: 30.0.5
- PHP version: 8.3
- Environment: native Apache
- OCRmyPDF version: 15.2.0+dfsg1
How to reproduce
Steps to reproduce the behavior:
- Go to workflow setting
- Create OCR flow with "skip file" setting (see screenshot below)
- Upload pdf file with text layer
- Get error message
Screenshots
Skip file setting:
Server log
{"reqId":"NC6PrjvDvA6A6k68Wv8q","level":3,"time":"2025-02-05T21:49:04+01:00","remoteAddr":"","user":"--","app":"workflow_ocr","method":"","url":"--","message":"OCRmyPDF exited abnormally with exit-code 6 for file /admin/files/FileDrop/EasyOCR-TEST.pdf. Message: PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr","userAgent":"--","version":"30.0.5.1","exception":{"Exception":"OCA\\WorkflowOcr\\Exception\\OcrNotPossibleException","Message":"OCRmyPDF exited abnormally with exit-code 6 for file /admin/files/FileDrop/EasyOCR-TEST.pdf. Message: PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr","Code":0,"Trace":[{"file":"/var/www/nextcloud/apps/workflow_ocr/lib/Service/OcrService.php","line":152,"function":"ocrFile","class":"OCA\\WorkflowOcr\\OcrProcessors\\OcrMyPdfBasedProcessor","type":"->"},{"file":"/var/www/nextcloud/apps/workflow_ocr/lib/Service/OcrService.php","line":128,"function":"runOcrProcess","class":"OCA\\WorkflowOcr\\Service\\OcrService","type":"->"},{"file":"/var/www/nextcloud/apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php","line":57,"function":"runOcrProcessWithJobArgument","class":"OCA\\WorkflowOcr\\Service\\OcrService","type":"->"},{"file":"/var/www/nextcloud/lib/public/BackgroundJob/Job.php","line":61,"function":"run","class":"OCA\\WorkflowOcr\\BackgroundJobs\\ProcessFileJob","type":"->"},{"file":"/var/www/nextcloud/lib/public/BackgroundJob/QueuedJob.php","line":43,"function":"start","class":"OCP\\BackgroundJob\\Job","type":"->"},{"file":"/var/www/nextcloud/lib/public/BackgroundJob/QueuedJob.php","line":29,"function":"start","class":"OCP\\BackgroundJob\\QueuedJob","type":"->"},{"file":"/var/www/nextcloud/core/Command/Background/Job.php","line":71,"function":"execute","class":"OCP\\BackgroundJob\\QueuedJob","type":"->"},{"file":"/var/www/nextcloud/3rdparty/symfony/console/Command/Command.php","line":326,"function":"execute","class":"OC\\Core\\Command\\Background\\Job","type":"->"},{"file":"/var/www/nextcloud/3rdparty/symfony/console/Application.php","line":1078,"function":"run","class":"Symfony\\Component\\Console\\Command\\Command","type":"->"},{"file":"/var/www/nextcloud/3rdparty/symfony/console/Application.php","line":324,"function":"doRunCommand","class":"Symfony\\Component\\Console\\Application","type":"->"},{"file":"/var/www/nextcloud/3rdparty/symfony/console/Application.php","line":175,"function":"doRun","class":"Symfony\\Component\\Console\\Application","type":"->"},{"file":"/var/www/nextcloud/lib/private/Console/Application.php","line":183,"function":"run","class":"Symfony\\Component\\Console\\Application","type":"->"},{"file":"/var/www/nextcloud/console.php","line":87,"function":"run","class":"OC\\Console\\Application","type":"->"},{"file":"/var/www/nextcloud/occ","line":11,"args":["/var/www/nextcloud/console.php"],"function":"require_once"}],"File":"/var/www/nextcloud/apps/workflow_ocr/lib/OcrProcessors/OcrMyPdfBasedProcessor.php","Line":76,"message":"OCRmyPDF exited abnormally with exit-code 6 for file /admin/files/FileDrop/EasyOCR-TEST.pdf. Message: PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr","exception":[],"CustomMessage":"OCRmyPDF exited abnormally with exit-code 6 for file /admin/files/FileDrop/EasyOCR-TEST.pdf. Message: PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr"},"id":"67a3d2209a6bc"}
Browser log
n/a
Additional context
n/a
Thanks for reporting this. Now that the exit code is captured properly, we should take into account the different exit codes defined here: https://ocrmypdf.readthedocs.io/en/latest/advanced.html#return-code-policy
Just did some initial tests with Nextcloud 31 and workflow_ocr_backend and this issue also applies if the backend is in use.
Hi @XueSheng-GIT, thanks again for your feedback. I can see that you already experimented with the code. We would love to see a new PR to fix this 👍
I would check this code here: https://github.com/R0Wi-DEV/workflow_ocr/blob/3fa3ad121e711b04c1faa17dac5f2d34364b9d00/lib/OcrProcessors/Local/OcrMyPdfBasedProcessor.php#L66
If the exit code is 6 (see exit code table of ocrmypdf mentioned above), we could just throw a new OcrResultEmptyException (instead of OcrNotPossibleException) with some meaningful message. This would log a warning and wouldn't touch the existing file.
If you need any help, please let me know 💯
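The exit-code 6 handling suggested above could look roughly like the sketch below. It is a minimal Python illustration, not the app's actual PHP code: the two exception classes are hypothetical stand-ins for OcrResultEmptyException and OcrNotPossibleException, and `raise_for_exit_code` is an invented helper name. The only detail taken from this thread is that exit code 6 accompanies PriorOcrFoundError.

```python
# Exit code 6 is the code reported in the server log above (PriorOcrFoundError).
ALREADY_DONE_OCR = 6


class OcrResultEmpty(Exception):
    """Hypothetical stand-in for OcrResultEmptyException (logged as a warning)."""


class OcrNotPossible(Exception):
    """Hypothetical stand-in for OcrNotPossibleException (logged as an error)."""


def raise_for_exit_code(exit_code: int, stderr: str) -> None:
    """Translate an OCRmyPDF exit code into the matching exception, if any."""
    if exit_code == 0:
        return  # success, nothing to report
    if exit_code == ALREADY_DONE_OCR:
        # The file already has a text layer, so there is simply no new result.
        raise OcrResultEmpty(f"OCRmyPDF skipped the file: {stderr}")
    # Every other non-zero exit code is still treated as a real failure.
    raise OcrNotPossible(
        f"OCRmyPDF exited abnormally with exit-code {exit_code}: {stderr}"
    )
```

Keeping the special case limited to a single exit code means every other failure still surfaces as an error, which matches the behavior introduced by the pull request mentioned in the report.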
@R0Wi If I remember correctly, I wasn't able to find any handling of error codes for the 'remote' case. So I thought too much was missing to create a uniform fix using the defined error codes. Any chance of getting the error code from the remote backend?
@XueSheng-GIT you're absolutely right 😄 Just after I wrote the comment, I also checked the remote implementation and realized that the ErrorResult currently just contains an error message and no information about the exit code. I could offer to add this information to the API response of the Workflow OCR Backend App and let you know as soon as this is done. If you like you could then take over from there and do the adjustments in the workflow_ocr (frontend) app here? Of course I'm happy to support you wherever needed 👍
@XueSheng-GIT I added the exit code to both the backend and the client implementation. The latter is implemented in https://github.com/R0Wi-DEV/workflow_ocr/pull/302. You should now be able to read the (optional) exit code via getOcrMyPdfExitCode when the backend returns an error. Let me know if you need further assistance. Looking forward to reviewing your pull request 👍
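Because the exit code from the backend is optional, a caller has to decide what to do when it is missing. The sketch below is one possible approach, written in Python for illustration only: `ErrorResult`, its `exit_code` field, and `log_level_for` are hypothetical names, not the real client API (which exposes getOcrMyPdfExitCode).

```python
from dataclasses import dataclass
from typing import Optional

ALREADY_DONE_OCR = 6  # exit code seen in the server log above


@dataclass
class ErrorResult:
    """Hypothetical model of an error returned by the remote backend."""
    message: str
    exit_code: Optional[int] = None  # None if the backend reported no exit code


def log_level_for(result: ErrorResult) -> str:
    """Return 'warning' for an already-OCRed file and 'error' otherwise."""
    if result.exit_code == ALREADY_DONE_OCR:
        return "warning"
    # Missing or unknown exit codes stay errors so real failures are not hidden.
    return "error"
```

Defaulting to "error" when no exit code is available keeps the current behavior for backends that do not report one.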
@R0Wi Thanks for adding the exit code for remote. I just updated my setup and was able to receive the exit code accordingly. I'll come back with a pull request soon.