workflow_ocr Alpha channel

Open • bonswouar opened this issue 9 months ago • 1 comment

Describe the bug

OCRmyPDF did not produce any output for file [...]

System

App version: 1.30.1
Nextcloud version: 30.0.5
PHP version: 8.2.27
Environment: native PHP-FPM + Caddy
ocrmypdf version: 14.0.1+dfsg1

How to reproduce

Steps to reproduce the behavior:

1. Take a screenshot with the Windows Snipping Tool
2. Upload it to Nextcloud with matching rules for the OCR workflow
3. See error

Server log

OCRmyPDF succeeded with warning(s): The input image has an alpha channel. Remove the alpha channel first.\nUnsupportedImageFormatError
OCRmyPDF did not produce any output for file [...]

Additional context

Same error happens if I use ocrmypdf directly with this image. Afaik the alpha channel is totally useless here as it's just a basic screenshot. Could we remove it before running ocrmypdf?

Feb 08 '25 13:02 bonswouar

Thanks for this idea. I did a bit of research and found this issue, which describes exactly your problem. It seems that img2pdf, which ocrmypdf uses internally, refuses images with an alpha channel because the PDF spec itself doesn't allow it. To cite the original issue:

The whole point of img2pdf is lossless conversion. If you want lossy conversion, then there exist tons of other tools already that are doing this task for you just fine (see README.md). Every additional command line option makes a tool harder to use because there is more documentation to read (not only more options have to be documented but also text has to be added like "yes, it is lossless except in these and those conditions..."). Every additional command line option makes the code more fragile because more conditions have to be handled and more unit tests have to be written. In contrast to all these downsides, the only thing that you have to do is to add a single line of code. So given how it is basically free for you to do this conversion, the additional complexity a "--allow-lossy" option would mean to img2pdf is not justified.

So the only option does seem to be removing the alpha channel before passing the image to ocrmypdf. Something like this could work (I haven't tested it yet):

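$image = new Imagick($imagePath);

// Check if the image has an alpha channel
if ($image->getImageAlphaChannel()) {
    // Composite over a white background, then drop the alpha channel
    $image->setImageBackgroundColor('white');
    $image->setImageAlphaChannel(Imagick::ALPHACHANNEL_REMOVE);
    // mergeImageLayers() returns a new Imagick object, so keep the result
    $flattened = $image->mergeImageLayers(Imagick::LAYERMETHOD_FLATTEN);

    // Save the new image
    $newImagePath = 'path/to/your/new_image.png';
    $flattened->writeImage($newImagePath);
    $flattened->clear();
}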

// Clean up
$image->clear();
$image->destroy();
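
As a cheap pre-check before loading the file into Imagick at all, the PNG header could be inspected directly. The following is only a sketch, not part of workflow_ocr: the function name `pngDeclaresAlpha` is hypothetical, and it assumes the input is a PNG (as Snipping Tool screenshots usually are). It only looks at the IHDR color type (4 = grayscale with alpha, 6 = RGBA) and ignores palette images that get transparency through a tRNS chunk.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns true if the given PNG data declares an
 * alpha channel in its IHDR chunk. Only the first 26 bytes are needed.
 * Anything that does not look like a PNG yields false.
 */
function pngDeclaresAlpha(string $bytes): bool
{
    // Layout: 8-byte signature, 4-byte chunk length, "IHDR",
    // width (4), height (4), bit depth (1), color type (1).
    if (strlen($bytes) < 26) {
        return false;
    }
    if (substr($bytes, 0, 8) !== "\x89PNG\r\n\x1a\n") {
        return false;
    }
    if (substr($bytes, 12, 4) !== 'IHDR') {
        return false;
    }

    // Color type 4 = grayscale + alpha, 6 = truecolor + alpha.
    $colorType = ord($bytes[25]);
    return $colorType === 4 || $colorType === 6;
}
```

It could be called with the first bytes of the uploaded file, for example `pngDeclaresAlpha((string) file_get_contents($imagePath, false, null, 0, 32))`, and the Imagick conversion would then only run when it returns true.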

I'll experiment a bit over the next few days and let you know my results.

Feb 10 '25 05:02 R0Wi
