workflow_ocr
Alpha channel
Describe the bug
OCRmyPDF did not produce any output for file [...]
System
- App version: 1.30.1
- Nextcloud version: 30.0.5
- PHP version: 8.2.27
- Environment: native PHP-FPM + Caddy
- ocrmypdf version: 14.0.1+dfsg1
How to reproduce
Steps to reproduce the behavior:
- Take a screenshot with Windows Snipping tool
- Upload it to Nextcloud with matching rules for OCR workflow
- See error
Server log
OCRmyPDF succeeded with warning(s): The input image has an alpha channel. Remove the alpha channel first.\nUnsupportedImageFormatError
OCRmyPDF did not produce any output for file [...]
Additional context
Same error happens if I use ocrmypdf directly with this image.
Afaik the alpha channel is totally useless here as it's just a basic screenshot.
Could we remove it before running ocrmypdf?
Thanks for this idea. I did a bit of research and found this issue, which describes exactly your problem. It seems that the img2pdf component, which ocrmypdf uses internally, rejects images with an alpha channel because the PDF spec itself doesn't allow it. To cite the original issue:
The whole point of img2pdf is lossless conversion. If you want lossy conversion, then there exist tons of other tools already that are doing this task for you just fine (see README.md). Every additional command line option makes a tool harder to use because there is more documentation to read (not only more options have to be documented but also text has to be added like "yes, it is lossless except in these and those conditions..."). Every additional command line option makes the code more fragile because more conditions have to be handled and more unit tests have to be written. In contrast to all these downsides, the only thing that you have to do is to add a single line of code. So given how it is basically free for you to do this conversion, the additional complexity a "--allow-lossy" option would mean to img2pdf is not justified.
So indeed the only option seems to be to remove any alpha channel before passing the image to ocrmypdf. Something like this could work (I haven't tested it yet):
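$image = new Imagick($imagePath);
// Check if the image has an alpha channel
if ($image->getImageAlphaChannel()) {
    // Transparent pixels are composited onto the background color, so set it first
    $image->setImageBackgroundColor('white');
    // Remove the alpha channel (flattens onto the background in place).
    // mergeImageLayers() is not needed here: it returns a new Imagick object
    // instead of modifying $image, so calling it without using the result does nothing.
    $image->setImageAlphaChannel(Imagick::ALPHACHANNEL_REMOVE);
    // Save the new image
    $newImagePath = 'path/to/your/new_image.png';
    $image->writeImage($newImagePath);
}
// Clean up (clear() releases the resources; destroy() is a deprecated alias)
$image->clear();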
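To avoid loading every upload into Imagick, a cheap pre-check could read the PNG header and only flatten images that actually declare an alpha channel. The following is a minimal sketch in plain PHP, not tested against workflow_ocr: the function name `pngDeclaresAlpha` is hypothetical, it only handles PNG input, and it does not detect palette images that carry transparency in a separate tRNS chunk.

```php
<?php
declare(strict_types=1);

/**
 * Returns true if the given bytes start with a PNG header whose
 * colour type declares an alpha channel.
 * Hypothetical helper for illustration, not part of workflow_ocr.
 */
function pngDeclaresAlpha(string $bytes): bool
{
    // 8-byte signature + 4-byte chunk length + 4-byte "IHDR" tag + 13 bytes of header data
    if (strlen($bytes) < 29 || substr($bytes, 0, 8) !== "\x89PNG\r\n\x1a\n") {
        return false;
    }
    // The first chunk of a valid PNG must be IHDR
    if (substr($bytes, 12, 4) !== 'IHDR') {
        return false;
    }
    // IHDR data starts at offset 16: width (4), height (4), bit depth (1), colour type (1)
    $colorType = ord($bytes[25]);
    // 4 = greyscale with alpha, 6 = truecolour with alpha
    return $colorType === 4 || $colorType === 6;
}
```

Only the first 29 bytes are needed, so a caller could pass `file_get_contents($imagePath, false, null, 0, 29)` and run the Imagick conversion only when the function returns true.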
I will experiment a bit over the next few days and let you know the results.