Pass clean image to Tesseract to avoid conversion warnings
Hi, first of all, thanks for a great module! :-) By spending a couple of hours creating templates for my documents, I will be saving hours each month! :+1:
However, some of my documents are scanned, so I am using tesseract to extract the text from them. This works fine and as expected, but it does generate some errors on stderr:
convert-im6.q16: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `/tmp/magick-47439a8uUu8VvEXM' @ warning/png.c/MagickPNGWarningHandler/1654.
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 492
I'm running the invoice2data module from a custom python script which does everything I need it to in 1 go, so I'd like to see these error messages go away.
The first one seems to indicate that setting an RGB profile for grayscale images is not allowed; I haven't been able to figure out what causes the other messages yet (could be due to my lack of imagemagick skills).
My system:
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic
Python 2.7.15rc1
Invoice2data 0.2.101.dist
I could simply suppress the stderr messages (as mentioned, the module works as expected), but that seems wrong and there is no clean way to do this from within python without modifying the module directly.
Any idea what the issue could be? And can I fix this with settings or is it a bug in the module? Any help is much appreciated!
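For reference, the stderr suppression mentioned above can be done from the calling script without touching the module. This is a minimal sketch, assuming a POSIX system; `suppress_child_stderr` is a hypothetical helper name, not part of invoice2data. It redirects file descriptor 2 itself, because replacing `sys.stderr` alone would not affect the `convert` and `tesseract` child processes, which inherit the descriptor.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def suppress_child_stderr():
    """Temporarily point file descriptor 2 at os.devnull.

    Hypothetical helper: child processes inherit fd 2, so their
    warnings are silenced as well while the block is active.
    """
    sys.stderr.flush()
    saved_fd = os.dup(2)                      # remember the real stderr
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull_fd, 2)                # fd 2 now writes to devnull
        yield
    finally:
        sys.stderr.flush()
        os.dup2(saved_fd, 2)                  # restore the real stderr
        os.close(devnull_fd)
        os.close(saved_fd)
```

Wrapping only the extraction call in `with suppress_child_stderr():` keeps the rest of the script's error output intact. Note that this hides genuine errors from the wrapped call too, so it is a workaround rather than a fix.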
Those messages come from Tesseract. If there is a switch to turn them off, we can add it to invoice2data.
Or there is an issue with the image passed to Tesseract. This I can look into.
Does this issue persist when using the tesseract4 input module?
This issue is caused by how this module pre-processes the image. The document is converted to TIFF, with some other parameters applied. However, ImageMagick is not very good at carrying the resolution data found in the EXIF of the input file over to an output file in a different format (and some output formats don't allow EXIF data at all).
The conversion code is in: https://github.com/invoice-x/invoice2data/blob/36863386692a2393d561cdaa1e3a5e4938c34935/src/invoice2data/input/tesseract.py#L27-L38
From my testing, I got better OCR results by passing my PNG files directly to Tesseract. In my opinion, the flattening of the image in this code does more harm than good. Tesseract uses Leptonica to read and pre-process the image.
Still, the conversion code is required for other file formats, as Tesseract can't handle PDF files as input directly.
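The idea above could be sketched roughly as follows: hand image files straight to Tesseract and only run ImageMagick for formats Tesseract cannot read. This is an illustration, not the code from the linked file or from the eventual fix. `DIRECT_FORMATS`, `build_commands`, `to_text`, and the density of 350 are assumptions made for the example, and the set of directly readable formats depends on how Leptonica was built.

```python
import os
import subprocess

# Assumed set of formats Tesseract can read directly via Leptonica.
DIRECT_FORMATS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def build_commands(path, density=350):
    """Return the commands to run, piped in order, to OCR `path`."""
    ext = os.path.splitext(path)[1].lower()
    if ext in DIRECT_FORMATS:
        # No conversion step: Tesseract reads the untouched image.
        return [["tesseract", path, "stdout"]]
    # Other inputs (e.g. PDF) are rasterised first and piped in as TIFF.
    convert = ["convert", "-density", str(density), path,
               "-depth", "8", "tiff:-"]
    return [convert, ["tesseract", "stdin", "stdout"]]


def to_text(path):
    """Run the pipeline and return the recognised text."""
    data = None
    for cmd in build_commands(path):
        # Feed the previous command's stdout into the next one's stdin.
        data = subprocess.run(cmd, input=data, stdout=subprocess.PIPE,
                              check=True).stdout
    return data.decode("utf-8")
```

With this split, `build_commands("scan.png")` yields a single Tesseract call, while `build_commands("invoice.pdf")` yields a `convert` step followed by Tesseract reading from stdin.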
Sorry, the fix fell out of the previous commit. I'm currently rewriting the Tesseract modules and will include it in the new commit.
Fixed by #421