Tess4j does not correctly handle images with alpha channel

If an image has an alpha channel (regardless of whether it contains any actually transparent pixels), the OCR output is empty.

I had one TIFF image that wouldn't OCR, and it took me QUITE a long time of trial and error to figure out why this one file failed while other, seemingly identical ones worked. When calling tesseract(.exe) directly, the image is OCR'ed correctly.

Tess4J should either throw an exception or do the OCR.

Minimal SSCCE:

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;

import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import java.awt.*;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;

public class Tess4JTransparentTiffErrPOC {

    private static final String PATH_TO_TESS_DATA = "c:\\opt\\lib\\tessdata-4.1.0";
    private static final Path transparencyTiff = Path.of(Objects.requireNonNull(System.getenv("USERPROFILE"))).resolve("Desktop")
            .resolve("wikipedia_bio_margret_ives_abbot.tiff");

    public static void main(String[] args)  {

        ITesseract instance = new Tesseract();
        instance.setDatapath(PATH_TO_TESS_DATA);

        try {
            String result = instance.doOCR(transparencyTiff.toFile());
            System.out.printf("Unmodified input OCR is length %d:%n", result.length());
            System.out.println(result);

            //now flatten the image
            BufferedImage img = flattenImage(transparencyTiff);

            result = instance.doOCR(img);
            System.out.printf("Flattened input OCR is length %d:%n", result.length());
            System.out.println(result);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /** @noinspection SameParameterValue*/
    private static BufferedImage flattenImage(Path path) {
        final ImageReader imageReader = ImageIO.getImageReadersBySuffix("tiff").next();
        try (ImageInputStream is = ImageIO.createImageInputStream(Files.newInputStream(path))) {
            imageReader.setInput(is);
            // Read the first page of the TIFF
            final BufferedImage pageImage = imageReader.read(0);
            // Draw it onto an opaque white RGB canvas, which drops the alpha channel
            final BufferedImage flattened = new BufferedImage(pageImage.getWidth(), pageImage.getHeight(), BufferedImage.TYPE_INT_RGB);
            Graphics2D graphics = flattened.createGraphics();
            graphics.setColor(Color.WHITE);
            graphics.fillRect(0, 0, flattened.getWidth(), flattened.getHeight());
            graphics.drawImage(pageImage, 0, 0, null);
            graphics.dispose();
            return flattened;
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        } finally {
            // Release the reader's native/plugin resources
            imageReader.dispose();
        }
    }
}

wikipedia_bio_margret_ives_abbot.tiff.gz

Jun 17 '23 17:06 waljohn

You'd need to preprocess such images. We applied the monochrome filter in VietOCR, which uses Tess4J, and were able to get the OCR text.

Jun 25 '23 16:06 nguyenq

In the example, flattenImage basically "preprocesses" the image.

Without that operation, VietOCR also fails to produce any OCR text.

So I believe that behavior is a symptom of the same bug.

Jun 25 '23 17:06 waljohn

Since Tesseract probably performs this preprocessing step when it reads an image, you'll need to do the same in your Java code: the Tess4J wrapper does not include any image preprocessing; it only reads the image data and sends it to the engine.
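
A minimal sketch of what such a preprocessing step could look like, using only the JDK. It checks for an alpha channel and flattens onto a white background only when one is present. The class and method names (`AlphaFlattener`, `flattenIfAlpha`) are hypothetical and not part of Tess4J, and the white background is an assumption that matches the SSCCE above:

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

/** Hypothetical helper for illustration; not part of Tess4J. */
final class AlphaFlattener {

    private AlphaFlattener() {
    }

    /**
     * Returns the source image unchanged if it has no alpha channel;
     * otherwise returns an opaque RGB copy composited over white.
     */
    static BufferedImage flattenIfAlpha(BufferedImage source) {
        if (!source.getColorModel().hasAlpha()) {
            return source; // nothing to flatten
        }
        BufferedImage flattened = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = flattened.createGraphics();
        try {
            // Assumed background colour: white, as in the SSCCE
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, flattened.getWidth(), flattened.getHeight());
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return flattened;
    }
}
```

The returned image can then be passed to `doOCR(BufferedImage)`, as the SSCCE does with its flattened image.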

Jun 26 '23 16:06 nguyenq

https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#transparency--alpha-channel

Jun 26 '23 22:06 nguyenq

That's for version 3.

This is version 5.

https://github.com/nguyenq/tess4j/tree/master/src/main/resources/win32-x86-64

Jul 01 '23 19:07 waljohn

@waljohn Please submit a PR.

Jul 06 '23 01:07 nguyenq

We may need to debug and trace through the native code to determine what preprocessing is performed for this kind of image.

May 27 '24 14:05 nguyenq

@waljohn We found out that this issue is the same as https://github.com/nguyenq/tess4j/issues/264.

The Tesseract OCR engine did not perform any special preprocessing on this image. The CLI uses TextRenderer, not GetUTF8Text (which doOCR calls), to create the output text file. If you used the renderer in your program, you'd get the expected matching results. You can verify this with VietOCR's Bulk OCR function, which uses the renderers.

May 28 '24 05:05 nguyenq
