tess4j
Tess4J does not correctly handle images with an alpha channel
If an image has an alpha channel (regardless of whether the image has any actually transparent pixels), the OCR output is empty.
I had one TIFF image that wouldn't OCR, and it took me quite a long time of trial and error to figure out why this one file failed while other, seemingly identical ones worked. When calling tesseract(.exe) directly, the image is OCR'ed correctly.
Tess4J should either throw an exception or do the OCR.
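To make the distinction above concrete, here is a minimal sketch using only the JDK that separates "the image has an alpha channel" from "the image has transparent pixels". The class and method names (`AlphaChannelCheck`, `hasAlphaChannel`, `hasTransparentPixels`) are made up for illustration and are not part of Tess4J.

```java
import java.awt.image.BufferedImage;

public class AlphaChannelCheck {

    /** True if the color model carries an alpha channel, even when every pixel is opaque. */
    public static boolean hasAlphaChannel(BufferedImage image) {
        return image.getColorModel().hasAlpha();
    }

    /** True only if at least one pixel is not fully opaque. */
    public static boolean hasTransparentPixels(BufferedImage image) {
        if (!hasAlphaChannel(image)) {
            return false;
        }
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                // getRGB returns ARGB; the top 8 bits are the alpha value
                int alpha = image.getRGB(x, y) >>> 24;
                if (alpha != 0xFF) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // An ARGB image whose pixels are all opaque white:
        // it has an alpha channel but no transparent pixels.
        BufferedImage opaqueArgb = new BufferedImage(4, 4, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < 4; y++) {
            for (int x = 0; x < 4; x++) {
                opaqueArgb.setRGB(x, y, 0xFFFFFFFF);
            }
        }
        System.out.println("alpha channel: " + hasAlphaChannel(opaqueArgb));
        System.out.println("transparent pixels: " + hasTransparentPixels(opaqueArgb));
    }
}
```

According to this report, the first check is the one that predicts the empty OCR output; the second is irrelevant to it.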
Minimum SSCCE
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;

import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import java.awt.*;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;

public class Tess4JTransparentTiffErrPOC {

    private static final String PATH_TO_TESS_DATA = "c:\\opt\\lib\\tessdata-4.1.0";
    private static final Path transparencyTiff = Path.of(Objects.requireNonNull(System.getenv("USERPROFILE")))
            .resolve("Desktop")
            .resolve("wikipedia_bio_margret_ives_abbot.tiff");

    public static void main(String[] args) {
        ITesseract instance = new Tesseract();
        instance.setDatapath(PATH_TO_TESS_DATA);
        try {
            String result = instance.doOCR(transparencyTiff.toFile());
            System.out.printf("Unmodified input OCR is length %d:%n", result.length());
            System.out.println(result);

            // now flatten the image
            BufferedImage img = flattenImage(transparencyTiff);
            result = instance.doOCR(img);
            System.out.printf("Flattened input OCR is length %d:%n", result.length());
            System.out.println(result);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /** @noinspection SameParameterValue */
    private static BufferedImage flattenImage(Path path) {
        final ImageReader imageReader = ImageIO.getImageReadersBySuffix("tiff").next();
        try (ImageInputStream is = ImageIO.createImageInputStream(Files.newInputStream(path))) {
            imageReader.setInput(is);
            final BufferedImage pageImage = imageReader.read(0);
            // Draw the page onto an opaque white RGB canvas to drop the alpha channel
            final BufferedImage flattened = new BufferedImage(pageImage.getWidth(), pageImage.getHeight(), BufferedImage.TYPE_INT_RGB);
            Graphics2D graphics = flattened.createGraphics();
            graphics.setColor(Color.WHITE);
            graphics.fillRect(0, 0, flattened.getWidth(), flattened.getHeight());
            graphics.drawImage(pageImage, 0, 0, null);
            graphics.dispose();
            return flattened;
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
    }
}
You'd need to preprocess such images. We applied the monochrome filter in VietOCR, which uses Tess4J, and were able to get the OCR text.
In the example, flattenImage basically "preprocesses" the image.
Without that operation, VietOCR also fails to produce any OCR text,
so I believe that behavior is the same bug.
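As a sketch of how the flattenImage workaround from the SSCCE could be applied only when needed, the helper below flattens a BufferedImage onto a white background if, and only if, its color model reports an alpha channel. It uses only the JDK; the names `AlphaFlattener` and `flattenIfAlpha` are hypothetical, and the white background is an assumption carried over from the SSCCE.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class AlphaFlattener {

    /**
     * Returns the image unchanged if it has no alpha channel; otherwise returns
     * an opaque RGB copy composited over a white background.
     */
    public static BufferedImage flattenIfAlpha(BufferedImage source) {
        if (!source.getColorModel().hasAlpha()) {
            return source;
        }
        BufferedImage flattened = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D graphics = flattened.createGraphics();
        try {
            // Paint the background first, then composite the source on top of it
            graphics.setColor(Color.WHITE);
            graphics.fillRect(0, 0, flattened.getWidth(), flattened.getHeight());
            graphics.drawImage(source, 0, 0, null);
        } finally {
            graphics.dispose();
        }
        return flattened;
    }

    public static void main(String[] args) {
        // A freshly created ARGB image is fully transparent
        BufferedImage withAlpha = new BufferedImage(8, 8, BufferedImage.TYPE_INT_ARGB);
        BufferedImage result = flattenIfAlpha(withAlpha);
        System.out.println("has alpha after flattening: " + result.getColorModel().hasAlpha());
    }
}
```

The returned image can then be passed to `instance.doOCR(BufferedImage)` in the same way the SSCCE passes its flattened image.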
Tesseract probably performs this preprocessing step itself when it reads an image. You'll need to do the same in your Java code, because the Tess4J wrapper does not include any image preprocessing; it only reads the image data and sends it to the engine.
https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#transparency--alpha-channel
That's for version 3.
This is version 5.
https://github.com/nguyenq/tess4j/tree/master/src/main/resources/win32-x86-64
@waljohn Please submit a PR.
We may need to debug and trace through the native code to determine what preprocessing is performed for this kind of image.
@waljohn We found out that this issue is identical to https://github.com/nguyenq/tess4j/issues/264.
The Tesseract OCR engine did not perform any special preprocessing on this image. To create the output text file, the CLI uses TextRenderer, not GetUTF8Text, which is what doOCR calls. If you used the renderer in your program, you'd get the expected matching results. You can verify this with VietOCR's Bulk OCR function, which uses the renderers.