
Tess4j OCR results worse than CLI

Open mmatela opened this issue 1 year ago • 7 comments

Using tesseract 5.4.1 and tess4j-5.13.0 (but also seen the same behavior with tess4j-5.4.0)

Sample image: ocrtest.png (attached)

When using command line, the results are perfect:

$ tesseract ocrtest.png stdout --oem 1 --psm 7 -l pol --tessdata-dir /usr/share/tesseract-ocr/5/tessdata/
kogokolwiek, gdziekolwiek

I'm trying to invoke the same through tess4j with the following java code:

Tesseract tesseract = new Tesseract();
tesseract.setDatapath("/usr/share/tesseract-ocr/5/tessdata/");
tesseract.setLanguage("pol");
tesseract.setOcrEngineMode(1);
tesseract.setPageSegMode(7);
BufferedImage testImg = ImageIOHelper.getImageList(new File("ocrtest.png")).get(0);
String result = tesseract.doOCR(testImg);
System.out.println(result);

The result is an empty string! I tried different psms and it prints something for 6 (PSM_SINGLE_BLOCK), but not a fully correct result: | kogokolwiek, geziekolwiek. Anyway, psm 7 (PSM_SINGLE_LINE) looks like it should work best, since the image contains a single line.

As advised in #264 and related issues, I tried VietOCR and see the same result (nothing recognized by default, imperfect result with psm6).

mmatela avatar Sep 11 '24 15:09 mmatela

Using the Polish language isn't actually necessary to demonstrate the problem; with English it's similar: the result is a bit mangled in the CLI, more mangled with tess4j psm 6, and empty with tess4j psm 7.

mmatela avatar Sep 11 '24 15:09 mmatela

Please see https://github.com/nguyenq/tess4j/issues/264#issuecomment-2134372852

nguyenq avatar Sep 12 '24 04:09 nguyenq

Thanks for the pointer, I missed that comment, but it doesn't seem to solve my problem. If I understand correctly, the TextRenderer is only available when calling Tesseract.createDocuments and not in Tesseract.doOCR. But createDocuments doesn't let me define rectangles to process only parts of the input image, which is the main advantage of tess4j for me (otherwise I could just use ProcessBuilder to invoke the CLI).

Am I missing something? What's the best way forward? Would it be possible to add renderer selection to the doOCR API? Or are there any tricks to process only parts of an image with createDocuments or with the CLI? Otherwise I guess I would have to save these parts as separate temporary files...

mmatela avatar Sep 12 '24 07:09 mmatela

Also, I just noticed a scary sentence in https://github.com/nguyenq/tess4j/issues/264#issuecomment-2130390181

> It's possible or likely that Tesseract CLI performs some basic image preprocessing before the OCR stage. You may have to perform similar preprocessing yourself when using tess4j.

Do you still think that's true? That should be a big red warning label on the front page. I don't think many users are aware that they likely get worse OCR results than with the CLI, unless they perform additional research and reimplement their own preprocessing.

mmatela avatar Sep 12 '24 08:09 mmatela

> Thanks for the pointer, I missed that comment, but it doesn't seem to solve my problem. If I understand correctly, the TextRenderer is only available when calling Tesseract.createDocuments and not in Tesseract.doOCR. But createDocuments doesn't let me define rectangles to process only parts of the input image, which is the main advantage of tess4j for me (otherwise I could just use ProcessBuilder to invoke the CLI).
>
> Am I missing something? What's the best way forward? Would it be possible to add renderer selection to the doOCR API? Or are there any tricks to process only parts of an image with createDocuments or with the CLI? Otherwise I guess I would have to save these parts as separate temporary files...

The TextRenderer API expects a path to an image file as input and outputs to a file on the local filesystem. It does not accept specified ROIs. The CLI does not seem to support ROIs either.

https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html

So if you want to use createDocuments on part of an image, you would need to crop it first and save the subimage to the local filesystem before invoking createDocuments. doOCR, which calls Tesseract's GetUTF8Text function behind the scenes, supports the use of ROIs, but the GetUTF8Text API, as opposed to the TextRenderer API, follows a different execution path inside the Tesseract engine and hence can produce a different result.
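The crop-and-save workaround described above can be sketched in Java. The class and method names here are illustrative, not part of tess4j; only `BufferedImage.getSubimage` and `ImageIO.write` from the JDK are assumed:

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class RoiCrop {
    /**
     * Crops a region of interest out of a page image and saves it as a
     * temporary PNG, so the resulting file path can be handed to
     * createDocuments, which accepts only image files, not in-memory ROIs.
     */
    public static File cropToTempFile(BufferedImage page, int x, int y, int w, int h)
            throws IOException {
        BufferedImage roi = page.getSubimage(x, y, w, h);
        File tmp = File.createTempFile("roi-", ".png");
        tmp.deleteOnExit();
        ImageIO.write(roi, "png", tmp);
        return tmp;
    }
}
```

The returned path would then be passed to something like `tesseract.createDocuments(tmp.getPath(), "out", Arrays.asList(RenderedFormat.TEXT))`, assuming the tess4j 5.x createDocuments signature.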

nguyenq avatar Sep 13 '24 04:09 nguyenq

> Also, I just noticed a scary sentence in #264 (comment)
>
> > It's possible or likely that Tesseract CLI performs some basic image preprocessing before the OCR stage. You may have to perform similar preprocessing yourself when using tess4j.
>
> Do you still think that's true? That should be a big red warning label on the front page. I don't think many users are aware that they likely get worse OCR results than with the CLI, unless they perform additional research and reimplement their own preprocessing.

The Tesseract engine performs some minimal, basic image processing on input images, such as thresholding, before the recognition stage. Tess4j inherits the same benefits when it invokes the Tesseract API. For some images this may be sufficient; but for more complicated ones, the user may need to carry out additional preprocessing -- such as deskewing, denoising, binarization, etc. -- to improve recognition.

https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
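For illustration, the simplest kind of preprocessing mentioned above, binarization, might look like the following. This is only a sketch with a fixed global cutoff; a real pipeline would likely use an adaptive method such as Otsu, plus deskewing and denoising, and the class name is hypothetical:

```java
import java.awt.image.BufferedImage;

public class Binarize {
    /**
     * Fixed global-threshold binarization: pixels darker than the cutoff
     * become black, all others white.
     */
    public static BufferedImage threshold(BufferedImage src, int cutoff) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                // Rec. 601 luma approximation from the RGB channels
                int lum = (int) (0.299 * ((rgb >> 16) & 0xFF)
                               + 0.587 * ((rgb >> 8) & 0xFF)
                               + 0.114 * (rgb & 0xFF));
                out.setRGB(x, y, lum < cutoff ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }
}
```

The resulting BufferedImage would then be passed to doOCR as usual.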

nguyenq avatar Sep 13 '24 04:09 nguyenq

OK, I'll try to sum up what you said together with what I see in the tess4j code.

doOCR, depending on the selected output format, uses the API calls TessBaseAPIGet[...]Text, which work on images already loaded into memory and support regions of interest (ROIs) but don't do preprocessing, so OCR quality is likely worse.

createDocuments uses the API calls Tess[...]RendererCreate, which then go through TessBaseAPIProcessPages; that function only takes paths to image files and doesn't support ROIs, but it performs preprocessing.

And it would be great to enable preprocessing in doOCR, but it's impossible due to API limitations. Would it make sense to ask the Tesseract team to enhance the API in that regard?

mmatela avatar Sep 13 '24 09:09 mmatela

With a lot of help from AI I was able to set up a simple C++ project to test the API directly. It turns out that GetUTF8Text recognizes my example perfectly! So there must be something else going on, but I have no idea what to check next.

Here's the C++ code I used:

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <iostream>

int main() {
    char *outText;

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    if (api->Init("/usr/share/tesseract-ocr/5/tessdata/", "pol", tesseract::OEM_LSTM_ONLY)) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }
    api->SetPageSegMode(tesseract::PSM_SINGLE_LINE);

    Pix *image = pixRead("ocrtest.png");
    api->SetImage(image);
    outText = api->GetUTF8Text();
    printf("OCR output:\n%s", outText);

    api->End();
    delete api;
    delete [] outText;
    pixDestroy(&image);

    return 0;
}

mmatela avatar Sep 17 '24 10:09 mmatela

One step further: I got the correct result in Java by using TessAPI directly instead of the Tesseract wrapper:

TessAPI api = TessAPI.INSTANCE;
TessBaseAPI handle = api.TessBaseAPICreate();
api.TessBaseAPIInit2(handle, "/usr/share/tesseract-ocr/5/tessdata/", "pol", 1);
api.TessBaseAPISetPageSegMode(handle, 7);

BufferedImage bufImg = ImageIOHelper.getImageList(new File("/home/vagrant/tesseract-test/ocrtest.png")).get(0);

// variant 1: raw pixel bytes via ByteBuffer
ByteBuffer buff = ImageIOHelper.convertImageData(bufImg);
api.TessBaseAPISetImage(handle, buff, bufImg.getWidth(), bufImg.getHeight(), 1, bufImg.getWidth());

// variant 2: Leptonica Pix
//Pix pix = LeptUtils.convertImageToPix(bufImg);
//api.TessBaseAPISetImage2(handle, pix);

Pointer textPtr = api.TessBaseAPIGetUTF8Text(handle);
String str = textPtr.getString(0);
api.TessDeleteText(textPtr);
System.out.println(str);
// TODO more cleanup

Variant 1 should be the equivalent of Tesseract.doOCR(): it uses a ByteBuffer and prints nothing (or a mangled result with PSM=6), while variant 2, which uses Leptonica's Pix, prints the correct result. So could it be a problem with converting a BufferedImage into a ByteBuffer? I tried to copy the implementation of getImageByteBuffer used in LeptUtils, and it led to similar effects, but strangely not the same ones: the PSM=6 result was even more mangled.
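One way to probe the ByteBuffer hypothesis: Tesseract's SetImage expects bytes_per_pixel and bytes_per_line to match the actual pixel layout (its header documents bytes_per_pixel = 0 for packed 1-bpp binary images), while the call in variant 1 hard-codes 1 and width, which only fits 8-bpp grayscale data with no row padding. A small sketch, illustrative rather than tess4j code, of deriving those parameters from the BufferedImage itself:

```java
import java.awt.image.BufferedImage;

public class SetImageParams {
    /**
     * Derives SetImage's bytes_per_pixel and bytes_per_line from the
     * BufferedImage's pixel layout instead of hard-coding them.
     * Returns {bytesPerPixel, bytesPerLine}.
     */
    public static int[] paramsFor(BufferedImage img) {
        int bitsPerPixel = img.getColorModel().getPixelSize();
        // Tesseract's convention: bytes_per_pixel 0 means packed 1-bpp binary data
        int bytesPerPixel = bitsPerPixel / 8;
        // A packed row occupies ceil(width * bitsPerPixel / 8) bytes
        int bytesPerLine = (img.getWidth() * bitsPerPixel + 7) / 8;
        return new int[] { bytesPerPixel, bytesPerLine };
    }
}
```

For a 1-bpp TYPE_BYTE_BINARY page this yields bytes_per_pixel 0 and a packed row stride, which differs from the hard-coded values above and could explain the empty result.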

@nguyenq What do you think? Would you consider converting input to Pix in doOCR? Looking at https://github.com/tesseract-ocr/tesseract/blob/4f435363354a4c06730ee1b9a2b5facacf353d6b/src/api/baseapi.cpp#L521 it seems to be highly recommended.

mmatela avatar Sep 17 '24 15:09 mmatela

@mmatela We confirm it works with Pix; however, we want to use BufferedImage if possible, since it's Java's native image class. A client program would likely have preprocessed a BufferedImage object (in Java code) before sending it for OCR. We need to look at the getImageByteBuffer and/or convertImageData methods to properly get the pixel data out of the image object to pass to the SetImage method. Moreover, getting the pixel data directly is probably faster than converting a BufferedImage to a Pix.

If you found anything or came up with a PR, please submit it.

Thanks.

nguyenq avatar Jan 21 '25 03:01 nguyenq

@mmatela https://github.com/nguyenq/tess4j/commit/045a2d5900dd154a828b18acc38308a3233dbd4c has been committed to address the issue. The change was based on your analysis and code examples, and there was no impact on the processing speed. A new release with the fix has been published. Thank you very much.

nguyenq avatar Feb 15 '25 21:02 nguyenq