tesseract Repair boundingbox of individual characters of textangle 90 text

Open rmast opened this issue 2 years ago • 7 comments

Solution to issue #3590 (makebox doesn't output horizontal coordinates of textangle 90 content).

I traced these lines back to 2010, and nobody has touched them since. They are the most likely cause of RIL_SYMBOL being excluded from the matrix transformation at textangle 90.

TBOX rotate operations don't seem expensive, so it's not clear why the exclusion for RIL_SYMBOL was ever introduced.
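To illustrate why such a rotation is cheap, here is a minimal Python sketch of rotating an axis-aligned box by 90 degrees. It is not Tesseract's TBOX code: the `Box` type, the function name, and the choice of rotating counter-clockwise about the origin are all assumptions made for this example.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical axis-aligned box; NOT Tesseract's TBOX class."""
    left: int
    bottom: int
    right: int
    top: int


def rotate_box_90_ccw(box: Box) -> Box:
    """Rotate a box 90 degrees counter-clockwise about the origin.

    Every corner (x, y) maps to (-y, x). The result is re-normalised so
    that left <= right and bottom <= top still hold.
    """
    xs = (-box.bottom, -box.top)   # new x values come from the old y values
    ys = (box.left, box.right)     # new y values come from the old x values
    return Box(min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    print(rotate_box_90_ccw(Box(10, 20, 30, 60)))
```

The whole operation is two negations and a few comparisons per box, which is consistent with the observation that rotating symbol boxes should not be a performance concern.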

Oct 18 '21 20:10 rmast

Please remove the two unused code lines and fix the indentation of the remaining line.

Done

~Is that code only used when making boxes, or is it also used in recognition?~

I see this second question has been struck through. I don't know. I ran make check afterward, and once gigabytes of possibly dependent languages had been downloaded and put in place, all checks ran fine. It might be helpful to explicitly require only the needed set of languages in advance of make check.

Oct 28 '21 15:10 rmast

@stweil @zdenop Is there anything that could be done to get this PR merged? I can create a separate PR rebased on top of latest main branch tip, add tests, etc.

I have looked into all call paths that could potentially be affected by the code change. There aren't too many of them in the first place and all of them are within the auxiliary functionality of getting the results of tesseract out, not the recognition itself. As a consequence, the risk of a regression is probably relatively small.

The following is the list of APIs affected:

PageIterator::BoundingBoxInternal when level == RIL_SYMBOL
PageIterator::BoundingBox when level == RIL_SYMBOL
PageIterator::GetImage when level == RIL_SYMBOL
PageIterator::GetBinaryImage when level == RIL_SYMBOL
TessBaseAPI::GetComponentImages when level == RIL_SYMBOL
TessBaseAPI::GetConnectedComponents
TessBaseAPI::GetBoxText
output of hOCR renderer when hocr_char_boxes is enabled
output of BoxText renderer

Apr 21 '22 22:04 p12tic

I am sorry, but I have very little spare time for tesseract. The PR seems interesting, but as it affects the API, it should be well tested, including its effect on training.

Apr 28 '22 12:04 zdenop

Thanks for the response. I will do extensive testing and present the results in a way that requires as little time as possible to review.

Apr 28 '22 13:04 p12tic

Thanks for the response. I will do extensive testing and present the results in a way that requires as little time as possible to review.

@p12tic I'm interested in solving the bounding box problem. I will try to write regression tests with automatic measures covering more scripts, languages and fonts. It will take a while, provided I find the time in the coming weeks or months. The complicated part is creating ground truth with correct bounding boxes.

Apr 29 '22 12:04 wollmers

@wollmers This is great to hear. Is there any way to help? I could translate very high-level directions into working code :-) For you answering a small number of questions should take much less time than doing the implementation.

To me it seems that annotating ground truth images with correct bounding boxes is work that is not complicated in principle, but just needs a lot of effort for automation and reviewing. This would be a perfect task for an external developer like me to accomplish.

I'm assuming that you don't want to go the route of rendering text and OCRing the result images back, like when doing LSTM training in certain cases. In this case the character positions are essentially already known. Well, at least that's my understanding which could be completely wrong.

May 02 '22 00:05 p12tic

@p12tic

To me it seems that annotating ground truth images with correct bounding boxes is work that is not complicated in principle, but just needs a lot of effort for automation and reviewing. This would be a perfect task for an external developer like me to accomplish.

Sorry, I confused this PR with PR 3787. For 3787 (normal text without rotation) I wrote an approach over the weekend. See ocr-bbox-gt, written in prototype-quality Perl (without the dependencies, not published yet). If you can read it, you can port it to your favourite language, which is maybe Python.

For text angles other than 0 degrees, the text image can be rotated before OCR and the bounding boxes geometrically transformed back. For degrees other than a multiple of 90 a polygon notation is needed, something like x1,y1 x2,y2 x3,y3.
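The geometric back-transformation could look roughly like the following Python sketch. It assumes the box was found in a coordinate system that differs from the target one only by a rotation of `angle_deg` about a known centre `(cx, cy)`; a real pipeline would also have to account for any canvas resizing done by the image rotation. The function names and the rounding to whole pixels are choices made for this example, and the sign convention is simply the standard rotation matrix applied to the given coordinates.

```python
import math


def box_to_polygon(left, top, right, bottom, angle_deg, cx, cy):
    """Rotate the four corners of an axis-aligned box about (cx, cy).

    Returns a list of four (x, y) float tuples, in the order
    top-left, top-right, bottom-right, bottom-left of the input box.
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    corners = [(left, top), (right, top), (right, bottom), (left, bottom)]
    rotated = []
    for x, y in corners:
        dx, dy = x - cx, y - cy
        rotated.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return rotated


def format_polygon(points):
    """Render points in the 'x1,y1 x2,y2 ...' notation, rounded to pixels."""
    return " ".join(f"{round(x)},{round(y)}" for x, y in points)


if __name__ == "__main__":
    print(format_polygon(box_to_polygon(0, 0, 2, 1, 90, 0, 0)))
```

For multiples of 90 degrees the polygon stays axis-aligned and can be collapsed back into a plain box by taking the minimum and maximum of the corner coordinates.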

Just use a clean image of text, which has no recognition errors (CER 0.0). That's the case for the sample image in 3787. Then use a legacy model with --oem 0 which provides nearly perfect bounding boxes.

Now we can check the quality as follows:

count the width (and height) per character in a first scan of the bboxes
select the width with the highest frequency ("best width")
compare the best width against the actual width (abs(width1 - width2))
count how often each deviation value occurs

Of course this works only with clean, generated images in one and the same font, style and size. But we want to isolate the problem and reduce it to bbox errors only, and thus exclude all other reasons for errors.
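The check described above can be sketched in a few lines of Python. This is not the ocr-bbox-gt script, only an illustration of the same idea; the function names are made up here, and the default tolerance of 2 pixels is an assumption that mirrors the "+/-2" band used in the report in this comment.

```python
from collections import Counter, defaultdict


def width_frequencies(boxes):
    """First scan: count how often each width occurs per character.

    boxes is an iterable of (char, width) pairs.
    """
    freq = defaultdict(Counter)
    for char, width in boxes:
        freq[char][width] += 1
    return freq


def classify_widths(boxes, tolerance=2):
    """Compare every width against the most frequent ("best") width.

    Returns (summary, deviations): summary counts 'exact', 'in' (within
    the tolerance) and 'out'; deviations counts each absolute deviation.
    """
    boxes = list(boxes)
    freq = width_frequencies(boxes)
    best = {c: counts.most_common(1)[0][0] for c, counts in freq.items()}
    summary, deviations = Counter(), Counter()
    for char, width in boxes:
        dev = abs(width - best[char])
        deviations[dev] += 1
        if dev == 0:
            summary["exact"] += 1
        elif dev <= tolerance:
            summary["in"] += 1
        else:
            summary["out"] += 1
    return summary, deviations


if __name__ == "__main__":
    sample = [("t", 13)] * 5 + [("t", 15), ("t", 16)]
    print(classify_widths(sample))
```

Widths alone are compared here; the same functions can be applied to heights by passing `(char, height)` pairs instead.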

For the text, one page of the Universal Declaration of Human Rights (available in ~500 languages) can be used. Format it with a popular font, export it as PDF, convert the PDF to an image, and run tesseract on it. That's the work needed to get ground truth. Then measure the errors by comparing ground truth, before patch, and after patch.

With the legacy engine only a few characters deviate:

$ tesseract pr_3787.png pr_3787.oem0.psm6.lat.png -l t5data/lat --oem 0 --psm 6 \
	--tessdata-dir  /usr/local/share/tessdata makebox hocr txt pdf

$charboxes: 1706

width frequency per character
[...] # all other characters have only one width
r 15: 88,  16: 3,  17: 2, 
s 15: 156, 
t 13: 107,  15: 10,  14: 7,  12: 3,  16: 3, 
u 24: 138,  25: 5, 
v 24: 20,  25: 1,

width errors
    exact: 1653 (0.9689) # color green
    in   : 50   (0.0293) # within +/-2; color orange
    out  : 3    (0.0018) # 3 't' with width 16; color red

One of the 3 errors (deviation 3 pixels):

[Screenshot 2022-05-02 at 11:02:03]

It would be easy to correct these few remaining errors in a website (import the bboxes as JSON and write the corrections back). Then the resulting bbox file is the ground truth.
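A round trip between a box file and JSON could be sketched as follows in Python. The sketch assumes the usual Tesseract box file layout of one symbol per line, `symbol left bottom right top page`; the JSON field names and the function names are hypothetical and chosen only for this example.

```python
import json

FIELDS = ("left", "bottom", "right", "top", "page")


def parse_box_line(line):
    """Split one box file line into a dict; the symbol may be multi-byte."""
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    entry = {"symbol": symbol}
    entry.update({name: int(value) for name, value in zip(FIELDS, numbers)})
    return entry


def boxes_to_json(text):
    """Convert box file text to a JSON array (hypothetical schema)."""
    entries = [parse_box_line(l) for l in text.splitlines() if l.strip()]
    return json.dumps(entries, ensure_ascii=False, indent=2)


def json_to_boxes(payload):
    """Write (possibly corrected) JSON entries back as box file text."""
    lines = ["{symbol} {left} {bottom} {right} {top} {page}".format(**e)
             for e in json.loads(payload)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    original = "t 10 20 23 40 0\nu 24 20 48 36 0\n"
    print(boxes_to_json(original))
```

A correction tool would edit the coordinates in the JSON and call `json_to_boxes` to produce the ground-truth box file.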

With CTC/LSTM in Tesseract release 5.1.0 it looks like this:

$ tesseract pr_3787.png pr_3787.psm6.png -l deu  --psm 6 \
	--tessdata-dir  /usr/local/share/tessdata makebox hocr txt pdf

width errors
    exact: 1304 (0.7644)
    in   : 89   (0.0522)
    out  : 314  (0.1841)

The same part of the image with CTC/LSTM:

[Screenshot 2022-05-02 at 11:05:33]

May 02 '22 09:05 wollmers

@wollmers wrote

Sorry, mismatched this PR with PR 3787.

Yes, at first I couldn't understand what my rotation fix had to do with your response, but as I have now also run into a bounding-box issue, I'll gladly try your PR to see if it fixes that. I'll first see whether I can get it running satisfactorily with LSTM before reverting to OEM 0 for my bounding boxes. My fix doesn't address bounding boxes of upright (unrotated) text.

Aug 13 '22 17:08 rmast
