Repair boundingbox of individual characters of textangle 90 text

Open rmast opened this issue 2 years ago • 7 comments

Solution to issue #3590 (makebox doesn't output horizontal coordinates of textangle 90 content).

I traced these lines back to 2010. No one has touched them since, but they were the most likely cause of RIL_SYMBOL being excluded from the matrix transformation at textangle 90.

TBOX rotate operations don't seem expensive, so it's unclear why the exclusion for RIL_SYMBOL was ever introduced.
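To illustrate why such a rotation is cheap, here is a minimal Python sketch of rotating a box by a quarter turn. It is not Tesseract's code: the function name `rotate_box`, the tuple layout and the rounding are assumptions chosen for illustration. Rotating only two opposite corners is sufficient for multiples of 90 degrees, which is the textangle 90 case discussed here.

```python
import math

def rotate_box(box, vec):
    """Rotate a (left, bottom, right, top) box about the origin.

    vec is a unit vector (cos, sin) describing the rotation.
    Hypothetical helper for illustration, not a Tesseract API.
    Only valid for multiples of 90 degrees, because just two
    opposite corners are rotated.
    """
    left, bottom, right, top = box
    cos_a, sin_a = vec

    def rotate_point(x, y):
        # Standard 2D rotation, rounded to the nearest integer pixel.
        return (math.floor(x * cos_a - y * sin_a + 0.5),
                math.floor(x * sin_a + y * cos_a + 0.5))

    x1, y1 = rotate_point(left, bottom)
    x2, y2 = rotate_point(right, top)
    # Re-sort the corners so the result is a well-formed box again.
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))

rotated = rotate_box((10, 20, 30, 60), (0, 1))  # quarter turn
```

The whole operation is a handful of multiplications and comparisons per box, so applying it at symbol level should not be costly.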

rmast avatar Oct 18 '21 20:10 rmast

Please remove the two unused code lines and fix the indentation of the remaining line.

Done

~Is that code only used when making boxes, or is it also used in recognition?~

I see this second question has been struck through. I don't know. I ran make check afterward, and once gigabytes of possibly dependent languages had been downloaded and put in place, all checks ran fine. It might be helpful to explicitly require only the needed set of languages in advance of make check.

rmast avatar Oct 28 '21 15:10 rmast

@stweil @zdenop Is there anything that could be done to get this PR merged? I can create a separate PR rebased on top of latest main branch tip, add tests, etc.

I have looked into all call paths that could potentially be affected by the code change. There aren't too many of them in the first place and all of them are within the auxiliary functionality of getting the results of tesseract out, not the recognition itself. As a consequence, the risk of a regression is probably relatively small.

The following is the list of APIs affected:

  • PageIterator::BoundingBoxInternal when level == RIL_SYMBOL
  • PageIterator::BoundingBox when level == RIL_SYMBOL
  • PageIterator::GetImage when level == RIL_SYMBOL
  • PageIterator::GetBinaryImage when level == RIL_SYMBOL
  • TessBaseAPI::GetComponentImages when level == RIL_SYMBOL
  • TessBaseAPI::GetConnectedComponents
  • TessBaseAPI::GetBoxText
  • output of hOCR renderer when hocr_char_boxes is enabled
  • output of BoxText renderer

p12tic avatar Apr 21 '22 22:04 p12tic

I am sorry, but I have minimal spare time for tesseract. The PR seems interesting, but as it affects the API, it should be well tested, including its effect on training.

zdenop avatar Apr 28 '22 12:04 zdenop

Thanks for the response. I will do extensive testing and present results in a way that requires as little time as possible to review.

p12tic avatar Apr 28 '22 13:04 p12tic

Thanks for the response. I will do extensive testing and present results in a way that requires as little time as possible to review.

@p12tic I'm interested in solving the bounding box problem. I will try to write regression tests with automatic measures covering more scripts, languages and fonts. It will take a while, depending on whether I find time in the next weeks or months. The complicated part is creating ground truth with correct bounding boxes.

wollmers avatar Apr 29 '22 12:04 wollmers

@wollmers This is great to hear. Is there any way to help? I could translate very high-level directions into working code :-) For you, answering a small number of questions should take much less time than doing the implementation.

To me it seems that annotating ground truth images with correct bounding boxes is work that is not complicated in principle, but just needs a lot of effort for automation and reviewing. This would be a perfect task for an external developer like me to accomplish.

I'm assuming that you don't want to go the route of rendering text and OCRing the result images back, like when doing LSTM training in certain cases. In this case the character positions are essentially already known. Well, at least that's my understanding which could be completely wrong.

p12tic avatar May 02 '22 00:05 p12tic

@p12tic

To me it seems that annotating ground truth images with correct bounding boxes is work that is not complicated in principle, but just needs a lot of effort for automation and reviewing. This would be a perfect task for an external developer like me to accomplish.

Sorry, mismatched this PR with PR 3787. For 3787 (normal text without rotation) I wrote an approach over the weekend. See ocr-bbox-gt, written in prototype-quality Perl (without the dependencies, not published yet). If you can read it, you can port it to your favourite language, which is maybe Python.

For text angles other than 0 degrees, the text image can be rotated before OCR and the bounding boxes geometrically transformed back. For angles that are not a multiple of 90 degrees, a polygon notation is needed, something like x1,y1 x2,y2 x3,y3.
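A rough Python sketch of that back-transformation, under stated assumptions: boxes are (left, top, right, bottom) in pixel coordinates, the image was rotated about its center onto a canvas that may have a different size, and `angle_deg` is the angle that maps rotated-image coordinates back to the original. The sign convention depends on the image library used, so it needs to be checked against real data. All function names are made up for illustration.

```python
import math

def box_to_polygon(box, angle_deg, src_center, dst_center):
    """Map a box found in the rotated image to a 4-point polygon
    in the original image.

    src_center: center of the rotated image (where the box was found).
    dst_center: center of the original image.
    """
    left, top, right, bottom = box
    corners = [(left, top), (right, top), (right, bottom), (left, bottom)]
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    sx, sy = src_center
    dx, dy = dst_center
    polygon = []
    for x, y in corners:
        rx, ry = x - sx, y - sy  # move the rotation center to the origin
        polygon.append((dx + rx * cos_a - ry * sin_a,
                        dy + rx * sin_a + ry * cos_a))
    return polygon

def polygon_to_box(polygon):
    """Axis-aligned hull; lossless only for multiples of 90 degrees."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return (min(xs), min(ys), max(xs), max(ys))

# Original image 200x100, rotated copy 100x200, box found in the rotated copy.
poly = box_to_polygon((10, 20, 30, 60), 90, (50, 100), (100, 50))
box = polygon_to_box(poly)
```

For multiples of 90 degrees the polygon is itself axis-aligned, so `polygon_to_box` loses nothing. For any other angle the polygon would have to be kept as the result.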

Just use a clean image of text, which has no recognition errors (CER 0.0). That's the case for the sample image in 3787. Then use a legacy model with --oem 0 which provides nearly perfect bounding boxes.

Now we can check the quality as follows:

  • count the width (and height) per character in a first scan of the bboxes
  • select the width with the highest frequency ("best width")
  • compare the best width against the actual width (abs(width1 - width2))
  • count the number of boxes per deviation value
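The steps above could be ported to Python roughly as follows. This is a sketch, not the Perl prototype: the input layout (char, left, bottom, right, top) and the names `width_report`, "exact", "in" and "out" are assumptions, with the tolerance of 2 pixels taken from the report further down.

```python
from collections import Counter, defaultdict

def width_report(boxes, tolerance=2):
    """boxes: list of (char, left, bottom, right, top) tuples.

    Returns the most frequent width per character and a count of
    boxes classified as exact, within tolerance, or outside it.
    """
    # First scan: width frequency per character.
    widths = defaultdict(Counter)
    for char, left, _bottom, right, _top in boxes:
        widths[char][right - left] += 1

    # "Best width" is the width with the highest frequency.
    best = {char: counts.most_common(1)[0][0]
            for char, counts in widths.items()}

    # Second scan: compare each box against the best width.
    report = Counter()
    for char, left, _bottom, right, _top in boxes:
        deviation = abs((right - left) - best[char])
        if deviation == 0:
            report["exact"] += 1
        elif deviation <= tolerance:
            report["in"] += 1
        else:
            report["out"] += 1
    return best, report

sample = [("t", 0, 0, 13, 20), ("t", 20, 0, 33, 20), ("t", 40, 0, 53, 20),
          ("t", 60, 0, 75, 20), ("t", 80, 0, 96, 20)]
best, report = width_report(sample)
```

The same scan can be repeated with top minus bottom to check heights.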

Of course this works only with clean, generated images in one and the same font, style and size. But we want to isolate the problem and reduce it to bbox errors only, and thus exclude all other reasons for errors.

As text, one page of the Human Rights Declaration (available in ~500 languages) can be used. Format it with a popular font, export as PDF, pdftoimage, tesseract. That's the work to get ground truth. Then measure the errors by comparing ground truth, before patch, and after patch.
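A sketch of that last measuring step in Python, assuming both runs produce box files in the usual `symbol left bottom right top page` layout and that the two files line up one-to-one (which requires CER 0.0 on both sides). The function names are made up for illustration.

```python
def parse_box_line(line):
    """Parse one box-file line: symbol left bottom right top page."""
    symbol, left, bottom, right, top, _page = line.rsplit(None, 5)
    return symbol, int(left), int(bottom), int(right), int(top)

def box_deviations(truth_lines, candidate_lines):
    """Largest coordinate difference per character box.

    Assumes both inputs list the same symbols in the same order.
    """
    deviations = []
    for truth_line, candidate_line in zip(truth_lines, candidate_lines):
        truth = parse_box_line(truth_line)
        candidate = parse_box_line(candidate_line)
        if truth[0] != candidate[0]:
            raise ValueError("box files are not aligned")
        deviations.append(max(abs(a - b)
                              for a, b in zip(truth[1:], candidate[1:])))
    return deviations

truth = ["t 100 200 113 230 0", "u 115 200 139 222 0"]
after_patch = ["t 100 200 116 230 0", "u 115 200 139 222 0"]
deviations = box_deviations(truth, after_patch)
```

Running this once for the unpatched build and once for the patched build against the same ground truth gives two deviation lists that can be compared directly.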

With the legacy engine, only a few characters show a deviation:

$ tesseract pr_3787.png pr_3787.oem0.psm6.lat.png -l t5data/lat --oem 0 --psm 6 \
	--tessdata-dir  /usr/local/share/tessdata makebox hocr txt pdf

$charboxes: 1706

width frequency per character
[...] # all other characters have only one width
r 15: 88,  16: 3,  17: 2, 
s 15: 156, 
t 13: 107,  15: 10,  14: 7,  12: 3,  16: 3, 
u 24: 138,  25: 5, 
v 24: 20,  25: 1,

width errors
    exact: 1653 (0.9689) # color green
    in   : 50   (0.0293) # within +/-2; color orange
    out  : 3    (0.0018) # 3 't' with width 16; color red

One of the 3 errors (deviation 3 pixels):

[Image: screenshot 2022-05-02 at 11:02:03]

It would be easy to correct these few remaining errors in a website (import the bboxes as JSON and write the corrections back). Then the resulting bbox file is the ground truth.

With CTC/LSTM Tesseract release 5.1.0 it looks like this:

$ tesseract pr_3787.png pr_3787.psm6.png -l deu  --psm 6 \
	--tessdata-dir  /usr/local/share/tessdata makebox hocr txt pdf

width errors
    exact: 1304 (0.7644)
    in   : 89   (0.0522)
    out  : 314  (0.1841)

The same part of the image with CTC/LSTM:

[Image: screenshot 2022-05-02 at 11:05:33]

wollmers avatar May 02 '22 09:05 wollmers

@wollmers wrote

Sorry, mismatched this PR with PR 3787.

Yes, at first I couldn't understand what my rotation fix had to do with your response, but as I have now also run into a bounding-box issue, I'll be glad to try your PR to see if it fixes that. I'll first see whether I can get it running satisfactorily with LSTM before reverting to OEM 0 for my bounding boxes. My fix doesn't address the bounding boxes of upright text.

rmast avatar Aug 13 '22 17:08 rmast