tesserocr query lines, words, paragraphs, blocks get error no text returned

Hello,

I am trying to get the boxes of all lines, words, paragraphs, blocks and symbols, but on the second call I get the error "No text returned". I have written a method to iterate over all boxes:

    def _boxes(api, element: tesserocr.RIL) -> Iterable[ElementData]:
        l_boxes = api.GetComponentImages(element, text_only=True)
        for i, (im, box, _, _) in enumerate(l_boxes):
            api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
            yield ElementData(
                id=self._uuid,
                index=i,
                text = self._ocr.api.GetUTF8Text().strip(),
                confidence=ConfidenceData(
                    mean=self._ocr.api.MeanTextConf(),
                    token=[TokenConfidenceData(text=i[0], confidence=i[1]) for i in api.MapWordConfidences()]
                ),
                bounding_box=box
            )

ElementData is a dataclass for storing the data, and api is the API reference. I call this function with:

    with PyTessBaseAPI() as api:
        api.SetImage(image)

        for i in _boxes(api, RIL.TEXTLINE):
            ...  # store i in a database

        # >>! at this point I get an error; the first loop works as expected
        for i in _boxes(api, RIL.WORD):
            ...  # store i in a database

        for i in _boxes(api, RIL.PARA):
            ...  # store i in a database

        for i in _boxes(api, RIL.BLOCK):
            ...  # store i in a database

        for i in _boxes(api, RIL.SYMBOL):
            ...  # store i in a database

The input to SetImage is a PIL image of a single page (a JPEG file). The page has a header, a footer with some text, and multiple paragraphs consisting of multiple lines of words. How can I do this?

Feb 24 '21 22:02 flashpixx

I'm not an expert with tesseract's API, but is it possible that the SetRectangle call limits the detection area to that box, so that the next call operates on the last SetRectangle area from the first _boxes call? Just something to look into.

Feb 25 '21 15:02 sirfz

Yes, I agree, but how can I reset the rectangle after each call?

Feb 25 '21 19:02 flashpixx

I have added the following before each loop:

api.SetRectangle(0, 0, *image.size)

image is the Pillow image instance, and size returns the width and height of the image in pixels. This works in general: I get boxes for words, symbols, lines, etc. But the ordering does not seem to be correct. For example, the word boxes are not in the same order as the original text, so although I get all the words, I cannot recreate the original text by concatenating them.

My goal is to get the whole text in different box detail levels

Mar 02 '21 17:03 flashpixx

I also see your use of SetRectangle as the culprit. The API doc says:

Each SetRectangle clears the recognition results so multiple rectangles can be recognized with the same image.

You want to use that function before triggering layout analysis or recognition, not afterwards. Since you are already using the page iterator (via GetComponentImages), you only need to loop over results on all hierarchy levels. (I also recommend restructuring your main loop so that it follows the natural RIL recursion.)
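To illustrate, here is a minimal sketch of looping over the results of a single recognition pass at one level, without calling SetRectangle at all. It assumes that api.Recognize() has already been called once on the full page and that the object returned by api.GetIterator() offers Empty, Next, GetUTF8Text, Confidence and BoundingBox per level. The Element dataclass and the iter_elements and demo names are made up for this example and only stand in for your ElementData.

    ```python
    from dataclasses import dataclass
    from typing import Iterator, Optional, Tuple


    @dataclass
    class Element:
        # Hypothetical stand-in for the ElementData class from the question.
        index: int
        text: str
        confidence: float
        bounding_box: Optional[Tuple[int, int, int, int]]  # (left, top, right, bottom)


    def iter_elements(api, level) -> Iterator[Element]:
        """Yield all elements of one hierarchy level in the iterator's order.

        Assumes api.SetImage(...) and api.Recognize() were called beforehand,
        and that SetRectangle is not called in between (it clears the results).
        """
        it = api.GetIterator()  # fresh iterator positioned at the start of the page
        if it is None:
            return
        index = 0
        while True:
            if not it.Empty(level):
                yield Element(
                    index=index,
                    text=(it.GetUTF8Text(level) or "").strip(),
                    confidence=it.Confidence(level),
                    bounding_box=it.BoundingBox(level),
                )
                index += 1
            if not it.Next(level):  # advance to the next element of this level
                break


    def demo(path: str) -> None:
        # Imported lazily so that iter_elements itself stays free of dependencies.
        from tesserocr import PyTessBaseAPI, RIL

        with PyTessBaseAPI() as api:
            api.SetImageFile(path)
            api.Recognize()  # one recognition pass for the whole page
            for level in (RIL.BLOCK, RIL.PARA, RIL.TEXTLINE, RIL.WORD, RIL.SYMBOL):
                for element in iter_elements(api, level):
                    print(level, element)
    ```

The manual Next loop should be equivalent to the iterate_level helper used in the tesserocr README. Since every level is read from the same recognition result, the order of the words should match the order of the lines and paragraphs that contain them.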

The use case for SetRectangle across levels is if you have an external segmentation of the image into regions, paragraphs, lines or words. (Which would also entail using SetPageSegMode(PSM.SINGLE_COLUMN) / SINGLE_BLOCK / SINGLE_LINE / SINGLE_WORD.)
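For that external-segmentation case, a rough sketch could look like the following. The ocr_regions name and the Box alias are made up for this example; the boxes are assumed to come from your own segmentation as dicts with 'x', 'y', 'w' and 'h' keys, and psm is whichever PSM constant matches the granularity of those boxes (for example PSM.SINGLE_LINE for line boxes).

    ```python
    from typing import Dict, Iterable, List, Tuple

    Box = Dict[str, int]  # keys 'x', 'y', 'w', 'h' (assumed external segmentation)


    def ocr_regions(api, boxes: Iterable[Box], psm) -> List[Tuple[Box, str]]:
        """Recognize each externally supplied region separately.

        Assumes api.SetImage(...) was called beforehand. Each SetRectangle call
        discards the previous results, so the text is read immediately after it.
        """
        api.SetPageSegMode(psm)  # tell Tesseract what a single region contains
        results = []
        for box in boxes:
            api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
            results.append((box, api.GetUTF8Text().strip()))
        return results
    ```

Here the order of the output is simply the order of the boxes you pass in, so any reading order has to come from the external segmentation.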

Jul 02 '21 18:07 bertsky

@sirfz, again, the problem is already in the usage example of the current README:

https://github.com/sirfz/tesserocr/blob/711cbab544dbb4bd3dcf1f13aad9d0fef20fcac7/README.rst#L181-L187

Jul 02 '21 22:07 bertsky