Integer division by zero in TabVector::Evaluate

Open nullpointersetc opened this issue 2 years ago • 9 comments

Environment

  • Tesseract Version: 4.1.0
  • Commit Number: 5280bbcade4e2dec5eef439a6e189504c2eadcd9
  • Platform: Windows 10, 64-bit, Version 21H1 (OS Build 19043.1526)

Current Behavior:

On a certain image, an integer division-by-zero exception occurs and the OCR program using Tesseract as a library is terminated.

We have determined that the problem is in method TabVector::Evaluate in src/textord/tabvector.cpp, and specifically in this section of the code:

  // If there has been a good box, adjust the end.
  if (prev_good_box != nullptr) {
    SetYEnd(prev_good_box->top());
    // Compute the percentage of the vector that is occupied by good boxes.
    int length = endpt_.y() - startpt_.y();
    percent_score_ = 100 * good_length / length;
    if (num_deleted_boxes > 0) {
      needs_refit_ = true;
      FitAndEvaluateIfNeeded(vertical, finder);
      if (boxes_.empty())
        return;
    }
    ...
}

There is no validation before the assignment to percent_score_ that length is not zero (i.e., that endpt_.y() does not equal startpt_.y()).

Expected Behavior:

The integer division is not attempted and the process does not abort.

Suggested Fix:

    percent_score_ = length == 0 ? 0 : 100 * good_length / length;
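The guard can be tried in isolation with a minimal standalone sketch. The helper names `GuardedPercentScore` and `CheckGuardedPercentScore` are hypothetical and are not part of Tesseract; the sketch only mirrors the arithmetic of the one-line fix above.

```cpp
#include <cassert>

// Hypothetical standalone helper, not part of Tesseract: returns the
// percentage of `length` covered by `good_length`, or 0 when the vector
// is degenerate (length == 0), so that no division is attempted.
int GuardedPercentScore(int good_length, int length) {
  if (length == 0) {
    return 0;  // A point has no extent, so report no coverage.
  }
  return 100 * good_length / length;
}

// Small self-check of the helper above.
void CheckGuardedPercentScore() {
  assert(GuardedPercentScore(50, 200) == 25);  // Normal case.
  assert(GuardedPercentScore(50, 0) == 0);     // Degenerate case, no trap.
}
```

Whether 0 is the right score for a degenerate vector is a design choice; the sketch only shows that the division is skipped.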

nullpointersetc avatar Feb 25 '22 15:02 nullpointersetc

Could you please provide an image which triggers this division by zero?

It does not make sense to simply add a check for the division. First we have to analyse why this function is called with endpt_.y() == startpt_.y() (so it is a point, not a vector).

stweil avatar Feb 25 '22 15:02 stweil

> Could you please provide an image which triggers this division by zero?
>
> It does not make sense to simply add a check for the division. First we have to analyse why this function is called with endpt_.y() == startpt_.y() (so it is a point, not a vector).

In case of a point the length should be 1.
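One possible reading of this suggestion, as a sketch only: substitute a length of 1 when the two endpoints share the same y coordinate. The helper name `PercentScoreMinLengthOne` is hypothetical and does not exist in Tesseract.

```cpp
#include <cassert>

// Hypothetical helper, not part of Tesseract: treats a zero-length
// vector (a point) as having length 1 before dividing.
int PercentScoreMinLengthOne(int good_length, int length) {
  const int safe_length = (length == 0) ? 1 : length;
  return 100 * good_length / safe_length;
}
```

This differs from returning 0 only when `good_length` is non-zero for a zero-length vector; in that case the result can exceed 100.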

wollmers avatar Feb 28 '22 11:02 wollmers

I currently don't have an image that I can give you.

nullpointersetc avatar Feb 28 '22 19:02 nullpointersetc

@nullpointersetc, maybe some part of an image which can be published is sufficient to trigger the issue, or you can send me a confidential image by e-mail. I am afraid that we have to close the issue without a fix if there is no test case.

stweil avatar Mar 12 '22 08:03 stweil

I don't know how to construct such an image.

For example, if I try to construct such an image with this text: Fury_Road_2

I get back an image size of 2548 x 3298. The original image was 1019 x 1319 at 120 DPI, so the difference (a factor of about 2.5 in each dimension) may be explained by rescaling.

TabVector::Evaluate is called only four times for this image. At the if statement I indicated, the values are:

  1. startpt_={xcoord=234 ycoord=971}, endpt_={xcoord=234 ycoord=3026}, and prev_good_box={bot_left={xcoord=237 ycoord=2994} top_right={xcoord=272 ycoord=3026}}

  2. startpt_={xcoord=2270 ycoord=961}, endpt_={xcoord=2270 ycoord=3016}, and prev_good_box={bot_left={xcoord=2147 ycoord=2994} top_right={xcoord=2167 ycoord=3016}}

  3. startpt_={xcoord=234 ycoord=971}, endpt_={xcoord=234 ycoord=3026}, and prev_good_box={bot_left={xcoord=237 ycoord=2994} top_right={xcoord=272 ycoord=3026}}

  4. startpt_={xcoord=2270 ycoord=961}, endpt_={xcoord=2270 ycoord=3016}, and prev_good_box={bot_left={xcoord=2147 ycoord=2994} top_right={xcoord=2167 ycoord=3016}}

I DO NOT know how to interpret these numbers. I would have assumed that these are numbers of pixels from the top-left of the image, but the startpt_ and endpt_ values all seem to refer to a vertical region of the image that is one pixel wide and consists of only white pixels, while the good boxes also appear to be all white pixels. Am I on the right path in trying to come up with an image?
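As a sanity check on the values listed above, the following sketch recomputes the `length` that the quoted code would derive from each call. Every name in it is hypothetical and none is a Tesseract type. In every reported box, top_right has a larger ycoord than bot_left, which suggests that y increases upward. `RowFromTop` therefore rests on the unverified assumption that y is measured from the bottom edge of the 3298-pixel-high image.

```cpp
#include <cassert>

// Hypothetical stand-in for the reported values; not a Tesseract type.
struct ReportedCall {
  int start_y;  // startpt_ ycoord
  int end_y;    // endpt_ ycoord
};

// The four calls reported above (calls 3 and 4 repeat calls 1 and 2).
constexpr ReportedCall kReportedCalls[] = {
    {971, 3026}, {961, 3016}, {971, 3026}, {961, 3016}};

// Mirrors `int length = endpt_.y() - startpt_.y();` from the issue.
constexpr int VectorLength(const ReportedCall& call) {
  return call.end_y - call.start_y;
}

// ASSUMPTION (not confirmed in this thread): y is measured upward from
// the bottom edge, so the row counted from the top is height - y.
constexpr int RowFromTop(int y, int image_height) {
  return image_height - y;
}
```

All four calls give a length of 2055, so none of them would reach the division by zero. Under the stated assumption, y = 3026 would correspond to row 272 from the top and y = 971 to row 2327.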

nullpointersetc avatar Mar 31 '22 22:03 nullpointersetc

Did you try to use version 5.1.0 with the same image?

amitdo avatar Jun 12 '22 17:06 amitdo

At the beginning of this method, length == 0 is checked as part of a condition.

https://github.com/tesseract-ocr/tesseract/blob/76faf1600643f45f22555dcbc5d39e93f96825d6/src/textord/tabvector.cpp#L580-L589

amitdo avatar Jun 12 '22 17:06 amitdo

I don't expect that 5.1.0 or our latest code fixed this issue. @nullpointersetc, it would really help if you could provide an image which triggers the bug. You can send it to my personal e-mail address, and I will keep it private.

stweil avatar Jun 15 '22 05:06 stweil

@nullpointersetc, it would also be interesting to know whether the same bug occurs on Linux or macOS. Could you please test it (that's also possible on Windows with WSL)?

stweil avatar Jun 15 '22 05:06 stweil