gcv2hocr doesn't rectify negative coordinates in GCV API response

Open SoloSynth1 opened this issue 3 years ago • 0 comments

According to the hOCR standard (the latest is v1.2 as of March 2021), the bbox property takes uint values, so every coordinate must be unsigned. (http://kba.cloud/hocr-spec/1.2/#propdef-bbox)

However, the textAnnotation API response from GCV can contain negative coordinates for some out-of-bounds boxes, as in the example below:

{
  "description": "2-3/4300/62",
  "boundingPoly": {
    "vertices": [
      {
        "x": 4727,
        "y": -1
      },
      {
        "x": 4927,
        "y": 0
      },
      {
        "x": 4927,
        "y": 44
      },
      {
        "x": 4727,
        "y": 43
      }
    ],
    "normalizedVertices": []
  },
  "mid": "",
  "locale": "",
  "score": 0,
  "confidence": 0,
  "topicality": 0,
  "locations": [],
  "properties": []
}

In the current gcv2hocr script, such cases are written to the .hocr file without rectification, resulting in lines like this:

<span class='ocr_line' id='line_1_2' title="bbox 4727 -2 4927 44 ; baseline 0 -5; x_size 89; x_descenders 20; x_ascenders 21"><span class='ocrx_word' id='word_1_2' title='bbox 4727 -2 4927 44 ; x_wconf 85' lang='eng' dir='ltr'>  2-3/4300/62  </span>

This causes hocr-pdf to fail when it tries to parse the invalid ocr_line. hocr-pdf seems to work just fine if its parsing regex is altered to accept such values, but it would be great if the script could apply some form of rectification to negative values so that its output adheres to the current hOCR standard. Thanks!
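One possible form of rectification is to clamp each coordinate to zero before the bbox is written. The snippet below is only a rough sketch of that idea, not the actual gcv2hocr code: `clamp_bbox` is a hypothetical helper name, and it assumes the vertices have already been loaded from the GCV JSON as a list of dicts. Treating a missing `x` or `y` key as 0 is also an assumption made for safety.

```python
def clamp_bbox(vertices):
    """Return (x0, y0, x1, y1) for a GCV boundingPoly, with negatives clamped to 0.

    Hypothetical helper for illustration; not part of gcv2hocr.
    """
    # Assumption: a vertex that lacks "x" or "y" is treated as 0.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    # Axis-aligned box around the four vertices.
    box = (min(xs), min(ys), max(xs), max(ys))
    # hOCR bbox values must be unsigned, so pull anything below 0 up to 0.
    return tuple(max(0, c) for c in box)


# Vertices taken from the boundingPoly in the example response above.
vertices = [
    {"x": 4727, "y": -1},
    {"x": 4927, "y": 0},
    {"x": 4927, "y": 44},
    {"x": 4727, "y": 43},
]

title = "bbox {} {} {} {}".format(*clamp_bbox(vertices))
print(title)  # bbox 4727 0 4927 44
```

Clamping only shaves off the part of the box that lies outside the image, so the visible part of the word keeps its position.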

SoloSynth1, Mar 24 '21 09:03