gcv2hocr
gcv2hocr doesn't rectify negative coordinates in GCV API response
According to the hOCR standard (latest is v1.2 as of March 2021), the values of the bbox property are of type uint, which means all coordinates must be unsigned (http://kba.cloud/hocr-spec/1.2/#propdef-bbox).
However, the textAnnotation API response from GCV can contain negative coordinates for boxes that extend out of bounds, as in the example below:
{
"description": "2-3/4300/62",
"boundingPoly": {
"vertices": [
{
"x": 4727,
"y": -1
},
{
"x": 4927,
"y": 0
},
{
"x": 4927,
"y": 44
},
{
"x": 4727,
"y": 43
}
],
"normalizedVertices": []
},
"mid": "",
"locale": "",
"score": 0,
"confidence": 0,
"topicality": 0,
"locations": [],
"properties": []
}
The current gcv2hocr script writes such a case into the .hocr file without rectification, resulting in lines like this:
<span class='ocr_line' id='line_1_2' title="bbox 4727 -2 4927 44 ; baseline 0 -5; x_size 89; x_descenders 20; x_ascenders 21"><span class='ocrx_word' id='word_1_2' title='bbox 4727 -2 4927 44 ; x_wconf 85' lang='eng' dir='ltr'> 2-3/4300/62 </span>
This causes hocr-pdf to raise an error when it tries to parse this illegal ocr_line.
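To illustrate the failure mode, here is a minimal sketch of a parser that accepts only unsigned bbox values. The pattern `UNSIGNED_BBOX` and the helper `parse_bbox` are hypothetical names written for this report; they are not copied from hocr-pdf, whose actual regex may differ.

```python
import re

# Hypothetical pattern: matches four unsigned integers after "bbox",
# which is what the hOCR bbox grammar allows.
UNSIGNED_BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title):
    """Return the four bbox integers, or None if no unsigned bbox is found."""
    match = UNSIGNED_BBOX.search(title)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


# Title taken from the hOCR line above (negative y0) and a rectified variant.
bad_title = "bbox 4727 -2 4927 44 ; baseline 0 -5; x_size 89"
good_title = "bbox 4727 0 4927 44 ; baseline 0 -5; x_size 89"

print(parse_bbox(bad_title))   # None
print(parse_bbox(good_title))  # (4727, 0, 4927, 44)
```

A caller that assumes a match always exists would fail on the first title, which is consistent with the error described above.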
While hocr-pdf seems to work just fine after altering its parsing regex rule, it would be great if the script could implement some form of rectification for negative values in order to adhere to the current hOCR standard. Thanks!
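One possible form of rectification is to clamp every coordinate to zero before computing the bbox. The sketch below is only an illustration under that assumption: `vertices_to_bbox` is a hypothetical helper, not a function in gcv2hocr, and it assumes the vertices arrive as a list of dicts shaped like the JSON above. It also treats a missing `x` or `y` key as 0, on the assumption that the API may omit zero-valued fields.

```python
def vertices_to_bbox(vertices):
    """Convert GCV boundingPoly vertices to an hOCR-style (x0, y0, x1, y1) bbox.

    Negative coordinates are clamped to 0 so every value is unsigned.
    A missing "x" or "y" key is treated as 0 (assumption).
    """
    xs = [max(0, vertex.get("x", 0)) for vertex in vertices]
    ys = [max(0, vertex.get("y", 0)) for vertex in vertices]
    return min(xs), min(ys), max(xs), max(ys)


# Vertices from the example response above.
vertices = [
    {"x": 4727, "y": -1},
    {"x": 4927, "y": 0},
    {"x": 4927, "y": 44},
    {"x": 4727, "y": 43},
]

print("bbox %d %d %d %d" % vertices_to_bbox(vertices))  # bbox 4727 0 4927 44
```

Clamping keeps the box inside the page and changes nothing for in-bounds boxes, so it should be safe to apply to every word and line before the title attribute is written.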