Transformers-Tutorials

Normalize non-Tesseract OCR bounding box

acul3 opened this issue · 1 comment

Hi @NielsRogge

I'm trying to use an external OCR engine (PaddleOCR or Google Vision) for processing with LayoutLMv2.

The docs here state that you need to normalize each word's bounding box into (x0, y0, x1, y1) format, where (x0, y0) is the position of the upper-left corner and (x1, y1) the lower-right corner of the bounding box.
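For reference, the normalization from the docs maps pixel coordinates into LayoutLM's 0-1000 range. A minimal sketch (assuming you already have a pixel-space (x0, y0, x1, y1) box and the page size in pixels):

def normalize_box(box, width, height):
    # box is (x0, y0, x1, y1) in pixels; width/height are the page dimensions
    return [
        int(1000 * box[0] / width),
        int(1000 * box[1] / height),
        int(1000 * box[2] / width),
        int(1000 * box[3] / height),
    ]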

When I use Google Vision, the bounding box I get back looks like this:

{
  'property': {
    'detectedLanguages': [{
      'languageCode': 'it'
    }],
    'detectedBreak': {
      'type': 'SPACE'
    }
  },
  'boundingBox': {
    'vertices': [{
      'x': 197,
      'y': 56
    }, {
      'x': 268,
      'y': 59
    }, {
      'x': 263,
      'y': 167
    }, {
      'x': 192,
      'y': 164
    }]
  },
  'text': 'Some text here',
  'confidence': 0.9900000095367432
}

where vertices gives four (x, y) points (one per corner) instead of the two corners needed for the (x0, y0, x1, y1) format.

Do you know how I can normalize it or process it with the LayoutLM processor?

Thank you

acul3 commented on Jun 18, 2022

As you mentioned, the LayoutLM models expect an axis-aligned bounding box, while this example contains a slightly tilted rectangle. If you want to use this kind of output as input to the LayoutLM models, you will need to parse the object that the Google API returns into the (x0, y0, x1, y1) format that LayoutLM expects. A first approach would be to find the smallest enclosing axis-aligned rectangle for your four vertices, then take the corners of that rectangle and convert them into the (x0, y0, x1, y1) format.
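A minimal sketch of that approach (hypothetical helper name, assuming the vertices list from the Google Vision response):

def vertices_to_box(vertices):
    # Google Vision may omit a coordinate when it is 0, hence .get with a default
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    # smallest enclosing axis-aligned rectangle: (x0, y0, x1, y1)
    return (min(xs), min(ys), max(xs), max(ys))

# example with the vertices from the issue above
box = vertices_to_box([
    {"x": 197, "y": 56},
    {"x": 268, "y": 59},
    {"x": 263, "y": 167},
    {"x": 192, "y": 164},
])  # -> (192, 56, 268, 167)

The resulting (x0, y0, x1, y1) box can then be normalized into the 0-1000 range (for example with the normalize_box sketch above) before being passed to the LayoutLM processor.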

LucaMalagutti commented on Jun 29, 2022