Normalize non-Tesseract OCR bounding box
Hi @NielsRogge
I'm trying to use an external OCR engine (PaddleOCR or Google Vision) for processing with LayoutLMv2.
The docs state that you need to normalize each word's bounding box into the (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner of the bounding box.
When I use Google Vision, I get a bounding box that looks like this:
{
  'property': {
    'detectedLanguages': [{'languageCode': 'it'}],
    'detectedBreak': {'type': 'SPACE'}
  },
  'boundingBox': {
    'vertices': [
      {'x': 197, 'y': 56},
      {'x': 268, 'y': 59},
      {'x': 263, 'y': 167},
      {'x': 192, 'y': 164}
    ]
  },
  'text': 'Some text here',
  'confidence': 0.9900000095367432
}
where `vertices` gives four (x, y) points, i.e. two more coordinate pairs than the (x0, y0, x1, y1) format expects.
Do you know how I can normalize it or process it with the LayoutLM processor?
Thank you
As you mentioned, the LayoutLM models expect an axis-aligned bounding box, while this example contains a slightly tilted rectangle. To use this kind of output as input for the LayoutLM models, you will need to parse the object the Google API returns into the (x0, y0, x1, y1) format that LayoutLM expects. A first approach is to find the smallest enclosing axis-aligned rectangle for your 4 vertices: take the minimum x and y over all vertices as (x0, y0) and the maximum x and y as (x1, y1).
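A minimal sketch of that approach, assuming you know the page width and height (the scaling to the 0-1000 range follows the LayoutLM docs; the function names here are just illustrative):

```python
def to_axis_aligned(vertices):
    """Smallest enclosing axis-aligned rectangle for a list of {x, y} vertices."""
    # The Vision API may omit a coordinate when its value is 0, hence .get()
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return (min(xs), min(ys), max(xs), max(ys))

def normalize_box(box, width, height):
    """Scale an (x0, y0, x1, y1) box in pixels to the 0-1000 range LayoutLM uses."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

# The tilted rectangle from your example:
vertices = [{"x": 197, "y": 56}, {"x": 268, "y": 59},
            {"x": 263, "y": 167}, {"x": 192, "y": 164}]
box = to_axis_aligned(vertices)  # (192, 56, 268, 167)
# Page dimensions are an assumption here; use your actual image size
print(normalize_box(box, width=1654, height=2339))
```

The resulting normalized boxes can then be passed directly as the `boxes` argument of `LayoutLMv2Processor`, alongside the corresponding words, with `apply_ocr=False` on the feature extractor so it does not run Tesseract itself.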