Access Non-Axis-Aligned Bounding Boxes

Open zkalson opened this issue 1 year ago • 2 comments

Hi all,

As I understand it, Textract returns both an axis-aligned BoundingBox object and a Polygon object made up of finer-grained points (https://docs.aws.amazon.com/textract/latest/dg/text-location.html). Textractor appears to expose only the BoundingBox object.

When documents contain significant skew or rotation, axis-aligned boxes will be much larger than non-axis-aligned boxes, and they won't neatly match up with the actual position of the text.
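To put a rough number on this, here is a minimal sketch that rotates a thin, text-line-shaped rectangle and compares its area with the area of its axis-aligned bounding box. It does not use Textract or Textractor; the rectangle dimensions and the 15 degree angle are made up for illustration.

```python
import math

def rotate_rect(width, height, angle_deg):
    """Corners of a width x height rectangle centred at the origin, rotated by angle_deg."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    corners = [(-width / 2, -height / 2), (width / 2, -height / 2),
               (width / 2, height / 2), (-width / 2, height / 2)]
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in corners]

def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2

def aabb_area(points):
    """Area of the smallest axis-aligned box that contains all points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

# A line of text that is 0.5 wide and 0.02 tall (normalized units), skewed by 15 degrees.
line = rotate_rect(width=0.5, height=0.02, angle_deg=15)
ratio = aabb_area(line) / polygon_area(line)
print(f"AABB is {ratio:.1f}x the area of the rotated box")
```

For a long, thin line like this, even a modest skew makes the axis-aligned box several times larger than the text it encloses.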

I've attached an example input document, an output text layer using Textractor results, and an output text layer from a different OCR inference that provided non-axis-aligned bounding boxes to hopefully make this easy to visualize.

input_document.pdf text_layer_non-aabb.pdf text_layer_textractor_aabb.pdf

Is it possible to add the Polygon object in Textractor? It would be a big help!

zkalson, Apr 17 '24 02:04

As a temporary workaround, I am taking the id field from the word/line and looking up the associated polygon in Document.response.
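A minimal sketch of that kind of lookup, assuming the response is a single dict with a top-level "Blocks" list in the Textract JSON shape (each block carrying "Id" and "Geometry" with a "Polygon" list of "X"/"Y" points). The helper name `build_polygon_index` and the sample data are hypothetical, not part of Textractor.

```python
def build_polygon_index(response):
    """Map each block Id to its Geometry.Polygon (a list of {"X": ..., "Y": ...} dicts)."""
    index = {}
    for block in response.get("Blocks", []):
        polygon = block.get("Geometry", {}).get("Polygon")
        if polygon is not None:
            index[block["Id"]] = polygon
    return index

# Hand-written fragment shaped like a Textract response; all values are made up.
sample_response = {
    "Blocks": [
        {
            "BlockType": "WORD",
            "Id": "word-1",
            "Text": "Hello",
            "Geometry": {
                "BoundingBox": {"Left": 0.10, "Top": 0.18, "Width": 0.22, "Height": 0.09},
                "Polygon": [
                    {"X": 0.10, "Y": 0.20},
                    {"X": 0.31, "Y": 0.18},
                    {"X": 0.32, "Y": 0.25},
                    {"X": 0.11, "Y": 0.27},
                ],
            },
        }
    ]
}

polygons = build_polygon_index(sample_response)
corners = polygons["word-1"]  # four corner points of the word, in page-relative units
```

With Textractor this would be roughly `build_polygon_index(document.response)[word.id]`. Building the index once avoids rescanning every block for each word.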

zkalson, Apr 17 '24 03:04

You can use the word's or line's raw_object member to get the polygon without doing an id-based lookup.

https://github.com/aws-samples/amazon-textract-textractor/blob/master/textractor/parsers/response_parser.py#L226
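As a rough sketch of what reading the polygon from raw_object could look like, assuming raw_object holds the original Textract block dict (with "Geometry" and "Polygon" keys). The helper name `polygon_from_raw_object` is hypothetical, and a `SimpleNamespace` stands in for a real Textractor word so the snippet runs without AWS access.

```python
from types import SimpleNamespace

def polygon_from_raw_object(entity):
    """Return the entity's polygon as a list of (x, y) tuples, or [] if unavailable."""
    raw = getattr(entity, "raw_object", None) or {}
    points = raw.get("Geometry", {}).get("Polygon", [])
    return [(p["X"], p["Y"]) for p in points]

# Stand-in for a Textractor word; the block contents are made up.
stand_in_word = SimpleNamespace(
    raw_object={
        "BlockType": "WORD",
        "Id": "word-1",
        "Geometry": {
            "Polygon": [
                {"X": 0.10, "Y": 0.20},
                {"X": 0.31, "Y": 0.18},
                {"X": 0.32, "Y": 0.25},
                {"X": 0.11, "Y": 0.27},
            ]
        },
    }
)

corners = polygon_from_raw_object(stand_in_word)
```

The coordinates are ratios of the page width and height, so they need to be scaled by the page dimensions before drawing.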

In the future we would definitely like to support Polygon objects, but it will require some work as a lot of the code is tightly coupled with the BoundingBox object.

Belval, May 06 '24 14:05