amazon-textract-textractor
Add Page to DocumentEntity
Issue #, if available: #170
Description of changes: The original issue was that word and line bounding boxes were shifted in some cases when the page width or page height was not 1 (100%). The visualizer used the page width/height as the reference for the relative coordinates of document entities such as Word/Line, but the Textract API returns bounding boxes relative to the image size, not the page size.
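To illustrate the coordinate handling described above, here is a minimal sketch, not the library's actual code. `RelativeBox` and `to_pixel_box` are hypothetical names introduced for this example; the only assumption taken from the description is that box coordinates are fractions of the image size.

```python
from dataclasses import dataclass


@dataclass
class RelativeBox:
    # Hypothetical container: all fields are fractions of the image size (0 to 1).
    x: float
    y: float
    width: float
    height: float


def to_pixel_box(box, image_width, image_height):
    """Scale a relative box by the image size in pixels (not the page size)."""
    left = box.x * image_width
    top = box.y * image_height
    right = (box.x + box.width) * image_width
    bottom = (box.y + box.height) * image_height
    return (round(left), round(top), round(right), round(bottom))


box = RelativeBox(x=0.25, y=0.5, width=0.5, height=0.25)
print(to_pixel_box(box, image_width=1000, image_height=800))
# → (250, 400, 750, 600)
```

Scaling the same box by a page width or height other than the image dimensions would move every corner, which matches the shift reported in the issue.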
This PR fixes the underlying issue and also reworks the DocumentEntity object to give it .page and .page_id properties, removing the code duplication across all DocumentEntities.
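As a rough sketch of the refactoring idea (an assumption about its shape, not the code in this PR), the page properties can live once on the base class so that each entity inherits them. The constructors and the simplified `Word` and `Line` classes below are hypothetical.

```python
class DocumentEntity:
    """Simplified base class holding the page information for every entity."""

    def __init__(self, entity_id):
        self.id = entity_id
        self._page = None      # page number, set once the entity is attached to a page
        self._page_id = None   # identifier of the page the entity belongs to

    @property
    def page(self):
        return self._page

    @page.setter
    def page(self, page_num):
        self._page = page_num

    @property
    def page_id(self):
        return self._page_id

    @page_id.setter
    def page_id(self, page_id):
        self._page_id = page_id


class Word(DocumentEntity):
    # No page/page_id code here: both properties come from DocumentEntity.
    def __init__(self, entity_id, text):
        super().__init__(entity_id)
        self.text = text


class Line(DocumentEntity):
    def __init__(self, entity_id, words):
        super().__init__(entity_id)
        self.words = words


word = Word("w1", "hello")
word.page = 2
word.page_id = "page-2"
line = Line("l1", [word])
line.page = word.page
```

With this layout, a change to how pages are tracked only has to be made in one place.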
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.