amazon-textract-textractor
`get_text_from_layout_json` throws `'NoneType' object is not subscriptable` for a specific PDF
Unfortunately, I can't share the specific PDF for privacy reasons, but this line seems to be the cause: https://github.com/aws-samples/amazon-textract-textractor/blob/9fb7d2286a12cadcaf43ae41ef7806591415b079/prettyprinter/textractprettyprinter/t_pretty_print_layout.py#L173

It might also be an issue with Textract's output itself rather than with this library's parsing. The issue seems isolated to one specific PDF; other PDFs work fine.

Notes: it seems to be related to the metadata or structure of the file itself. Running it multiple times, changing the orientation, and deleting pages did not fix the issue.
Is this an error that anyone else has encountered or figured out a resolution for?
Found the specific issue: it is possible for a `LAYOUT_FIGURE` block to have `"Relationships": null`, which breaks this function.
Example:
{"BlockType":"LAYOUT_FIGURE","ColumnIndex":null,"ColumnSpan":null,"Confidence":94.62890625,"EntityTypes":null,"Geometry":{"BoundingBox":{"Height":0.04673086851835251,"Left":0.06788529455661774,"Top":0.8822278380393982,"Width":0.4904918074607849},"Polygon":[{"X":0.06790152192115784,"Y":0.8822278380393982},{"X":0.5583770871162415,"Y":0.8828750252723694},{"X":0.558368444442749,"Y":0.9289587140083313},{"X":0.06788529455661774,"Y":0.9283040761947632}]},"Hint":null,"Id":"4859938e-4c4a-46bb-b40c-34d93486b824","Page":1,"PageClassification":null,"Query":null,"Relationships":null,"RowIndex":null,"RowSpan":null,"SelectionStatus":null,"Text":null,"TextType":null},
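A possible workaround until the library guards against this case: preprocess the Textract response so that layout blocks never carry a null `Relationships` field. The sketch below is not part of `textractprettyprinter`; the helper names `find_null_layout_relationships` and `normalize_layout_relationships` are made up for illustration, and it assumes the response is a dict with a top-level `Blocks` list. I have not verified that an empty list is enough for the line linked above, so treat it as a starting point rather than a confirmed fix.

```python
import copy


def find_null_layout_relationships(textract_json):
    """Return (Id, BlockType) for each LAYOUT_* block whose Relationships is null or missing."""
    return [
        (block.get("Id"), block.get("BlockType"))
        for block in textract_json.get("Blocks", [])
        if str(block.get("BlockType", "")).startswith("LAYOUT")
        and block.get("Relationships") is None
    ]


def normalize_layout_relationships(textract_json):
    """Return a deep copy in which null Relationships on LAYOUT_* blocks become [].

    Assumption: downstream code iterates over Relationships, so an empty
    list is harmless. Non-layout blocks are left untouched.
    """
    cleaned = copy.deepcopy(textract_json)
    for block in cleaned.get("Blocks", []):
        is_layout = str(block.get("BlockType", "")).startswith("LAYOUT")
        if is_layout and block.get("Relationships") is None:
            block["Relationships"] = []
    return cleaned


# Minimal response containing the problematic block from the example above.
sample = {
    "Blocks": [
        {
            "BlockType": "LAYOUT_FIGURE",
            "Id": "4859938e-4c4a-46bb-b40c-34d93486b824",
            "Page": 1,
            "Relationships": None,
        },
        {"BlockType": "WORD", "Id": "w1", "Page": 1, "Relationships": None},
    ]
}

offenders = find_null_layout_relationships(sample)
cleaned = normalize_layout_relationships(sample)
```

Running `find_null_layout_relationships` first shows which blocks are affected; the dict returned by `normalize_layout_relationships` can then be passed to `get_text_from_layout_json` in place of the raw response.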