`get_text_from_layout_json` throws `'NoneType' object is not subscriptable` for a specific PDF

Open neil-sola opened this issue 1 year ago • 1 comments

`get_text_from_layout_json` throws `'NoneType' object is not subscriptable` for a specific PDF.

Unfortunately, I can't share the specific PDF for privacy reasons, but this line seems to be the cause: https://github.com/aws-samples/amazon-textract-textractor/blob/9fb7d2286a12cadcaf43ae41ef7806591415b079/prettyprinter/textractprettyprinter/t_pretty_print_layout.py#L173

This might also be an issue with Textract's output itself rather than with this library's parsing. The issue seems isolated to one specific PDF; other PDFs work fine.

Notes: it seems related to the metadata or structure of the file itself. Running it multiple times, changing the orientation, and deleting pages do not fix the issue.

Is this an error that anyone else has encountered or figured out a resolution for?

neil-sola avatar Dec 02 '24 18:12 neil-sola

Found the specific issue: it is possible for a `LAYOUT_FIGURE` block to have `"Relationships": null`, which breaks this function.

Example:

{"BlockType":"LAYOUT_FIGURE","ColumnIndex":null,"ColumnSpan":null,"Confidence":94.62890625,"EntityTypes":null,"Geometry":{"BoundingBox":{"Height":0.04673086851835251,"Left":0.06788529455661774,"Top":0.8822278380393982,"Width":0.4904918074607849},"Polygon":[{"X":0.06790152192115784,"Y":0.8822278380393982},{"X":0.5583770871162415,"Y":0.8828750252723694},{"X":0.558368444442749,"Y":0.9289587140083313},{"X":0.06788529455661774,"Y":0.9283040761947632}]},"Hint":null,"Id":"4859938e-4c4a-46bb-b40c-34d93486b824","Page":1,"PageClassification":null,"Query":null,"Relationships":null,"RowIndex":null,"RowSpan":null,"SelectionStatus":null,"Text":null,"TextType":null},

neil-sola avatar Dec 03 '24 23:12 neil-sola
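
For anyone hitting the same error, here is a minimal sketch of a possible workaround, not a confirmed fix. It pre-processes the Textract JSON before it is handed to `get_text_from_layout_json`. The helper names `layout_blocks_with_null_relationships` and `normalize_layout_relationships` are hypothetical and not part of the library. The sample response is trimmed down to the fields that matter, and the second block's IDs are made up for illustration. It also assumes the library tolerates an empty `Relationships` list, which I have not verified. If it does not, dropping the affected blocks instead may be the safer option.

```python
import copy


def _is_layout_block(block: dict) -> bool:
    """True for LAYOUT_* blocks (LAYOUT_FIGURE, LAYOUT_TEXT, ...)."""
    return str(block.get("BlockType") or "").startswith("LAYOUT_")


def layout_blocks_with_null_relationships(response: dict) -> list:
    """Hypothetical helper: list layout blocks whose "Relationships" is null or missing."""
    return [
        block
        for block in response.get("Blocks") or []
        if _is_layout_block(block) and block.get("Relationships") is None
    ]


def normalize_layout_relationships(response: dict) -> dict:
    """Hypothetical helper: return a deep copy in which null or missing
    "Relationships" on layout blocks are replaced with an empty list.

    Assumption: downstream code accepts an empty list here.
    """
    cleaned = copy.deepcopy(response)
    for block in cleaned.get("Blocks") or []:
        if _is_layout_block(block) and block.get("Relationships") is None:
            block["Relationships"] = []
    return cleaned


# Trimmed sample: the first block mirrors the LAYOUT_FIGURE from this issue,
# the second block and its IDs are invented for illustration.
sample = {
    "Blocks": [
        {
            "BlockType": "LAYOUT_FIGURE",
            "Id": "4859938e-4c4a-46bb-b40c-34d93486b824",
            "Page": 1,
            "Relationships": None,
        },
        {
            "BlockType": "LAYOUT_TEXT",
            "Id": "example-text-block",
            "Page": 1,
            "Relationships": [{"Type": "CHILD", "Ids": ["example-line-block"]}],
        },
    ]
}

affected_ids = [b["Id"] for b in layout_blocks_with_null_relationships(sample)]
cleaned = normalize_layout_relationships(sample)
print(affected_ids)
```

The first helper is only diagnostic: it shows which layout blocks would trip the `'NoneType' object is not subscriptable` error. The second returns a copy rather than mutating the input, so the raw Textract output stays available for a bug report.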