amazon-textract-textractor KeyError: 'Text'

KeyError: 'Text' - on documents with tables

Open dzmitry-kankalovich opened this issue 10 months ago • 1 comment

Hello,

I have a fairly normal-looking document (unfortunately I cannot share the original file, as it's proprietary) that textractprettyprinter.t_pretty_print.get_text_from_layout_json fails to parse with KeyError: 'Text'.

We've traced it to the following problem:

The document in question contains a screenshot of a table that has a selection in one of the cells: [image: problem_root_cause]

This in turn is suspected to trigger an error at this line:

File "/app/.venv/lib/python3.11/site-packages/textractprettyprinter/t_pretty_print_layout.py", line 111, in _dfs
    cell_text = " ".join([id2block[line_id]['Text'] for line_id in cell_block["Relationships"][0]['Ids']])

Inspecting the root cause (I added a try/except to the original source file) shows: [image: error]

It appears that the branch of the _dfs() function that handles tables should check that the blocks a cell references actually contain a Text property (or, alternatively, use something like .get('Text', '')).
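To illustrate the suggestion, here is a minimal sketch of a defensive lookup. It is not the library's code and not necessarily what the PR does: the helper name cell_text and the sample blocks are hypothetical, and it assumes the offending child is a SELECTION_ELEMENT block, which in Textract responses carries SelectionStatus rather than Text.

```python
def cell_text(cell_block, id2block):
    """Join the text of a cell's child blocks, skipping children without 'Text'.

    Hypothetical helper for illustration only; not part of textractprettyprinter.
    """
    # Collect child ids from CHILD relationships; tolerate cells with none at all.
    child_ids = [
        block_id
        for rel in cell_block.get("Relationships", [])
        if rel.get("Type") == "CHILD"
        for block_id in rel.get("Ids", [])
    ]
    # .get('Text', '') avoids the KeyError for blocks that carry no text.
    texts = (id2block.get(block_id, {}).get("Text", "") for block_id in child_ids)
    return " ".join(text for text in texts if text)


# Made-up blocks mimicking a table cell that contains a selection element.
id2block = {
    "w1": {"BlockType": "WORD", "Text": "Approved"},
    "s1": {"BlockType": "SELECTION_ELEMENT", "SelectionStatus": "SELECTED"},
    "w2": {"BlockType": "WORD", "Text": "yes"},
}
cell = {
    "BlockType": "CELL",
    "Relationships": [{"Type": "CHILD", "Ids": ["w1", "s1", "w2"]}],
}

print(cell_text(cell, id2block))  # Approved yes
```

Dropping empty strings before joining keeps the output free of doubled spaces where a text-less block sat between two words.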

Mar 28 '24 15:03 dzmitry-kankalovich

Opened PR with a fix: https://github.com/aws-samples/amazon-textract-textractor/pull/344

Mar 28 '24 15:03 dzmitry-kankalovich
