amazon-textract-textractor
KeyError: 'Text' - on documents with tables
Hello,
I have a fairly normal-looking document (which I unfortunately cannot share, as it's proprietary) that textractprettyprinter.t_pretty_print.get_text_from_layout_json fails to parse with KeyError: 'Text'.
We've traced it to the following problem:
The document in question contains a screenshot of a table that has a selection in one of its cells:
This in turn is suspected to trigger an error at this line:
File "/app/.venv/lib/python3.11/site-packages/textractprettyprinter/t_pretty_print_layout.py", line 111, in _dfs
cell_text = " ".join([id2block[line_id]['Text'] for line_id in cell_block["Relationships"][0]['Ids']])
If we inspect the root cause (I've added a try/except to the original source file):
It appears that the branch of the _dfs() function that handles tables should check that the blocks a cell references actually contain a Text property (or alternatively use something like .get('Text', '')).
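For illustration, here is a minimal sketch of the kind of defensive lookup I mean. It is not the library's code and not necessarily what the PR does: cell_text is a hypothetical helper, and the sample blocks are made up, assuming that the selection shows up as a SELECTION_ELEMENT child block that carries SelectionStatus but no Text.

```python
def cell_text(cell_block, id2block):
    """Join the text of a cell's child blocks, skipping children without 'Text'.

    Hypothetical helper for illustration; not part of textractprettyprinter.
    """
    texts = []
    # A cell may have no "Relationships" key at all, so fall back to an empty list.
    for rel in cell_block.get("Relationships") or []:
        if rel.get("Type") != "CHILD":
            continue
        for child_id in rel.get("Ids", []):
            # Assumption: selection elements have no "Text", so default to "".
            text = id2block.get(child_id, {}).get("Text", "")
            if text:
                texts.append(text)
    return " ".join(texts)


# Made-up sample data shaped like Textract blocks (assumed, not from the real document).
id2block = {
    "w1": {"BlockType": "WORD", "Text": "Approved"},
    "s1": {"BlockType": "SELECTION_ELEMENT", "SelectionStatus": "SELECTED"},
}
cell = {
    "BlockType": "CELL",
    "Relationships": [{"Type": "CHILD", "Ids": ["s1", "w1"]}],
}

print(cell_text(cell, id2block))
```

With the original list comprehension, the "s1" child would raise KeyError: 'Text'. Here it is simply skipped, so the selection state is dropped from the output rather than crashing the parse.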
Opened PR with a fix: https://github.com/aws-samples/amazon-textract-textractor/pull/344