amazon-textract-response-parser
Improve error messages for missing blocks when parsing incomplete JSON
Hi, my customer is receiving the error below when using the textractor with a large multi-page PDF file.
899858907a773d1d5932a263c039a8fced6b281b0e716fbd31366bff7c4392c
Traceback (most recent call last):
  File "C:\Users\YADAVA66\PycharmProjects\pythonProject\main.py", line 80, in <module>
    doc = Document(response)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 633, in __init__
    self._parse()
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 667, in _parse
    page = Page(documentPage["Blocks"], self._blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 516, in __init__
    self._parse(blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 530, in _parse
    l = Line(item, blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 142, in __init__
    if(blockMap[cid]["BlockType"] == "WORD"):
KeyError: '5e06e009-03ac-42cc-9abf-4df8f606c2af'
This is not a bug. The JSON passed to trp is incomplete, so it is missing a block ID that another block references. This usually happens when an asynchronous API (Start*) is called, the result is paginated, and only the first JSON response page is used. Use get_full_json_from_output_config or get_full_json from https://pypi.org/project/amazon-textract-caller/ to get the full JSON object, and pass that to the Textract response parser. I am keeping this issue as a reminder to update the error message so that it points here and recommends getting the full JSON.
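For anyone who wants to confirm that a truncated response is the cause before switching helpers, here is a minimal sketch of the idea, not the library's own code. It assumes the usual Textract response shape ("Blocks", "Relationships", "Ids", "NextToken"). The names collect_all_blocks, find_missing_block_ids, and fetch_page are hypothetical and exist only for this illustration. In real use, fetch_page would presumably wrap a paginated Get* call such as boto3's get_document_text_detection with the JobId and, when present, the NextToken.

```python
from typing import Callable, Optional


def collect_all_blocks(fetch_page: Callable[[Optional[str]], dict]) -> list:
    """Follow NextToken until no further page is returned.

    fetch_page is a hypothetical callable: it takes the current token
    (None for the first page) and returns one response page as a dict.
    """
    blocks = []
    token = None
    while True:
        page = fetch_page(token)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks


def find_missing_block_ids(blocks: list) -> set:
    """Return every Id referenced by a relationship but absent from blocks."""
    known = {block["Id"] for block in blocks}
    referenced = {
        child_id
        for block in blocks
        for rel in block.get("Relationships") or []
        for child_id in rel.get("Ids", [])
    }
    return referenced - known


# Fake two-page result: the LINE is on page one, its WORD child on page two.
pages = {
    None: {
        "Blocks": [
            {
                "Id": "line-1",
                "BlockType": "LINE",
                "Relationships": [{"Type": "CHILD", "Ids": ["word-1"]}],
            }
        ],
        "NextToken": "t1",
    },
    "t1": {"Blocks": [{"Id": "word-1", "BlockType": "WORD"}]},
}

first_only = pages[None]["Blocks"]
all_blocks = collect_all_blocks(lambda token: pages[token])

print(find_missing_block_ids(first_only))  # {'word-1'}
print(find_missing_block_ids(all_blocks))  # set()
```

A non-empty result on the first page alone, and an empty one after merging all pages, indicates the same situation as the KeyError above: the parser was given only part of the paginated output.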