amazon-textract-response-parser Improve error messages for missing blocks when parsing incomplete JSON

Open kkhator-aws opened this issue 1 year ago • 1 comment

Hi, my customer is receiving the error below when using the textractor with a large multi-page PDF file.

899858907a773d1d5932a263c039a8fced6b281b0e716fbd31366bff7c4392c
Traceback (most recent call last):
  File "C:\Users\YADAVA66\PycharmProjects\pythonProject\main.py", line 80, in <module>
    doc = Document(response)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 633, in __init__
    self._parse()
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 667, in _parse
    page = Page(documentPage["Blocks"], self._blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 516, in __init__
    self._parse(blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 530, in _parse
    l = Line(item, blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 142, in __init__
    if(blockMap[cid]["BlockType"] == "WORD"):
KeyError: '5e06e009-03ac-42cc-9abf-4df8f606c2af'

Jun 12 '23 18:06 kkhator-aws

This is not a bug. The JSON passed to trp is incomplete, so it is missing a block ID that another block references. This usually happens when an asynchronous API (Start*) is called, the result is paginated, and only the first JSON response page is used. Use get_full_json_from_output_config or get_full_json from https://pypi.org/project/amazon-textract-caller/ to get the full JSON object and pass that to the textract-response-parser. I am keeping this issue open as a reminder to update the error message so that it points to this explanation and recommends getting the full JSON.
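To illustrate the cause, here is a minimal sketch of the two ideas above: following NextToken until every page of an asynchronous result has been merged, and checking for referenced block IDs that are absent before handing the JSON to the parser. It is not the code from trp or amazon-textract-caller. The names collect_all_blocks, find_missing_block_ids, and the fetch_page callable are hypothetical and exist only for this example. The sketch assumes each page is a dict with a "Blocks" list and an optional "NextToken", and that blocks reference their children through "Relationships" entries holding "Ids".

```python
from typing import Callable, Optional


def collect_all_blocks(fetch_page: Callable[[Optional[str]], dict]) -> dict:
    """Merge the Blocks of every page of a paginated result into one dict.

    fetch_page is a hypothetical callable: given a NextToken (None for the
    first page) it returns one response page as a dict.
    """
    blocks = []
    first_page = None
    token = None
    while True:
        page = fetch_page(token)
        if first_page is None:
            first_page = page
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:  # no further pages
            break
    # Keep the first page's other fields, but drop the pagination token.
    merged = {k: v for k, v in first_page.items() if k != "NextToken"}
    merged["Blocks"] = blocks
    return merged


def find_missing_block_ids(blocks: list) -> set:
    """Return IDs referenced in Relationships that are not present in blocks."""
    known = {block["Id"] for block in blocks}
    referenced = set()
    for block in blocks:
        for relationship in block.get("Relationships") or []:
            referenced.update(relationship.get("Ids", []))
    return referenced - known
```

A non-empty result from find_missing_block_ids on the JSON you are about to parse suggests that only part of a paginated response was collected, which is the situation that produces the KeyError in the traceback. With boto3, fetch_page could wrap a call such as get_document_analysis with the job's JobId and the current NextToken, though that wiring is left out here as it depends on which Start* API was used.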

Jun 12 '23 19:06 schadem
