KeyError exception in Python trp package when parsing a page that doesn't have a Polygon element

Open paultipper opened this issue 2 years ago • 1 comments

A 12-page PDF document was processed by Textract, and I'm trying to use this package to parse the resulting response.json. The very first block is a PAGE block with the following Geometry element:

{
    "DocumentMetadata": { "Pages": 12 },
    "JobStatus": "SUCCEEDED",
    "NextToken": "RYAd635ujGFqn4t5XLy4H+7BT1mguxFfHvBA8pGfJ3C9FnC8Pv7Cz/+qj+v/MisnIcNR7fwh+/CfJVGIdHn/sSplCQcE2ra4ZXjtDJ9SIp6Z9v5ICHmkzGNrVtS4m4GG",
    "Blocks": [
      {
        "BlockType": "PAGE",
        "Geometry": {
          "BoundingBox": {
            "Width": 1.0,
            "Height": 1.0,
            "Left": 0.0,
            "Top": 0.0
          }
        },
        "Id": "e5413485-55aa-405c-b547-25d6f3db1251",
       "...","...."
  }]}

I've loaded the response into a dictionary and then tried to instantiate the Document class, passing the document dictionary to the constructor; when I do so, I get the following error:

./tests/TextractOutputProcessor_test.py::test_processResponseJson Failed: [undefined]KeyError: 'Polygon'
responseJsonFile = './tests/textract/response.json'

    def test_processResponseJson(responseJsonFile):
        """Test the processResponseJson method"""
    
        assert isinstance(responseJsonFile, str)
        processor = TextractOutputProcessor()
    
        try:
>           processor.loadResponseJson(responseJsonFile)

tests/TextractOutputProcessor_test.py:17: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
TextractOutputProcessor.py:24: in loadResponseJson
    self.document = Document(self.metadata)
venv/lib/python3.8/site-packages/trp/__init__.py:638: in __init__
    self._parse()
venv/lib/python3.8/site-packages/trp/__init__.py:675: in _parse
    page = Page(documentPage["Blocks"], self._blockMap)
venv/lib/python3.8/site-packages/trp/__init__.py:522: in __init__
    self._parse(blockMap)
venv/lib/python3.8/site-packages/trp/__init__.py:533: in _parse
    self._geometry = Geometry(item['Geometry'])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <trp.Geometry object at 0x7fe2a06e1910>
geometry = {'BoundingBox': {'Height': 1.0, 'Left': 0.0, 'Top': 0.0, 'Width': 1.0}}

    def __init__(self, geometry):
        boundingBox = geometry["BoundingBox"]
>       polygon = geometry["Polygon"]
E       KeyError: 'Polygon'

venv/lib/python3.8/site-packages/trp/__init__.py:111: KeyError

It seems that the Geometry class expects there to be a Polygon element within every Geometry element in the response JSON, even though Textract did not create such an element when it processed my PDF document.
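
Until the package tolerates a missing Polygon, one possible workaround is to patch the response before handing it to Document. The sketch below is only an illustration, not part of trp: the helper names `polygon_from_bounding_box` and `add_missing_polygons` are hypothetical, and it assumes that a rectangle built from the four corners of BoundingBox is an acceptable stand-in for the absent Polygon.

```python
import copy


def polygon_from_bounding_box(box):
    """Build a four-point polygon (clockwise from top-left) from a BoundingBox dict."""
    left, top = box["Left"], box["Top"]
    right, bottom = left + box["Width"], top + box["Height"]
    return [
        {"X": left, "Y": top},
        {"X": right, "Y": top},
        {"X": right, "Y": bottom},
        {"X": left, "Y": bottom},
    ]


def add_missing_polygons(response):
    """Return a copy of a Textract response in which every block that has a
    BoundingBox but no Polygon gets a Polygon derived from that BoundingBox.

    Hypothetical helper: the derived rectangle is an assumption, not data
    that Textract returned.
    """
    patched = copy.deepcopy(response)  # leave the caller's dictionary untouched
    for block in patched.get("Blocks", []):
        geometry = block.get("Geometry")
        if geometry and "BoundingBox" in geometry and "Polygon" not in geometry:
            geometry["Polygon"] = polygon_from_bounding_box(geometry["BoundingBox"])
    return patched


# Minimal response shaped like the PAGE block quoted above.
response = {
    "Blocks": [
        {
            "BlockType": "PAGE",
            "Geometry": {
                "BoundingBox": {"Width": 1.0, "Height": 1.0, "Left": 0.0, "Top": 0.0}
            },
        }
    ]
}
patched = add_missing_polygons(response)
```

The patched dictionary could then be passed to the Document constructor in place of the original. If the response was retrieved in several pages (the NextToken above suggests it was), the helper would need to be applied to each response dictionary.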

paultipper avatar Aug 18 '22 10:08 paultipper

Can you share the document? You are correct: it is optional and should not be expected, according to https://docs.aws.amazon.com/textract/latest/dg/API_Geometry.html, but I have never seen a response without Polygon, so that would be very interesting to see. @paultipper

schadem avatar Dec 02 '22 00:12 schadem