amazon-textract-response-parser
Better *InReadingOrder APIs
Hi folks & thanks for your work maintaining TRP.
Using the tool to post-process Textract results, I find the idea of the getLinesInReadingOrder function really useful... but the returned data model today is frustratingly unhelpful!
What I'd really like is methods that return the actual Line or Word objects (rather than just text), so I can still access things like the block IDs and geometries.
Today, the getTextInReadingOrder() method just returns a text string and the getLinesInReadingOrder() method returns a (particularly un-intuitive) list of [ColumnId, LineText] pairs.
- It doesn't make sense to me that just text is returned instead of the full objects, given the method name is getLines... and not e.g. getLineText...
- The concept of columns is an implementation detail of getLinesInReadingOrder() and should either be:
  a. Explicitly committed to by docstring and method renaming, e.g. getLineTextsByColumn(), or
  b. Recognised as an internal heuristic and hidden from the output.
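To make option (b) concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption for illustration: the Line and BoundingBox classes are hypothetical stand-ins rather than TRP's own classes, the function names are made up, and the column heuristic is only a guess at a simple centre-in-span rule, not a copy of the current implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BoundingBox:
    left: float
    top: float
    width: float
    height: float


@dataclass
class Line:
    """Hypothetical stand-in for a TRP line: keeps the block ID and geometry."""
    id: str
    text: str
    bbox: BoundingBox


def _assign_columns(lines: List[Line]) -> List[int]:
    """Assumed internal heuristic: a line joins the first column whose
    horizontal span contains the line's centre, else it starts a new column."""
    columns: List[Tuple[float, float]] = []  # (left, right) spans seen so far
    assigned: List[int] = []
    for line in lines:
        centre = line.bbox.left + line.bbox.width / 2
        for idx, (left, right) in enumerate(columns):
            if left < centre < right:
                assigned.append(idx)
                break
        else:
            columns.append((line.bbox.left, line.bbox.left + line.bbox.width))
            assigned.append(len(columns) - 1)
    return assigned


def get_lines_in_reading_order(lines: List[Line]) -> List[Line]:
    """Return full Line objects; the column indices never reach the caller."""
    cols = _assign_columns(lines)
    ordered = sorted(zip(cols, range(len(lines))))  # by column, then input order
    return [lines[i] for _, i in ordered]


def get_text_in_reading_order(lines: List[Line]) -> str:
    """Text-only convenience built on top of the object-returning function."""
    return "\n".join(line.text for line in get_lines_in_reading_order(lines))


# Two-column page, with lines given row by row (left, right, left, right).
page = [
    Line("a1", "Left 1", BoundingBox(0.05, 0.10, 0.40, 0.02)),
    Line("b1", "Right 1", BoundingBox(0.55, 0.10, 0.40, 0.02)),
    Line("a2", "Left 2", BoundingBox(0.05, 0.14, 0.40, 0.02)),
    Line("b2", "Right 2", BoundingBox(0.55, 0.14, 0.40, 0.02)),
]
ordered = get_lines_in_reading_order(page)
print([line.id for line in ordered])
```

With this shape, callers who only want text can still use the text helper, while callers who need IDs or geometries get them from the same ordering logic.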
I also see that the column detection, as implemented so far, is pretty simple and likely to do some weird things on documents like forms or posters, whose column layouts may not stay vertically consistent down the page.
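As an illustration of the kind of weirdness I mean, here is a small sketch. It assumes a simple rule where each line joins the first column whose horizontal span contains the line's centre; that rule, the function name and the (text, left, width) record format are my own simplifications, not the actual TRP code.

```python
from typing import List, Tuple

# Hypothetical, simplified line record: (text, left, width), listed top to bottom.
LineBox = Tuple[str, float, float]


def order_by_simple_columns(lines: List[LineBox]) -> List[str]:
    """Order line texts by column index, then by original position."""
    columns: List[Tuple[float, float]] = []  # (left, right) spans seen so far
    tagged = []
    for pos, (text, left, width) in enumerate(lines):
        centre = left + width / 2
        for idx, (c_left, c_right) in enumerate(columns):
            if c_left < centre < c_right:
                break  # idx now points at the matching column
        else:
            columns.append((left, left + width))
            idx = len(columns) - 1
        tagged.append((idx, pos, text))
    return [text for _, _, text in sorted(tagged)]


two_columns = [
    ("Left 1", 0.05, 0.40),
    ("Right 1", 0.55, 0.40),
    ("Left 2", 0.05, 0.40),
    ("Right 2", 0.55, 0.40),
]
# Same page, but with a full-width heading above the two columns.
poster = [("Full-width heading", 0.05, 0.90)] + two_columns

print(order_by_simple_columns(two_columns))
print(order_by_simple_columns(poster))
```

On the plain two-column page the left column is read before the right one. On the poster, the heading creates one wide column that swallows every later line, so the left and right columns come out interleaved in their original row-by-row order.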
So I would ask:
- How open/resistant would we be to making breaking changes to the existing getLinesInReadingOrder API, to try and bring the naming and functionality closer together?
- What's the perspective on documents with more advanced not-quite-columns structure: Is the raw order of tokens output from Textract likely to be a better approximation of the reading order? Is there appetite to develop more sophisticated rules in TRP, or not really, as the complexity makes it a bit of a losing battle?
I just added a serializer/deserializer for the Textract JSON response, with an example of transparently ordering the items in the response on an object and serializing back to the Textract JSON format (see https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/README.md). Does that help?
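For anyone reading along without the library at hand, the general idea of ordering blocks and writing them back out can be sketched with the standard library alone. This is not the serializer from the README: the function and helper names are hypothetical, and it assumes only the usual Textract response shape (a "Blocks" list whose items carry "Geometry" and "BoundingBox" fields, plus an optional "Page" number).

```python
import json
from typing import Any, Dict


def order_blocks_top_to_bottom(response: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the response with Blocks sorted by page, Top, then Left."""
    def sort_key(block: Dict[str, Any]):
        box = block["Geometry"]["BoundingBox"]
        return (block.get("Page", 1), box["Top"], box["Left"])

    ordered = dict(response)  # shallow copy; the block dicts are not modified
    ordered["Blocks"] = sorted(response["Blocks"], key=sort_key)
    return ordered


def make_line(block_id: str, text: str, top: float) -> Dict[str, Any]:
    """Build a minimal LINE block for the example (hypothetical helper)."""
    return {
        "BlockType": "LINE",
        "Id": block_id,
        "Text": text,
        "Geometry": {
            "BoundingBox": {"Left": 0.1, "Top": top, "Width": 0.8, "Height": 0.02}
        },
    }


sample = {"Blocks": [make_line("l2", "Second line", 0.30),
                     make_line("l1", "First line", 0.10)]}

# Order the blocks, serialize to JSON, and parse back to show nothing is lost.
round_tripped = json.loads(json.dumps(order_blocks_top_to_bottom(sample)))
print([block["Id"] for block in round_tripped["Blocks"]])
```

Because the output is still plain Textract-shaped JSON, block IDs and geometries survive the reordering, which is the part the text-only methods currently drop.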