amazon-textract-response-parser

Better *InReadingOrder APIs

Open • athewsey opened this issue 4 years ago • 1 comment

Hi folks & thanks for your work maintaining TRP.

Using the tool to post-process Textract results, I find the idea of the getLinesInReadingOrder function really useful... but the returned data model today is frustratingly unhelpful!

What I'd really like are methods that return the actual Line or Word objects (rather than just text), so I can still access things like the block IDs and geometries.

Today, the getTextInReadingOrder() method just returns a text string, and the getLinesInReadingOrder() method returns a (particularly unintuitive) list of [ColumnId, LineText] pairs.

  1. It doesn't make sense to me that just text is returned instead of the full objects, given the method name is getLines... and not e.g. getLineText...
  2. The concept of columns is an implementation detail of getLinesInReadingOrder() and should either be:
     a. Explicitly committed to by docstring and method renaming, e.g. getLineTextsByColumn(), or
     b. Recognised as an internal heuristic and hidden from the output.
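To make the request concrete, here is a minimal sketch of what option (b) could look like: a function that returns the full line objects in reading order and keeps the column grouping internal. Everything here is hypothetical and for illustration only. `Line`, `BoundingBox`, and `lines_in_reading_order` are stand-ins rather than TRP's actual classes or API, and the centre-in-column test is a deliberately simple heuristic, not a copy of the library's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BoundingBox:
    # Stand-in for a Textract bounding box (page-relative coordinates in 0..1).
    left: float
    top: float
    width: float
    height: float


@dataclass
class Line:
    # Stand-in for a parsed LINE block: keeps the block ID and geometry.
    id: str
    text: str
    bbox: BoundingBox


def lines_in_reading_order(lines: List[Line]) -> List[Line]:
    """Return full Line objects, column by column, top to bottom.

    Column detection is an internal heuristic and is not exposed in the output.
    """
    columns = []  # each entry: {"left": float, "right": float, "lines": [Line]}
    for line in lines:
        centre = line.bbox.left + line.bbox.width / 2
        for col in columns:
            # The line joins the first column whose span contains its centre.
            if col["left"] < centre < col["right"]:
                col["lines"].append(line)
                break
        else:
            # No matching column: start a new one spanning this line.
            columns.append({
                "left": line.bbox.left,
                "right": line.bbox.left + line.bbox.width,
                "lines": [line],
            })
    columns.sort(key=lambda c: c["left"])
    return [
        line
        for col in columns
        for line in sorted(col["lines"], key=lambda l: l.bbox.top)
    ]


# A two-column page given in row-by-row order.
page = [
    Line("l1", "Left column, first line", BoundingBox(0.05, 0.10, 0.40, 0.03)),
    Line("l2", "Right column, first line", BoundingBox(0.55, 0.10, 0.40, 0.03)),
    Line("l3", "Left column, second line", BoundingBox(0.05, 0.15, 0.38, 0.03)),
    Line("l4", "Right column, second line", BoundingBox(0.55, 0.15, 0.35, 0.03)),
]
ordered = lines_in_reading_order(page)
for line in ordered:
    print(line.id, line.bbox.top, line.text)
```

Because the caller gets objects back, the text is still one attribute access away (`line.text`), while IDs and geometries remain available. A column-aware variant, option (a), could return the same objects grouped per column under a more explicit name.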

I also see that the column detection, as implemented so far, seems pretty simple and is likely to do some weird things on documents like forms or posters, which might have less vertically-static column layouts down the page.

So I would ask:

  • How open/resistant would we be to making breaking changes to the existing getLinesInReadingOrder API, to try and bring the naming and functionality closer together?
  • What's the perspective on documents with more advanced not-quite-columns structure: Is the raw order of tokens output from Textract likely to be a better approximation of the reading order? Is there appetite to develop more sophisticated rules in TRP, or not really, as the complexity makes it a bit of a losing battle?

athewsey, Jan 16 '21 13:01

I just added a serializer/deserializer for the Textract JSON response, with an example of ordering the items in the response transparently on an object and serializing back to the Textract JSON format (see https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/README.md). Does that help?
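For readers who want a feel for the general idea without the library, the following is a rough, self-contained sketch using only the standard `json` module. It is not the TRP serializer and does not reproduce the README example: `order_blocks_by_position` is a hypothetical helper, and sorting by (Top, Left) is an assumed, naive ordering rule. Only the `Blocks`, `BlockType`, `Id`, `Text`, and `Geometry.BoundingBox` field names follow the Textract response format.

```python
import json


def order_blocks_by_position(response: dict) -> dict:
    """Return a copy of a Textract-style response with LINE blocks reordered.

    LINE blocks are sorted top to bottom, then left to right. All other
    blocks keep their original positions in the list.
    """
    def position(block):
        box = block["Geometry"]["BoundingBox"]
        return (box["Top"], box["Left"])

    blocks = response["Blocks"]
    sorted_lines = iter(
        sorted((b for b in blocks if b["BlockType"] == "LINE"), key=position)
    )
    # Fill each slot that held a LINE with the next line in sorted order.
    new_blocks = [
        next(sorted_lines) if b["BlockType"] == "LINE" else b for b in blocks
    ]
    return {**response, "Blocks": new_blocks}


response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {
            "BlockType": "LINE", "Id": "b", "Text": "Second",
            "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.5, "Width": 0.3, "Height": 0.05}},
        },
        {
            "BlockType": "LINE", "Id": "a", "Text": "First",
            "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.2, "Width": 0.3, "Height": 0.05}},
        },
    ]
}

ordered_response = order_blocks_by_position(response)
# Serialize back to JSON text and parse again to confirm the round trip.
round_tripped = json.loads(json.dumps(ordered_response))
print([b["Id"] for b in round_tripped["Blocks"]])
```

Since the output keeps the same JSON shape, it can be handed to any consumer that expects a Textract response, and every block retains its ID and geometry. This sketch leaves the `Relationships` lists untouched, so child ordering inside PAGE blocks would need separate handling.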

schadem, Apr 21 '21 02:04