amazon-textract-textractor
Landscape tables result in jumbled text when using extractor.start_document_analysis with TextractFeatures.TABLES
❌ Tested on both v1.8.5 and v1.9.0; both fail
Example document: pages 8 and 9 of this document (07432326.pdf) have tables in landscape orientation
Expected:
| Broxtowe Borough Council | Foster Avenue, Beeston, Nottingham, NG9 1AB |
|---|---|
Actual:
| Council Borough Broxtowe | 1AB NG9 Nottingham, Beeston, Avenue, Foster |
|---|---|
> [!NOTE]
> This issue does not occur with portrait tables
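Looking at the two rows above, the jumbled cells appear to be the expected cells with their word order exactly reversed. A minimal sketch to check that on extracted cell text (the helper name `is_word_reversed` is made up for illustration and is not part of Textractor):

```python
def is_word_reversed(expected: str, actual: str) -> bool:
    """Return True if `actual` is `expected` with its whitespace-separated words reversed."""
    return expected.split()[::-1] == actual.split()


# Cell pairs copied from the Expected / Actual tables in this issue.
cells = [
    ("Broxtowe Borough Council", "Council Borough Broxtowe"),
    (
        "Foster Avenue, Beeston, Nottingham, NG9 1AB",
        "1AB NG9 Nottingham, Beeston, Avenue, Foster",
    ),
]

# Single-word cells trivially pass this check, so only multi-word cells are informative.
all_reversed = all(is_word_reversed(exp, act) for exp, act in cells)
print(all_reversed)
```

If this holds across the whole table, it points at an ordering problem rather than an OCR problem, since every word is recognised correctly.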
Full textraction: 07432326_ocr.txt
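from textractor import Textractor
from textractor.data.constants import TextractFeatures

# Excerpt from a larger function; xxxx stands for redacted values
extractor = Textractor()
document = extractor.start_document_analysis(
    file_source=xxxx,
    save_image=False,
    features=[TextractFeatures.TABLES],
    s3_upload_path=xxxx,
)
return document.response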
> [!IMPORTANT]
> ✅ This used to work on v1.4.5 - here's the same document on that version
Example extraction: v1.4.5.txt
We expect:
| Broxtowe Borough Council | Foster Avenue, Beeston, Nottingham, NG9 1AB |
|---|---|
and we actually get:
| Broxtowe Borough Council | Foster Avenue, Beeston, Nottingham, NG9 1AB |
|---|---|
Here's a diff between textractor.py v1.4.5 (left) and v1.9.0 (right): https://www.diffchecker.com/4JsE2FLv/
Thank you for the thorough and nicely formatted issue. My hunch would be that there is sorting happening somewhere that incorrectly assumes that all text is left to right and top to bottom, irrespective of the rectification.
It's likely something that changed in response_parser.py, table.py or table_cell.py. I will try to take a look this week.
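To make the hunch concrete, here is a small self-contained sketch of how a fixed top-to-bottom, left-to-right sort would reverse the words of a cell whose text runs bottom to top on the page. This is not Textractor's code: the `Word` class, both functions, and the coordinates are invented for illustration, and the assumption is that the landscape page is rotated so that reading direction runs up the page.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge in normalized page coordinates (hypothetical values)
    y: float  # top edge in normalized page coordinates (hypothetical values)


def naive_reading_order(words):
    """Sort top-to-bottom, then left-to-right, ignoring any page rotation."""
    return sorted(words, key=lambda w: (w.y, w.x))


def rotation_aware_order(words, reads_bottom_to_top):
    """Same sort, but flip the vertical axis when the text runs bottom to top."""
    if reads_bottom_to_top:
        return sorted(words, key=lambda w: (-w.y, w.x))
    return naive_reading_order(words)


# One cell on a rotated page: the first word sits lowest on the page.
cell = [
    Word("Broxtowe", 0.30, 0.80),
    Word("Borough", 0.30, 0.65),
    Word("Council", 0.30, 0.50),
]

naive = " ".join(w.text for w in naive_reading_order(cell))
aware = " ".join(w.text for w in rotation_aware_order(cell, reads_bottom_to_top=True))
print(naive)  # Council Borough Broxtowe
print(aware)  # Broxtowe Borough Council
```

The naive ordering reproduces the jumbled output reported above, which is consistent with the sorting explanation but does not prove where in the library it happens.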
Hi, any updates on this or ideas for workarounds? We're seeing this a lot and it makes textract / textractor effectively useless for our use case.
Hi @Beval, this may be of some help to you in diagnosing: I ran a git bisect and the bad commit was flagged as this one
Commit: https://github.com/aws-samples/amazon-textract-textractor/commit/f471a532d2d2aecd72a1c8150731d30d9ce672ab
PR: https://github.com/aws-samples/amazon-textract-textractor/pull/265
I believe this was released in v1.5.0, which is consistent with us not seeing the issue in v1.4.5 but seeing it after we updated to v1.8.5: https://github.com/aws-samples/amazon-textract-textractor/releases/tag/v1.5.0
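Until there is a fix, one way to tell whether an environment is likely affected is to compare the installed version against v1.5.0. The sketch below assumes the bisect result above is right about v1.5.0 being the first affected release; the helper names are made up, and the version parsing is deliberately simple (it only looks at the first three numeric components).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Assumption taken from the bisect above: v1.5.0 is the first affected release.
FIRST_AFFECTED = (1, 5, 0)


def parse_version(raw: str) -> tuple:
    """Turn '1.9.0' or 'v1.4.5' into a (major, minor, patch) tuple of ints."""
    numbers = [int(n) for n in re.findall(r"\d+", raw)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def predates_regression(raw: str) -> bool:
    """True if the given version is older than the first affected release."""
    return parse_version(raw) < FIRST_AFFECTED


def installed_textractor_version():
    """Return the installed amazon-textract-textractor version, or None if absent."""
    try:
        return version("amazon-textract-textractor")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_textractor_version()
    if found is None:
        print("amazon-textract-textractor is not installed")
    elif predates_regression(found):
        print(f"{found} predates v1.5.0; landscape tables should read correctly")
    else:
        print(f"{found} is v1.5.0 or later; landscape tables may come out reversed")
```

Pinning to v1.4.5, the last version confirmed working in this thread, is a possible stopgap, at the cost of missing whatever else changed since then.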