
Landscape tables result in jumbled text when using extractor.start_document_analysis with TextractFeatures.TABLES

Open gertct opened this issue 8 months ago • 3 comments

❌ Tested on both v1.8.5 and v1.9.0; both fail.

Example document: pages 8 and 9 of this document (07432326.pdf) have tables in landscape orientation.

Expected:

Broxtowe Borough Council Foster Avenue, Beeston, Nottingham, NG9 1AB

Actual:

Council Borough Broxtowe 1AB NG9 Nottingham, Beeston, Avenue, Foster

[!NOTE]
This issue does not exist on portrait tables

Full textraction: 07432326_ocr.txt

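from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor()

# Asynchronous analysis with table detection only; xxxx values are redacted.
document = extractor.start_document_analysis(
    file_source=xxxx,
    save_image=False,
    features=[TextractFeatures.TABLES],
    s3_upload_path=xxxx,
)

# Excerpt from inside a function: the raw Textract response is returned.
return document.response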

[!IMPORTANT]
✅ This used to work on v1.4.5 - here's the same document on that version

Example extraction: v1.4.5.txt

We expect:

Broxtowe Borough Council Foster Avenue, Beeston, Nottingham, NG9 1AB

and we actually get:

Broxtowe Borough Council Foster Avenue, Beeston, Nottingham, NG9 1AB

Here's a diff between textractor.py v1.4.5 (left) and v1.9.0 (right): https://www.diffchecker.com/4JsE2FLv/

gertct, Mar 11 '25 17:03

Thank you for the thorough and nicely formatted issue. My hunch would be that there is sorting happening somewhere that incorrectly assumes that all text is left to right and top to bottom, irrespective of the rectification.

It's likely something that changed in response_parser.py, table.py or table_cell.py. I will try to take a look this week.
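To make the hunch concrete, here is a minimal, self-contained sketch of how a sort that assumes upright text reverses the words of a rotated cell. It is not Textractor's code: `Word`, `reading_order` and `place_rotated` are hypothetical names, the coordinates are invented, and the 90 degree counter-clockwise rotation is an assumption about how the landscape pages appear in the image.

```python
import math
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # horizontal centre of the word box, 0..1, left to right
    y: float  # vertical centre of the word box, 0..1, top to bottom


def reading_order(words, angle_degrees=0.0):
    """Order words by line, then by position along the line.

    angle_degrees is the assumed counter-clockwise rotation of the text as it
    appears in the image. The default of 0 is the naive "top to bottom, left
    to right" ordering.
    """
    theta = math.radians(angle_degrees)
    along = (math.cos(theta), -math.sin(theta))  # direction a line of text runs
    across = (math.sin(theta), math.cos(theta))  # direction successive lines advance

    def key(word):
        line = word.x * across[0] + word.y * across[1]
        position = word.x * along[0] + word.y * along[1]
        # Rounding groups words of the same line despite floating-point noise.
        return (round(line, 2), position)

    return [w.text for w in sorted(words, key=key)]


def place_rotated(text, x, y_start, step):
    """Lay words out reading bottom to top, as on a page rotated 90 degrees
    counter-clockwise (made-up coordinates, for illustration only)."""
    return [Word(t, x, y_start - i * step) for i, t in enumerate(text.split())]


name_cell = place_rotated("Broxtowe Borough Council", 0.30, 0.80, 0.20)
address_cell = place_rotated(
    "Foster Avenue, Beeston, Nottingham, NG9 1AB", 0.40, 0.90, 0.12
)

for cell in (name_cell, address_cell):
    print("naive:         ", " ".join(reading_order(cell)))
    print("rotation-aware:", " ".join(reading_order(cell, angle_degrees=90)))
```

Under these assumptions the naive ordering prints each cell's words in reverse, which has the same shape as the "Actual" output in the report, while projecting onto the rotated reading direction restores the expected order. Whether the regression really comes from a sort like this would still need to be confirmed against the parser code.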

Belval, Mar 13 '25 14:03

Hi, any updates on this or ideas for workarounds? We're seeing this a lot and it makes textract / textractor effectively useless for our use case.

benjaminsims, Apr 08 '25 11:04

Hi @Belval, this may help with diagnosis: I ran a git bisect and it flagged this as the bad commit:

Commit: https://github.com/aws-samples/amazon-textract-textractor/commit/f471a532d2d2aecd72a1c8150731d30d9ce672ab
PR: https://github.com/aws-samples/amazon-textract-textractor/pull/265

I believe this was released in v1.5.0, which is consistent with us not seeing the issue in v1.4.5 but seeing it after we updated to v1.8.5: https://github.com/aws-samples/amazon-textract-textractor/releases/tag/v1.5.0

gertct, Apr 09 '25 10:04