amazon-textract-textractor

Text is extracted but not grouped into forms and tables correctly

Open sichen1234 opened this issue 4 years ago • 1 comments

We're starting an invoice processing project and really like this library, but we're having one interesting issue: The text is all parsed correctly, but then it is not always grouped into forms and tables correctly.

So for example we could have this block in the text file:

SOLD TO: SHIP TO:
CUSTOMER COMPANY JANE RECIPIENT
123 SOMEWHERE ST 987 ANOTHER PL
LOS ANGELES CA 90001 SAN FRANCISCO CA 94100
USA USA

But then it does not show up as a SOLD TO: or SHIP TO: entry in the forms or tables output.

When it does show up, the confidence of the form field is low (around 37), even though the confidence of the text itself is very high: "BlockType": "LINE", "Confidence": 99.85072326660156,
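To make the gap between the two confidence values concrete, here is a minimal sketch that lists each form key with its value and its confidence. It assumes the standard Textract AnalyzeDocument response layout (`Blocks`, `KEY_VALUE_SET` blocks with `EntityTypes`, and `CHILD`/`VALUE` relationships). The function name `key_value_confidences` and the sample data are hypothetical, made up for illustration only.

```python
def key_value_confidences(response):
    """Return (key_text, value_text, key_confidence) for every KEY block."""
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def child_text(block):
        # Join the WORD children of a block into a single string.
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    child = blocks[child_id]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
        return " ".join(words)

    pairs = []
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                # A KEY block points at its VALUE block(s) by Id.
                value_text = " ".join(child_text(blocks[v]) for v in rel["Ids"])
        pairs.append((child_text(block), value_text, block["Confidence"]))
    return pairs


# Hypothetical, hand-written sample; not real Textract output.
sample_response = {
    "Blocks": [
        {"Id": "w1", "BlockType": "WORD", "Text": "SOLD", "Confidence": 99.8},
        {"Id": "w2", "BlockType": "WORD", "Text": "TO:", "Confidence": 99.8},
        {"Id": "w3", "BlockType": "WORD", "Text": "CUSTOMER", "Confidence": 99.8},
        {"Id": "w4", "BlockType": "WORD", "Text": "COMPANY", "Confidence": 99.8},
        {
            "Id": "k1",
            "BlockType": "KEY_VALUE_SET",
            "EntityTypes": ["KEY"],
            "Confidence": 37.0,
            "Relationships": [
                {"Type": "VALUE", "Ids": ["v1"]},
                {"Type": "CHILD", "Ids": ["w1", "w2"]},
            ],
        },
        {
            "Id": "v1",
            "BlockType": "KEY_VALUE_SET",
            "EntityTypes": ["VALUE"],
            "Confidence": 37.0,
            "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}],
        },
    ]
}

for key, value, confidence in key_value_confidences(sample_response):
    print(f"{key!r} -> {value!r} (key confidence {confidence})")
```

Comparing these numbers against the `Confidence` of the matching `LINE` blocks shows whether the OCR or the key-value grouping is the weak step.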

Could this be caused by the position of the text relative to its labels?

Should we try to adjust the forms/tables parsing algorithm? Or should we work with the raw text only, rely on its repeating patterns, and not worry about forms and tables?
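As a rough sketch of the text-only route, the `LINE` blocks can be grouped by their bounding boxes so that the two address columns are separated again. This assumes each column's text arrives as its own `LINE` block with the usual `Geometry.BoundingBox` fields (`Left`, `Top`, `Width`, `Height`, all as ratios of the page size). The function name, the `split_x` threshold, and the sample coordinates are hypothetical and would need tuning per invoice layout.

```python
def split_lines_into_columns(response, split_x=0.5):
    """Group LINE texts into left/right columns, each in top-to-bottom order."""
    lines = [b for b in response["Blocks"] if b["BlockType"] == "LINE"]
    # Sort by vertical position so each column reads top to bottom.
    lines.sort(key=lambda b: b["Geometry"]["BoundingBox"]["Top"])

    columns = {"left": [], "right": []}
    for block in lines:
        box = block["Geometry"]["BoundingBox"]
        center_x = box["Left"] + box["Width"] / 2
        side = "left" if center_x < split_x else "right"
        columns[side].append(block["Text"])
    return columns


def make_line(text, left, top):
    # Helper that builds a hypothetical LINE block for the example below.
    return {
        "BlockType": "LINE",
        "Text": text,
        "Geometry": {
            "BoundingBox": {"Left": left, "Top": top, "Width": 0.3, "Height": 0.02}
        },
    }


# Hand-written sample mirroring the address block above; not real Textract output.
sample_lines = {
    "Blocks": [
        make_line("SOLD TO:", 0.05, 0.10),
        make_line("SHIP TO:", 0.55, 0.10),
        make_line("CUSTOMER COMPANY", 0.05, 0.13),
        make_line("JANE RECIPIENT", 0.55, 0.13),
        make_line("123 SOMEWHERE ST", 0.05, 0.16),
        make_line("987 ANOTHER PL", 0.55, 0.16),
    ]
}

columns = split_lines_into_columns(sample_lines)
print(columns["left"])
print(columns["right"])
```

A fixed threshold is the simplest possible choice; it only works when the column boundary sits at roughly the same place on every page.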

sichen1234 avatar Jul 03 '20 17:07 sichen1234

Sorry for the late response. Could you post a sample image to test?

schadem avatar Dec 09 '20 23:12 schadem

Closing for inactivity.

Belval avatar Mar 08 '24 13:03 Belval