amazon-textract-textractor Text is extracted but not grouped into forms and tables correctly

Text is extracted but not grouped into forms and tables correctly

Open sichen1234 opened this issue 4 years ago • 1 comments

We're starting an invoice processing project and really like this library, but we're having one interesting issue: The text is all parsed correctly, but then it is not always grouped into forms and tables correctly.

So for example we could have this block in the text file:

    SOLD TO:                 SHIP TO:
    CUSTOMER COMPANY         JANE RECIPIENT
    123 SOMEWHERE ST         987 ANOTHER PL
    LOS ANGELES CA 90001     SAN FRANCISCO CA 94100
    USA                      USA

But then it does not show up under SOLD TO: or SHIP TO: in the forms or tables output.

When it does show up, the confidence level of the form entry is low (around 37), even though the confidence of the text itself is very high: "BlockType": "LINE", "Confidence": 99.85072326660156
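To make that comparison concrete, here is a minimal sketch that summarizes confidence per block type from the raw Textract JSON. It assumes the standard response layout (a `Blocks` list whose entries carry `BlockType`, `Confidence`, and, for `KEY_VALUE_SET` blocks, `EntityTypes`). The helper name `confidence_by_block_type` and the sample blocks are hypothetical and hand-written for illustration; they are not real output from our invoices.

```python
from collections import defaultdict


def confidence_by_block_type(blocks):
    """Return {label: (min, mean, max)} for every block that has a Confidence."""
    scores = defaultdict(list)
    for block in blocks:
        if "Confidence" not in block:
            continue
        label = block["BlockType"]
        # KEY_VALUE_SET blocks are split into KEY and VALUE through EntityTypes.
        if label == "KEY_VALUE_SET":
            label += ":" + "/".join(block.get("EntityTypes", []))
        scores[label].append(block["Confidence"])
    return {k: (min(v), sum(v) / len(v), max(v)) for k, v in scores.items()}


# Hypothetical sample standing in for response["Blocks"].
sample_blocks = [
    {"BlockType": "LINE", "Confidence": 99.85, "Text": "SOLD TO:"},
    {"BlockType": "LINE", "Confidence": 99.5, "Text": "CUSTOMER COMPANY"},
    {"BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"], "Confidence": 37.0},
    {"BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"], "Confidence": 37.0},
]

summary = confidence_by_block_type(sample_blocks)
for label, (low, mean, high) in sorted(summary.items()):
    print(f"{label}: min={low:.2f} mean={mean:.2f} max={high:.2f}")
```

If the LINE scores are consistently high while the KEY/VALUE scores are low, that would suggest the OCR is fine and only the key-value association is uncertain.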

Is this a problem of the relative position of the text versus their labels?

Should we try to adjust the forms/tables parsing algorithm? Or should we just work with the text and try to go with the repeating patterns of text, and not worry about forms and tables?
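As a rough idea of what "just working with the text" could look like, below is a sketch that groups LINE blocks under the nearest header by position, using only the `Geometry.BoundingBox` values (`Left`, `Top`) that Textract returns for each block. It is a heuristic, not the library's forms/tables algorithm. The function `group_lines_under_headers`, the helper `make_line`, and all coordinates are hypothetical, and it assumes Textract emits the left and right address columns as separate LINE blocks, which may not hold for every layout.

```python
def make_line(text, left, top):
    """Build a minimal LINE block (hypothetical test data, not real output)."""
    return {
        "BlockType": "LINE",
        "Text": text,
        "Geometry": {"BoundingBox": {"Left": left, "Top": top}},
    }


def group_lines_under_headers(blocks, headers):
    """Assign each LINE below a header to the header with the closest left edge."""
    lines = [b for b in blocks if b["BlockType"] == "LINE"]
    anchors = {
        b["Text"]: b["Geometry"]["BoundingBox"]
        for b in lines
        if b["Text"] in headers
    }
    groups = {header: [] for header in anchors}
    # Walk the page top to bottom so each group keeps reading order.
    for block in sorted(lines, key=lambda b: b["Geometry"]["BoundingBox"]["Top"]):
        if block["Text"] in anchors:
            continue
        box = block["Geometry"]["BoundingBox"]
        below = [h for h, a in anchors.items() if box["Top"] > a["Top"]]
        if not below:
            continue  # Line sits above every header, so leave it ungrouped.
        nearest = min(below, key=lambda h: abs(box["Left"] - anchors[h]["Left"]))
        groups[nearest].append(block["Text"])
    return groups


# Hypothetical coordinates mimicking the two-column address block above.
sample_blocks = [
    make_line("SOLD TO:", 0.10, 0.10),
    make_line("SHIP TO:", 0.55, 0.10),
    make_line("CUSTOMER COMPANY", 0.10, 0.13),
    make_line("JANE RECIPIENT", 0.55, 0.13),
    make_line("123 SOMEWHERE ST", 0.10, 0.16),
    make_line("987 ANOTHER PL", 0.55, 0.16),
    make_line("LOS ANGELES CA 90001", 0.10, 0.19),
    make_line("SAN FRANCISCO CA 94100", 0.55, 0.19),
    make_line("USA", 0.10, 0.22),
    make_line("USA", 0.55, 0.22),
]

groups = group_lines_under_headers(sample_blocks, {"SOLD TO:", "SHIP TO:"})
for header, body in groups.items():
    print(header, body)
```

A real invoice would also need a stopping rule (for example, a maximum vertical distance from the header) so that line items further down the page are not swept into the address groups.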

Jul 03 '20 17:07 sichen1234

Sorry for the late response. Could you post a sample image to test?

Dec 09 '20 23:12 schadem

Closing for inactivity.

Mar 08 '24 13:03 Belval
