amazon-textract-textractor
Text is extracted but not grouped into forms and tables correctly
We're starting an invoice processing project and really like this library, but we're running into one issue: the text is all parsed correctly, but it is not always grouped into forms and tables correctly.
So for example we could have this block in the text file:
SOLD TO:                 SHIP TO:
CUSTOMER COMPANY         JANE RECIPIENT
123 SOMEWHERE ST         987 ANOTHER PL
LOS ANGELES CA 90001     SAN FRANCISCO CA 94100
USA                      USA
But then it does not show up under a SOLD TO: or SHIP TO: key in the forms or tables.
When it does show up, the confidence of the form entry is low (around 37), even though the confidence of the text itself is very high: "BlockType": "LINE", "Confidence": 99.85072326660156,
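To make the gap between text confidence and form confidence concrete, here is a minimal sketch that averages `Confidence` per `BlockType` over the raw Textract JSON. The `response` dict is made-up sample data in the shape of an AnalyzeDocument response, and the helper names `mean_confidence_by_type` and `low_confidence_blocks` are hypothetical, not part of the library.

```python
from collections import defaultdict

# Hypothetical, trimmed-down sample in the shape of a Textract
# AnalyzeDocument response. The values are illustrative only.
response = {
    "Blocks": [
        {"BlockType": "LINE", "Confidence": 99.85, "Text": "SOLD TO:"},
        {"BlockType": "LINE", "Confidence": 99.70, "Text": "CUSTOMER COMPANY"},
        {"BlockType": "KEY_VALUE_SET", "Confidence": 37.0, "EntityTypes": ["KEY"]},
        {"BlockType": "KEY_VALUE_SET", "Confidence": 41.0, "EntityTypes": ["VALUE"]},
    ]
}


def mean_confidence_by_type(blocks):
    """Return the average Confidence for each BlockType."""
    scores = defaultdict(list)
    for block in blocks:
        scores[block["BlockType"]].append(block["Confidence"])
    return {block_type: sum(v) / len(v) for block_type, v in scores.items()}


def low_confidence_blocks(blocks, block_type, threshold):
    """Return blocks of one type whose Confidence is below the threshold."""
    return [
        b for b in blocks
        if b["BlockType"] == block_type and b["Confidence"] < threshold
    ]


summary = mean_confidence_by_type(response["Blocks"])
weak_pairs = low_confidence_blocks(response["Blocks"], "KEY_VALUE_SET", 50)

for block_type, mean in summary.items():
    print(f"{block_type}: {mean:.2f}")
```

If the `LINE` average is high while the `KEY_VALUE_SET` average is low, the OCR itself is fine and the weak spot is the key-value association.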
Is this a problem of the relative position of the text versus their labels?
Should we try to adjust the forms/tables parsing algorithm? Or should we just work with the raw text, rely on its repeating patterns, and not worry about forms and tables?
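For the text-only route, one possible approach is to regroup the `LINE` blocks by their position on the page. This is a rough sketch, assuming a two-column layout like the one above: the `lines` data is invented, the `split_at=0.5` page midpoint is an assumption that would need tuning per template, and `make_line` and `split_into_columns` are hypothetical helpers. Only the `Geometry.BoundingBox` fields (`Left`, `Top`, given as ratios of page size) come from the Textract response format.

```python
def make_line(text, left, top):
    """Build a minimal LINE block with only the fields used below."""
    return {
        "BlockType": "LINE",
        "Text": text,
        "Geometry": {"BoundingBox": {"Left": left, "Top": top}},
    }


# Invented sample: two address columns, listed in interleaved reading order.
lines = [
    make_line("SOLD TO:", 0.10, 0.20),
    make_line("SHIP TO:", 0.55, 0.20),
    make_line("CUSTOMER COMPANY", 0.10, 0.23),
    make_line("JANE RECIPIENT", 0.55, 0.23),
    make_line("123 SOMEWHERE ST", 0.10, 0.26),
    make_line("987 ANOTHER PL", 0.55, 0.26),
]


def split_into_columns(blocks, split_at=0.5):
    """Split LINE blocks into left/right columns, each ordered top to bottom."""
    def top(block):
        return block["Geometry"]["BoundingBox"]["Top"]

    left, right = [], []
    for block in blocks:
        if block["BlockType"] != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        (left if box["Left"] < split_at else right).append(block)

    return (
        [b["Text"] for b in sorted(left, key=top)],
        [b["Text"] for b in sorted(right, key=top)],
    )


sold_to, ship_to = split_into_columns(lines)
print(sold_to)
print(ship_to)
```

The first entry of each column is then the label (SOLD TO:, SHIP TO:) and the rest is its value, which sidesteps the forms output entirely for layouts that repeat across invoices.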
Sorry for the late response. Could you post a sample image to test?
Closing for inactivity.