amazon-textract-response-parser

Tables spanning pages are not merged correctly when a PDF contains more than one such table

Open jshipway opened this issue 3 years ago • 2 comments

The code sample below works well for merging a single table that spans multiple pages, but we cannot get it to work when a document contains several tables that each span multiple pages. If the first table spans multiple pages it is merged correctly, but subsequent tables that span pages are not merged. This is the code we are using for the test, taken from https://github.com/aws-samples/amazon-textract-multipage-tables-processing:

textract_json = call_textract(input_document=s3_uri_of_documents, features=[Textract_Features.TABLES], boto3_textract_client=textract_client)
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NONE)
json_data = t2.TDocumentSchema().dump(t_document)
PrettyPrintTables(json_data)
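
One way to make the symptom easy to check is to count the TABLE blocks per page in the JSON before and after the merge. This is only a minimal sketch: it assumes the JSON keeps the standard Textract response shape (a top-level `Blocks` list whose entries carry `BlockType` and `Page`), and `count_tables_by_page` is a hypothetical helper, not part of the library.

```python
from collections import Counter
from typing import Any, Dict


def count_tables_by_page(textract_json: Dict[str, Any]) -> Dict[int, int]:
    """Return {page_number: number_of_TABLE_blocks} for a Textract-style dict."""
    counts: Counter = Counter()
    for block in textract_json.get("Blocks", []):
        if block.get("BlockType") == "TABLE":
            # Assumption: fall back to page 1 if a block carries no "Page" key.
            counts[block.get("Page", 1)] += 1
    return dict(sorted(counts.items()))


# Synthetic stand-in for a Textract response (not real output from the test PDF).
sample = {
    "Blocks": [
        {"BlockType": "TABLE", "Page": 1, "Id": "a1"},
        {"BlockType": "TABLE", "Page": 2, "Id": "a2"},
        {"BlockType": "TABLE", "Page": 2, "Id": "b1"},
        {"BlockType": "TABLE", "Page": 3, "Id": "b2"},
        {"BlockType": "LINE", "Page": 1, "Id": "l1"},
    ]
}
print(count_tables_by_page(sample))  # {1: 1, 2: 2, 3: 1}
```

Calling the helper on `textract_json` and again on `json_data` would show how many table blocks the merge step actually removed, assuming merged fragments are dropped from `Blocks`.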

To reproduce this, create a PDF with two tables (a test PDF is attached to this issue): the first table spans pages one and two, and the second spans pages two and three. In our test, the first table is merged correctly, but the second is not, so the final result is three tables rather than two.
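
The expected outcome for this layout can be written down as a small model. The sketch below is not the library's merge logic. It assumes, for illustration only, that each page's table fragments are listed top to bottom and that the last fragment on a page always continues into the first fragment on the next page, and it deliberately ignores any compatibility checks a real merge would need. `group_spanning_tables` and the fragment ids are made up.

```python
from typing import Dict, List, Optional


def group_spanning_tables(tables_by_page: Dict[int, List[str]]) -> List[List[str]]:
    """Chain table fragments across consecutive pages into logical tables."""
    groups: List[List[str]] = []
    carry: Optional[List[str]] = None  # group ending at the bottom of the previous page
    prev_page: Optional[int] = None
    for page in sorted(tables_by_page):
        fragments = tables_by_page[page]
        if not fragments:
            carry = None  # a page without tables breaks any chain
            prev_page = page
            continue
        for index, fragment in enumerate(fragments):
            follows_previous_page = prev_page is not None and prev_page == page - 1
            if index == 0 and carry is not None and follows_previous_page:
                carry.append(fragment)  # continuation of the previous page's last table
            else:
                groups.append([fragment])  # start of a new logical table
        carry = groups[-1]  # group holding this page's last fragment
        prev_page = page
    return groups


# Layout from the report: table A on pages 1-2, table B on pages 2-3.
layout = {1: ["A1"], 2: ["A2", "B1"], 3: ["B2"]}
merged = group_spanning_tables(layout)
print(merged)       # [['A1', 'A2'], ['B1', 'B2']]
print(len(merged))  # 2
```

Under these assumptions the layout yields two logical tables, whereas the run described here ends with three, so only the first chain is being joined.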

test_textract_tables.pdf

jshipway commented on Dec 15 '21 at 21:12

Adding another test case: principal.pdf

jshipway commented on Dec 16 '21 at 18:12

@jshipway is this issue resolved now?

tb102122 commented on Jun 14 '22 at 11:06