amazon-textract-response-parser Tables spanning pages from PDFs with more than one table spanning pages are not merged correctly

Open jshipway opened this issue 3 years ago • 2 comments

The code sample below works well for merging a single table that spans multiple pages, but we cannot get it to work fully when a document contains several tables that span multiple pages. If the first table spans multiple pages, it is merged correctly, but subsequent tables that span multiple pages are not merged. Here is the code we are using for the test, taken from https://github.com/aws-samples/amazon-textract-multipage-tables-processing:

textract_json = call_textract(input_document=s3_uri_of_documents, features=[Textract_Features.TABLES], boto3_textract_client=textract_client)
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NONE)
json_data = t2.TDocumentSchema().dump(t_document)
PrettyPrintTables(json_data)

To reproduce this, create a PDF with two tables (I have attached a test PDF document to this issue), the first spanning pages one and two and the second spanning pages two and three. In our test, the first table is merged correctly but the second is not, so the final result is three tables rather than two.
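One way to check the table count without reading the printed output is to count the TABLE blocks left in the dumped JSON. The snippet below is only a sketch: `count_tables_per_page` is a hypothetical helper (not part of amazon-textract-response-parser), `sample` is a hand-made stand-in for `json_data`, and it assumes that each merged table is represented by a single block with `BlockType` equal to `"TABLE"`.

```python
from collections import Counter
from typing import Any, Dict


def count_tables_per_page(textract_json: Dict[str, Any]) -> Dict[int, int]:
    """Count TABLE blocks per page in a Textract-style response dict.

    Hypothetical helper for illustration; not a library function.
    """
    counts: Counter = Counter()
    for block in textract_json.get("Blocks") or []:
        if block.get("BlockType") == "TABLE":
            # Assumption: a block without a "Page" key belongs to page 1.
            counts[block.get("Page", 1)] += 1
    return dict(counts)


# Hand-made stand-in for json_data, mimicking the failing case:
# one table on page 1 and two separate TABLE blocks on page 2.
sample = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "TABLE", "Page": 1},
        {"BlockType": "PAGE", "Page": 2},
        {"BlockType": "TABLE", "Page": 2},
        {"BlockType": "TABLE", "Page": 2},
    ]
}

per_page = count_tables_per_page(sample)
total = sum(per_page.values())
print(per_page, total)
```

Calling `count_tables_per_page(json_data)` on the real output and comparing the total against the expected two tables would make the failure easy to assert in a test.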

test_textract_tables.pdf

Dec 15 '21 21:12 jshipway

Adding another test case. principal.pdf

Dec 16 '21 18:12 jshipway

@jshipway is this issue resolved now?

Jun 14 '22 11:06 tb102122
