amazon-textract-response-parser
Multi-page tables are not merged correctly when a PDF contains more than one table spanning pages
The code sample below correctly merges a single table that spans multiple pages, but we cannot get it to fully work when a document contains several tables that each span multiple pages. The first such table is merged correctly; subsequent tables that span pages are not merged. Here is the code we are using for the test, taken from https://github.com/aws-samples/amazon-textract-multipage-tables-processing:
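import trp.trp2 as t2
from textractcaller.t_call import call_textract, Textract_Features
from trp.t_pipeline import pipeline_merge_tables
from trp.t_tables import MergeOptions, HeaderFooterType
textract_json = call_textract(input_document=s3_uri_of_documents, features=[Textract_Features.TABLES], boto3_textract_client=textract_client)
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NONE)
json_data = t2.TDocumentSchema().dump(t_document)
# PrettyPrintTables is assumed to be the helper defined in the sample repository linked above
PrettyPrintTables(json_data)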
To reproduce this, create a PDF with two tables (I have attached a test PDF document to this issue): the first table spans pages one and two, and the second spans pages two and three. In our test the first table is merged correctly, but the second is not, so the final result is three tables rather than two.
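One way to check the outcome without reading the printed tables by eye is to count the TABLE blocks left in the dumped JSON. The following is only a minimal sketch: `count_tables_per_page` is a hypothetical helper (not part of amazon-textract-response-parser), it assumes the dumped document keeps the usual Textract layout of a top-level `Blocks` list whose entries carry `BlockType` and `Page`, and the `sample` dictionary is made-up data that mimics the incorrect result described above.

```python
from collections import Counter


def count_tables_per_page(textract_json: dict) -> Counter:
    """Count TABLE blocks per page in a Textract-style response dict.

    Hypothetical helper for illustration; it only inspects the
    "Blocks" list and does not call any library.
    """
    counts = Counter()
    for block in textract_json.get("Blocks", []):
        if block.get("BlockType") == "TABLE":
            # Single-page responses may omit "Page", so default to page 1.
            counts[block.get("Page", 1)] += 1
    return counts


# Made-up miniature response imitating the failure reported here:
# the first table was merged onto page 1, while both halves of the
# second table (pages 2 and 3) remain as separate TABLE blocks.
sample = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1, "Id": "p1"},
        {"BlockType": "TABLE", "Page": 1, "Id": "t1"},
        {"BlockType": "TABLE", "Page": 2, "Id": "t2"},
        {"BlockType": "TABLE", "Page": 3, "Id": "t3"},
    ]
}

per_page = count_tables_per_page(sample)
total_tables = sum(per_page.values())
print(dict(per_page), total_tables)
```

In the real test, `json_data` from the snippet above would be passed in place of `sample`; a fully merged document would be expected to report two tables in total.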
Adding another test case: principal.pdf
@jshipway is this issue resolved now?