amazon-textract-response-parser Not able to extract Textract merge cell text properly

I am not able to extract merged cell text properly. There seems to be an issue with the combine_headers function: the top header text is not extracted properly from the Textract output.

Reference:

t_doc = TDocumentSchema().load(textract_json)
ordered_doc = order_blocks_by_geo(t_doc)
trp_doc = Document(TDocumentSchema().dump(ordered_doc))

Now let’s iterate through the tables’ content, and extract the data into a DataFrame:

table_index = 1
dataframes = []

def combine_headers(top_h, bottom_h):
    bottom_h[3] = top_h[2] + " " + bottom_h[3]
    bottom_h[4] = top_h[2] + " " + bottom_h[4]

for page in trp_doc.pages:
    for table in page.tables:
        table_data = []
        headers = table.get_header_field_names()  # New Table method to retrieve header column names
        if len(headers) > 0:  # Let's retain the only table with headers
            print("Statememt headers: " + repr(headers))
            top_header = headers[0]
            bottom_header = headers[1]
            combine_headers(top_header, bottom_header)  # The statement has two headers. Let's combine them
            for r, row in enumerate(table.rows_without_header):  # New Table attribute returning rows without headers
                table_data.append([])
                for c, cell in enumerate(row.cells):
                    table_data[r].append(cell.mergedText)  # New Cell attribute returning merged cells common values
            if len(table_data) > 0:
                df = pd.DataFrame(table_data, columns=bottom_header)

Document table format:

With the above logic:

With a small change to combine_headers, my issue was solved to some extent:

def combine_headers(top_h, bottom_h):
    # Prefix each bottom header with its top header, unless the two are identical
    for i in range(len(top_h)):
        if bottom_h[i] != top_h[i]:
            bottom_h[i] = top_h[i] + ' ' + bottom_h[i]

But there is still some issue with Textract's top header detection.
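
If the missing top header text comes from a merged cell whose text is reported only in the first column of its span, one possible workaround is to carry the last non-empty top header forward before combining. The following is a minimal sketch under that assumption; it is not part of amazon-textract-response-parser. The names fill_merged_headers and combine_headers_safe are hypothetical, the header values are made up for illustration, and forward-filling will mislabel columns that genuinely have no top header.

```python
def fill_merged_headers(top_h):
    """Carry the last non-empty top header forward over blank cells.

    Assumption: a merged top header cell yields its text in the first
    column of the span and empty strings (or None) in the rest.
    """
    filled, last = [], ""
    for text in top_h:
        text = (text or "").strip()
        if text:
            last = text
        filled.append(last)
    return filled


def combine_headers_safe(top_h, bottom_h):
    """Return new column names; the input lists are left unmodified."""
    top_filled = fill_merged_headers(top_h)
    combined = []
    for i, bottom in enumerate(bottom_h):
        bottom = (bottom or "").strip()
        # Tolerate a top row that is shorter than the bottom row
        top = top_filled[i] if i < len(top_filled) else ""
        if top and top != bottom:
            combined.append((top + " " + bottom).strip())
        else:
            combined.append(bottom)
    return combined


# Made-up header rows, shaped like the two lists used in the snippet above
top = ["Date", "Description", "Amount", "", ""]
bottom = ["Date", "Description", "Debit", "Credit", "Balance"]
print(combine_headers_safe(top, bottom))
```

Returning a new list instead of mutating bottom_h keeps the original headers available for debugging, which helps when comparing against what Textract actually returned.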

Jun 30 '22 18:06 sravzmum

We should add an option to pass in a function that can be used instead of the fixed logic.
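
As a rough illustration of what such an option might look like (a hypothetical sketch only, not the library's current API): the caller passes a function that receives the top and bottom header rows and returns the final column names. The names resolve_columns, default_combiner and prefix_with_top are assumptions made for this example, and headers is assumed to be a list of header rows like the one returned by get_header_field_names() in the snippet above.

```python
from typing import Callable, List, Optional

# A combiner takes (top_row, bottom_row) and returns the final column names
HeaderCombiner = Callable[[List[str], List[str]], List[str]]


def default_combiner(top_h: List[str], bottom_h: List[str]) -> List[str]:
    """Fallback when no custom function is given: keep the bottom row as-is."""
    return list(bottom_h)


def resolve_columns(headers: List[List[str]],
                    combiner: Optional[HeaderCombiner] = None) -> List[str]:
    """Turn zero, one or two header rows into a flat list of column names."""
    if not headers:
        return []
    if len(headers) == 1:
        return list(headers[0])
    combine = combiner or default_combiner
    return combine(headers[0], headers[1])


def prefix_with_top(top_h: List[str], bottom_h: List[str]) -> List[str]:
    """Example custom combiner: prefix each bottom header with its top header."""
    return [(t + " " + b).strip() if t != b else b
            for t, b in zip(top_h, bottom_h)]


# Made-up header rows for illustration
print(resolve_columns([["Amount", "Amount"], ["Debit", "Credit"]], prefix_with_top))
```

With this shape, document-specific header logic stays in user code and the fixed logic becomes just one default among others.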

Jul 11 '22 17:07 schadem

@sravzmum are you able to provide a sample document? I agree with the option; at some stage we could even extend it to accept custom functions for processing.

Jul 11 '22 23:07 tb102122

I am also getting the above issue while merging the top and bottom headers. Some part of the column name in the top header is getting missed in some scenarios. Request your guidance on the same.

Sep 07 '22 17:09 prasum

@prasum Do you have a sample document you can share? Do you get the correct results from Textract in the OCR step?

Sep 07 '22 20:09 tb102122

Sorry for the late reply. I am not able to share the sample document due to internal restrictions. The scenario is the same as in the document shared above by @sravzmum. Yes, I am able to get the correct results from Textract in the OCR step.

Sep 09 '22 17:09 prasum

We would need some sort of example, otherwise we can't help.

Sep 12 '22 00:09 tb102122

When I toggle "merge cells" in AWS Textract, I get a perfect table, but when I download it, or call it through the API and parse the response, the cells are unmerged.

Screenshot from 2024-05-14 18-25-55

May 14 '24 12:05 mukul-llmate
