amazon-textract-textractor issue with extraction, get_text_fromlayout

Issue with extraction using the get_text_from_layout_json function

Open red-sky17 opened this issue 1 year ago • 1 comments

I have attached the part of the PDF that I am trying to extract.

I am running the extraction with:

    textract_json = call_textract(input_document="s3:url", features=[Textract_Features.LAYOUT, Textract_Features.TABLES])
    layout = get_text_from_layout_json(textract_json=data)

The output I am getting is:

I analysed the same document in the Textract console, where it detected two tables and analysed everything clearly.

I was also able to extract the content by loading the JSON into a Document (from textractor.entities.document import Document) and reading document.text, but the extracted tables are not bordered when I do it this way.

I will try to resolve this on my end, but if I am missing something, or if someone is already working on this, I would appreciate any help.

Thank you.

Apr 15 '24 12:04 red-sky17

For markdown bordering you can use the MarkdownLinearizationConfig by calling .to_markdown() on your document object.

https://aws-samples.github.io/amazon-textract-textractor/notebooks/document_linearization_to_markdown_or_html.html#All-entities-can-be-linearized
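A minimal sketch of that suggestion, assuming the Textract response was saved to a local JSON file and that amazon-textract-textractor is installed. The helper name textract_json_to_markdown and the file path in the usage comment are hypothetical; Document.open and .to_markdown() are the textractor calls referred to in this thread.

```python
import json
from pathlib import Path


def textract_json_to_markdown(json_path: str) -> str:
    """Load a saved Textract response and linearize it to Markdown."""
    # Assumption: json_path holds the raw Textract response
    # (for example, the dict returned by call_textract, dumped to disk).
    response = json.loads(Path(json_path).read_text())

    # Imported here so this module can be loaded without textractor installed.
    from textractor.entities.document import Document

    # Document.open accepts the response dict and builds the entity tree.
    document = Document.open(response)

    # to_markdown() uses the Markdown linearization config by default,
    # so tables are emitted as Markdown tables rather than plain text.
    return document.to_markdown()


# Usage (hypothetical path):
# print(textract_json_to_markdown("textract_response.json"))
```

If the response is already in memory as a dict, it can be passed to Document.open directly, skipping the file round trip.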

May 06 '24 14:05 Belval
