amazon-textract-textractor
Issue with extraction using the get_text_from_layout_json function
I have attached the part of the PDF that I am trying to extract.
I am doing the extraction using:
textract_json = call_textract(input_document="s3:url", features=[Textract_Features.LAYOUT, Textract_Features.TABLES])
layout = get_text_from_layout_json(textract_json=textract_json)
The output I am getting is:
I analysed this document in the Textract console, where it detected two tables and analysed everything clearly.
I was also able to extract the content by loading the JSON into a Document (from textractor.entities.document import Document) and reading the result with document.text, but the extracted tables are not bordered when I use this function.
I will try to resolve this from my end, but if I am missing anything, or if anyone is already working on this, I would appreciate any help.
Thank you.
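One quick way to check that the two tables seen in the console are also present in the response is to count the blocks by type. This is only a minimal sketch: it assumes textract_json is a plain dict with a top-level "Blocks" list, and count_block_types is a hypothetical helper name, not part of any library.

```python
from collections import Counter


def count_block_types(textract_json):
    """Count Textract blocks by their BlockType (hypothetical helper).

    Assumes the response is a dict with a top-level "Blocks" list.
    """
    blocks = textract_json.get("Blocks", [])
    return Counter(block.get("BlockType", "UNKNOWN") for block in blocks)


# Tiny hand-made response, only to illustrate the expected shape.
sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "TABLE"},
        {"BlockType": "TABLE"},
        {"BlockType": "LAYOUT_TABLE"},
    ]
}
counts = count_block_types(sample)
```

If counts["TABLE"] matches what the console shows, the tables were detected and the missing borders come from how the text is rendered, not from the analysis itself.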
To get bordered tables in Markdown, call .to_markdown() on your Document object; it linearizes the document using the MarkdownLinearizationConfig.
https://aws-samples.github.io/amazon-textract-textractor/notebooks/document_linearization_to_markdown_or_html.html#All-entities-can-be-linearized
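A minimal sketch of that suggestion, assuming the Textract response is already available as a dict (for example the value returned by call_textract). The helper names load_document and render_markdown are hypothetical; Document.open and .to_markdown() come from textractor, which is imported inside the function so the snippet can be read and loaded without the package installed.

```python
def load_document(textract_json):
    """Build a textractor Document from a Textract response dict."""
    # Imported lazily: requires the amazon-textract-textractor package.
    from textractor.entities.document import Document

    return Document.open(textract_json)


def render_markdown(document):
    """Return the document text as Markdown, with tables as Markdown tables."""
    return document.to_markdown()


# Usage, assuming textract_json holds the response from call_textract:
# document = load_document(textract_json)
# print(render_markdown(document))
```

The default output may still need tuning for a given layout; the linked notebook covers the available linearization options.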