Exporting text+tables while maintaining layout

Open austinmw opened this issue 1 year ago • 1 comments

Suppose I have a document like this:

<text>

<table>

<text>

Where a table is located between two chunks of text, and I'd like to parse the document and save the parsed information, in order, to a text file.

If I use the document_analysis functionality, I can successfully extract the text and tables, and print them separately:

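from textractor import Textractor
from textractor.data.constants import TextractFeatures

# `extractor` is a Textractor instance; LOCAL_DOCUMENT_PATH and S3_UPLOAD_PATH are defined elsewhere.
document = extractor.start_document_analysis(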
    file_source=LOCAL_DOCUMENT_PATH,
    s3_upload_path=S3_UPLOAD_PATH,
    features=[TextractFeatures.LAYOUT, TextractFeatures.SIGNATURES, TextractFeatures.FORMS],
    save_image=True
)

print(document.text)
print(document.tables)

However, this loses information about the layout (i.e., that in my example, the table is in between two pieces of text).

So how can I print the parsed text+tables in order? As in something like:

print(document.text_and_tables)

Is there any convenience functionality in this library to do this?

austinmw avatar Apr 03 '24 21:04 austinmw

print(document.get_text()) gets you the text and tables as plain text, in the order they appear in the document. If you want the tables as CSV within the text, you would have to tag the tables using the linearization config and replace them with their CSV counterparts obtained from document.tables. This notebook helps: https://github.com/aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation/blob/main/textract-api.ipynb
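A minimal sketch of the "tag and replace" step, kept independent of AWS so it can be run on its own. It assumes the linearized text already has each table wrapped in marker strings, and that you have one CSV string per table in document order. The markers "<table>" and "</table>", and the helper name replace_tagged_tables, are made up for illustration. The commented lines at the bottom show how the inputs might be produced with Textractor; treat those calls (TextLinearizationConfig with table_prefix/table_suffix, get_text(config=...), Table.to_csv()) as assumptions to check against the library version you have installed.

```python
import re

# Hypothetical marker strings; any pair that cannot occur in the document text works.
TABLE_START = "<table>"
TABLE_END = "</table>"


def replace_tagged_tables(linearized_text, table_csvs,
                          start=TABLE_START, end=TABLE_END):
    """Swap each tagged table region for the matching CSV string, in order.

    The n-th tagged region is replaced by the n-th entry of table_csvs.
    If there are more tagged regions than CSV strings, the extra regions
    are left unchanged.
    """
    # Non-greedy match so that each start/end pair is handled separately.
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    csv_iter = iter(table_csvs)

    def _swap(match):
        # Fall back to the original tagged text when the CSV list runs out.
        return next(csv_iter, match.group(0))

    return pattern.sub(_swap, linearized_text)


# Assumed usage with Textractor (not executed here, verify against your version):
#
#   from textractor.data.text_linearization_config import TextLinearizationConfig
#   config = TextLinearizationConfig(table_prefix=TABLE_START, table_suffix=TABLE_END)
#   tagged = document.get_text(config=config)
#   csvs = [table.to_csv() for table in document.tables]
#   print(replace_tagged_tables(tagged, csvs))
```

This relies on the tagged regions and document.tables being in the same order, which is worth checking on a multi-table document before trusting the output.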

ucegbe avatar Apr 10 '24 07:04 ucegbe