How can I order the results as shown in the pdf?

Open robertdac opened this issue 3 years ago • 2 comments

PDF

Screenshot from 2022-05-05 15-38-43

Example: python3 textractor.py --documents s3://mybucket/mydoc.pdf --forms

Result:

62692bb61ab53-pdf-page-1-forms.csv

Screenshot from 2022-05-05 16-00-35

How can I order the results this way?

Screenshot from 2022-05-05 16-04-33

robertdac avatar May 05 '22 14:05 robertdac

@robertdac What kind of ordering would you like to apply? Do you need to use the command line, or can you also write your own Python code?

Cheers, Tobias

tb102122 avatar May 05 '22 21:05 tb102122

Did you check the output described on the page https://github.com/aws-samples/amazon-textract-textractor? It lists: "document-page-n-text-inreadingorder.txt: Detected text in reading order (multi-column) for each page in the document."

tb102122 avatar May 05 '22 22:05 tb102122

Hi @tb102122, I'm facing a similar issue with key-value pairs exported as CSV using a Python script.

The checkboxes and key-value pairs do not appear to be in any specific order. As there are quite a few duplicate checkboxes (e.g. Yes/No), I was hoping to order them left to right, top to bottom, if possible.

    document = extractor.start_document_analysis(
        file_source=("Application Form trimmed.pdf"),
        features=[TextractFeatures.FORMS],
        s3_upload_path="s3://textractbucket2/"
    )
    document.export_kv_to_csv(
        include_kv=True,
        include_checkboxes=True,
        filepath="async_kv.csv"
    )

syley avatar Oct 19 '22 22:10 syley

@syley That example should help you, but your case sounds a bit more complex. https://github.com/aws-samples/amazon-textract-textractor/blob/master/tpipelinegeofinder/geofinder-sample-notebook.ipynb

tb102122 avatar Oct 20 '22 02:10 tb102122
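
For the left-to-right, top-to-bottom ordering asked about above, one possible approach is to sort the extracted fields by their position on the page before exporting them. Textract reports each block's position as a normalized bounding box with Top and Left values between 0 and 1. The sketch below is only an illustration under that assumption: the `Field` class, the `sort_reading_order` helper, the `row_tolerance` parameter, and the sample data are all hypothetical and not part of Textractor, and reading the coordinates off the library's own key-value objects is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Field:
    """Hypothetical stand-in for one extracted key-value pair or checkbox."""
    key: str
    value: str
    top: float   # normalized distance from the top of the page (0..1)
    left: float  # normalized distance from the left of the page (0..1)


def sort_reading_order(fields: List[Field], row_tolerance: float = 0.01) -> List[Field]:
    """Order fields top to bottom, then left to right within each row.

    A field joins the current row when its `top` is within `row_tolerance`
    of the first field in that row; otherwise it starts a new row.
    """
    rows: List[List[Field]] = []
    for item in sorted(fields, key=lambda f: f.top):
        if rows and abs(item.top - rows[-1][0].top) <= row_tolerance:
            rows[-1].append(item)
        else:
            rows.append([item])

    ordered: List[Field] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda f: f.left))
    return ordered


# Made-up sample data: two rows of a form, deliberately out of order.
fields = [
    Field("No", "NOT_SELECTED", top=0.305, left=0.60),
    Field("Name", "Jane Doe", top=0.100, left=0.10),
    Field("Yes", "SELECTED", top=0.300, left=0.40),
    Field("Date", "2022-10-19", top=0.102, left=0.55),
]

ordered = sort_reading_order(fields)
for item in ordered:
    print(item.key, item.value)
```

Grouping into rows with a tolerance, rather than sorting on (top, left) directly, keeps fields on the same printed line together even when their boxes differ by a fraction of a percent in height. The tolerance value is a guess and would likely need tuning per document.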