amazon-textract-textractor

Textract missing most of the text in documents.

Open geoffwalmsley opened this issue 4 years ago • 1 comments

I am processing some fairly simple PDFs from S3 using Textract document text detection. For most of these documents, the returned JSON contains very little text. For example, the PDF located here returns only 778 words and 127 lines from a 96-page file. The PDF is also very simple in structure, so I'm confused about why Textract is less effective than something like PyPDF2.

Please let me know if there is something I'm missing here. Also, any data on the types of documents that are suitable or not suitable for Textract would be helpful.

Many thanks.

geoffwalmsley avatar Aug 09 '20 07:08 geoffwalmsley

I just ran the document you linked, an 80-page 'Infrastructure Funding and Financing Bill', through Textract and got 31856 words identified, which seems to cover the text in the document.

You mention a 96-page document, so maybe we are looking at different files.
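One thing worth checking when counts look low: for multi-page PDFs, the asynchronous API returns results in pages, and each call to `get_document_text_detection` carries a `NextToken` until the last page is reached. Whether that explains the numbers in this issue is not known. The following is only a minimal sketch of tallying WORD and LINE blocks across all result pages, assuming a boto3 Textract client and an already finished job. The helper names `fetch_all_pages` and `count_blocks` are made up for illustration and are not part of any library.

```python
from collections import Counter


def fetch_all_pages(client, job_id):
    """Yield every response page of a finished text-detection job.

    `client` is assumed to be a boto3 Textract client (or anything with a
    compatible get_document_text_detection method).
    """
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_text_detection(**kwargs)
        yield response
        # A missing NextToken means this was the last page of results.
        next_token = response.get("NextToken")
        if not next_token:
            break


def count_blocks(responses):
    """Count blocks by BlockType (PAGE, LINE, WORD, ...) over all pages."""
    counts = Counter()
    for response in responses:
        for block in response.get("Blocks", []):
            counts[block["BlockType"]] += 1
    return counts


# Hypothetical usage (bucket and key are placeholders, job polling omitted):
#   import boto3
#   client = boto3.client("textract")
#   job = client.start_document_text_detection(
#       DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "doc.pdf"}}
#   )
#   ... wait until the job status is SUCCEEDED ...
#   counts = count_blocks(fetch_all_pages(client, job["JobId"]))
#   print(counts["WORD"], counts["LINE"])
```

If only the first response is inspected, the totals cover just the first batch of blocks, so comparing `counts["PAGE"]` with the known page count of the PDF is a quick sanity check.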

Compared to PyPDF2, Textract not only allows for PDF input, but also images in the JPEG or PNG format.

Regarding the documents that are suitable/not suitable, let me point you to the best practices guide: https://docs.aws.amazon.com/textract/latest/dg/textract-best-practices.html

Hope this helps.

schadem avatar Dec 09 '20 23:12 schadem

Closing for inactivity.

Belval avatar Mar 08 '24 13:03 Belval