amazon-textract-textractor
Textract missing most of the text in documents.
I am processing some fairly simple PDFs from S3 using Textract document text detection. For most of these documents, the returned JSON contains very little text. For example, the PDF located here returns only 778 words and 127 lines from 96 pages. The PDF is very simple in structure, so I'm confused about why Textract is less effective than something like PyPDF2.
Is there something I'm missing here? Also, any data on the types of documents that are suitable or not suitable for Textract would be helpful.
Many thanks.
I just ran the document you linked, an 80-page 'Infrastructure Funding and Financing Bill', through Textract and got 31,856 words identified, which seems to cover the text in the document.
You mention a 96-page document, so we may be looking at different files.
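To make counts like these easier to compare, here is a minimal sketch of tallying blocks by type from Textract JSON. It assumes the standard response shape, where each response has a "Blocks" list and every block has a "BlockType" such as PAGE, LINE, or WORD. The function takes a list of responses because asynchronous results can be split across several of them. The name count_blocks and the sample data are hypothetical and not part of any AWS library.

```python
from collections import Counter


def count_blocks(responses):
    """Tally Textract blocks by BlockType across one or more response dicts.

    `responses` is a list because a job's results may arrive as several
    response pages; counting only the first one undercounts the document.
    """
    counts = Counter()
    for response in responses:
        for block in response.get("Blocks", []):
            counts[block.get("BlockType", "UNKNOWN")] += 1
    return counts


# Hypothetical, hand-written sample standing in for real Textract output.
sample_responses = [
    {
        "Blocks": [
            {"BlockType": "PAGE"},
            {"BlockType": "LINE", "Text": "Hello world"},
            {"BlockType": "WORD", "Text": "Hello"},
            {"BlockType": "WORD", "Text": "world"},
        ]
    },
    {
        "Blocks": [
            {"BlockType": "LINE", "Text": "Second"},
            {"BlockType": "WORD", "Text": "Second"},
        ]
    },
]

counts = count_blocks(sample_responses)
print(counts["PAGE"], counts["LINE"], counts["WORD"])
```

If the WORD count from your own JSON is far below what the document contains, it may be worth checking whether every response page for the job was saved before counting.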
Compared to PyPDF2, Textract accepts not only PDF input but also images in JPEG or PNG format.
Regarding the documents that are suitable/not suitable, let me point you to the best practices guide: https://docs.aws.amazon.com/textract/latest/dg/textract-best-practices.html
Hope this helps.
Closing for inactivity.