amazon-textract-textractor Issue with multipage PDFs on s3 without extension

Open lvieirajr opened this issue 1 year ago • 2 comments

Hello, first of all thanks for the awesome package.

I am currently having an issue running Textractor on PDFs that are stored in S3. For security and other reasons (which I think are pretty common practice at larger enterprises), all my files are stored under UUIDs instead of their actual filenames, so the S3 keys have no file extension. As a result, when call_textract is called, it goes through the entire process without hitting any of the if statements and just returns an empty dict.

Is there any way that maybe this use case could be supported?

Feb 14 '24 05:02 lvieirajr

Makes sense. I'll add a flag to force a specific mime type. @lvieirajr

Feb 14 '24 09:02 schadem

Published as part of 0.2.2 for the caller. Assigning this to @Belval to add this ability to Textractor as well.

Feb 14 '24 10:02 schadem
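
For readers landing on this thread, here is a minimal sketch of the behaviour described above: an extension-based lookup finds nothing for a UUID key, and an explicit override fixes it. This is not the library's code. `EXTENSION_TO_MIME`, `resolve_mime_type`, and `forced_mime_type` are hypothetical names used only for illustration; check the 0.2.2 release of the caller for the actual name of the flag.

```python
import os
from typing import Optional

# Hypothetical mapping for illustration; not taken from the library source.
EXTENSION_TO_MIME = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def resolve_mime_type(s3_key: str, forced_mime_type: Optional[str] = None) -> Optional[str]:
    """Return the MIME type to assume for an S3 object key.

    An explicit forced_mime_type always wins. Otherwise the type is guessed
    from the key's extension, and None is returned when there is no known
    extension (which is the case for a bare UUID key).
    """
    if forced_mime_type is not None:
        return forced_mime_type
    _, extension = os.path.splitext(s3_key)
    return EXTENSION_TO_MIME.get(extension.lower())


# A key with an extension resolves as expected.
print(resolve_mime_type("invoices/report.pdf"))

# A UUID key has no extension, so nothing matches; code that branches on
# the detected type would then fall through every branch.
uuid_key = "invoices/3f2b8c1e-9d4a-4e7b-a1c2-5f6d7e8a9b0c"
print(resolve_mime_type(uuid_key))

# Forcing the type sidesteps the extension check entirely.
print(resolve_mime_type(uuid_key, forced_mime_type="application/pdf"))
```

Until the override is available in the version you use, another possible workaround is to keep the UUID object names but copy or serve the objects under a key that ends in `.pdf`.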
