textractcaller - Allow local file path input for PDF

Open grantrosse opened this issue 1 year ago • 0 comments

Calling call_textract with a local PDF file path raises a misleading error:

file_name = './Invoice_INV300351.pdf'
client = boto3.client('textract', 'us-east-1')
response = call_textract(input_document=file_name, boto3_textract_client=client)

Error:

Traceback (most recent call last):
  File "textractor.py", line 164, in <module>
    Textractor().run()
  File "textractor.py", line 143, in run
    self.processDocument(ips, i, document)
  File "textractor.py", line 98, in processDocument
    dp = DocumentProcessor(ips["bucketName"], document, ips["awsRegion"], ips["text"], ips["forms"], ips["tables"])
  File "/mnt/c/Users/Username/Documents/textractor/tdp.py", line 218, in __init__
    raise Exception("PDF must be in S3 bucket.")
Exception: PDF must be in S3 bucket.

However, passing the same file to the function as a bytes object works:

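import boto3
from textractcaller.t_call import call_textract

file_name = './Invoice_INV300351.pdf'
client = boto3.client('textract', 'us-east-1')
# Read the local PDF into memory and pass the bytes instead of the path.
with open(file_name, "rb") as sample_file:
    b = bytearray(sample_file.read())
response = call_textract(input_document=b, boto3_textract_client=client)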
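
Until the library accepts local PDF paths directly, the workaround above could be wrapped in a small helper. This is only a sketch: `to_textract_input` is a hypothetical name, not part of textractcaller, and it assumes that anything not starting with `s3://` is a local path.

```python
from pathlib import Path
from typing import Union

InputDocument = Union[str, bytes, bytearray]


def to_textract_input(input_document: InputDocument) -> InputDocument:
    """Hypothetical helper: turn a local PDF path into bytes, pass everything else through."""
    # Bytes are already in the form that works in the example above.
    if isinstance(input_document, (bytes, bytearray)):
        return input_document
    # S3 locations are left for the library to handle.
    if input_document.lower().startswith("s3://"):
        return input_document
    # Assumption: any other string is a local path; only PDFs need converting.
    path = Path(input_document)
    if path.suffix.lower() == ".pdf":
        return bytearray(path.read_bytes())
    return input_document
```

It would be used as `call_textract(input_document=to_textract_input(file_name), boto3_textract_client=client)`.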

https://github.com/aws-samples/amazon-textract-textractor/blob/master/caller/textractcaller/t_call.py

It looks like a quick fix might be to remove this check:

is_pdf: bool = (ext != None and ext.lower() in only_async_suffixes)
if is_pdf and not is_s3_document:
     raise ValueError("PDF only supported when located on S3")

grantrosse avatar Aug 04 '23 19:08 grantrosse