amazon-textract-textractor
textractcaller - Allow local file path input for PDF
Calling call_textract with a local PDF file path yields a misleading error:
file_name = './Invoice_INV300351.pdf'
response = call_textract(input_document=file_name, boto3_textract_client=client)
Error:
Traceback (most recent call last):
  File "textractor.py", line 164, in <module>
    Textractor().run()
  File "textractor.py", line 143, in run
    self.processDocument(ips, i, document)
  File "textractor.py", line 98, in processDocument
    dp = DocumentProcessor(ips["bucketName"], document, ips["awsRegion"], ips["text"], ips["forms"], ips["tables"])
  File "/mnt/c/Users/Username/Documents/textractor/tdp.py", line 218, in __init__
    raise Exception("PDF must be in S3 bucket.")
Exception: PDF must be in S3 bucket.
However, you can successfully pass the same file to the function as a bytes object:
file_name = './Invoice_INV300351.pdf'
client = boto3.client('textract', 'us-east-1')
with open(file_name, "rb") as sample_file:
    b = bytearray(sample_file.read())
response = call_textract(input_document=b, boto3_textract_client=client)
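Until file paths are accepted directly, the workaround above can be wrapped in a small helper. This is only a minimal sketch: read_local_document is a hypothetical name and not part of textractcaller, and the commented-out usage assumes call_textract is imported from the t_call module linked in this issue.

```python
from pathlib import Path


def read_local_document(path):
    """Return the contents of a local file as a bytearray.

    Hypothetical helper for illustration; not part of textractcaller.
    """
    return bytearray(Path(path).read_bytes())


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# from textractcaller.t_call import call_textract
# client = boto3.client('textract', 'us-east-1')
# response = call_textract(
#     input_document=read_local_document('./Invoice_INV300351.pdf'),
#     boto3_textract_client=client,
# )
```

The helper only moves the file read out of the calling code. It does not change what Textract accepts for documents sent as bytes.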
https://github.com/aws-samples/amazon-textract-textractor/blob/master/caller/textractcaller/t_call.py
It looks like the quick fix might be to remove this check:
is_pdf: bool = (ext != None and ext.lower() in only_async_suffixes)
if is_pdf and not is_s3_document:
    raise ValueError("PDF only supported when located on S3")
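As a possible alternative to deleting the check outright, the library could read a local PDF into bytes at that point instead of raising. The sketch below is a standalone illustration, not the actual t_call.py code: prepare_input and ONLY_ASYNC_SUFFIXES are hypothetical names, the suffix list is an assumed value, and the s3:// prefix test is a simplification of however the library really detects S3 documents.

```python
import os

# Assumed value for illustration; the real list is defined in t_call.py.
ONLY_ASYNC_SUFFIXES = ['.pdf']


def prepare_input(input_document):
    """Hypothetical stand-in for the check quoted above.

    Returns the document unchanged unless it is a local PDF path,
    in which case the file contents are returned as a bytearray.
    """
    # bytes / bytearray inputs pass through untouched
    if not isinstance(input_document, str):
        return input_document
    # S3 locations are left for the caller to handle as before
    if input_document.lower().startswith('s3://'):
        return input_document
    _, ext = os.path.splitext(input_document)
    if ext.lower() in ONLY_ASYNC_SUFFIXES:
        # Local PDF: load the bytes instead of raising ValueError
        with open(input_document, 'rb') as f:
            return bytearray(f.read())
    return input_document
```

This keeps S3 inputs and byte inputs on their existing paths and only changes the local-PDF case, which is the one that currently fails.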