s3-ocr
Tools for running OCR against files stored in S3
Using these (more expensive) APIs: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.get_document_analysis
First of all, I love this, thanks for creating s3-ocr! Everything works as it should, following the instructions in your TIL https://simonwillison.net/2022/Jun/30/s3-ocr/ ...except that the very first PDF file I...
I just noticed that once you get down to the `WORD` blocks in the Textract output you see stuff like this: ```json { "BlockType": "WORD", "ColumnIndex": null, "ColumnSpan": null, "Confidence": 99.53694915771484,...
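As an illustration of what those per-word fields make possible, here is a minimal sketch that picks out `WORD` blocks whose `Confidence` falls below a threshold. The function name `low_confidence_words`, the threshold of 90, and the sample blocks are all hypothetical; only the `BlockType`, `Confidence` and `Text` keys come from the Textract block format.

```python
from typing import Any, Dict, Iterable, List, Tuple


def low_confidence_words(
    blocks: Iterable[Dict[str, Any]], threshold: float = 90.0
) -> List[Tuple[str, float]]:
    """Return (text, confidence) pairs for WORD blocks scoring below threshold.

    The threshold is an arbitrary assumption, not something s3-ocr defines.
    """
    results = []
    for block in blocks:
        # Skip PAGE, LINE and any other non-WORD block types
        if block.get("BlockType") != "WORD":
            continue
        confidence = block.get("Confidence")
        # Confidence may be missing or null, so guard before comparing
        if confidence is not None and confidence < threshold:
            results.append((block.get("Text", ""), confidence))
    return results


# Hypothetical sample data, shaped like Textract blocks
sample_blocks = [
    {"BlockType": "LINE", "Confidence": 99.1, "Text": "Hello world"},
    {"BlockType": "WORD", "Confidence": 99.53694915771484, "Text": "Hello"},
    {"BlockType": "WORD", "Confidence": 42.5, "Text": "w0rld"},
]

for text, confidence in low_confidence_words(sample_blocks):
    print(f"{text}\t{confidence:.1f}")
```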
The tool only handles PDFs right now, but AWS Textract can handle other formats (including regular images).
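One way this could look, as a rough sketch: widen the key filter from `.pdf` to the other file types Textract accepts. The suffix set and the helper name `is_ocr_candidate` are assumptions for illustration, not s3-ocr's actual code.

```python
from pathlib import PurePosixPath

# Assumed set of suffixes for formats Textract accepts (PDF, PNG, JPEG, TIFF)
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def is_ocr_candidate(key: str) -> bool:
    """Return True if an S3 key has a suffix in SUPPORTED_SUFFIXES (case-insensitive)."""
    return PurePosixPath(key).suffix.lower() in SUPPORTED_SUFFIXES
```

A key such as `scans/page.PNG` would pass this check, while `notes.txt` or a `.s3-ocr.json` file would not.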
This is actually quite difficult. It turns out the `textract-output/JOB_ID` folder is created, empty, early on in the process. Then files called `1` and `2` and so on are added to...
Might be less messy than scattering those `.s3-ocr.json` files all over the place. Would also let me fetch all of the files in one go with a prefix fetch against `/s3-ocr/`.
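A prefix fetch like that could be sketched with the boto3 `list_objects_v2` paginator. The function name `iter_keys` is hypothetical, and the default prefix `s3-ocr/` (without a leading slash, since S3 keys do not normally start with one) is an assumption. The client is passed in rather than created here so the sketch does not depend on credentials.

```python
from typing import Any, Iterator


def iter_keys(s3_client: Any, bucket: str, prefix: str = "s3-ocr/") -> Iterator[str]:
    """Yield every object key in the bucket that starts with the given prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    # Each page holds up to 1,000 objects; the paginator follows continuation tokens
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no matching objects
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

With boto3 installed and credentials configured, this might be called as `list(iter_keys(boto3.client("s3"), "my-bucket"))`, where `my-bucket` is a placeholder name.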
Would still require a bucket since PDFs through Textract need to go through a bucket. Maybe have an option to block and poll for completion? Default operation can be to...
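A block-and-poll option might look roughly like the sketch below. It assumes the job was started with `start_document_analysis`, so that `get_document_analysis` is the matching status call; a job started with `start_document_text_detection` would poll `get_document_text_detection` instead. The name `wait_for_job`, the poll interval and the poll limit are all hypothetical.

```python
import time
from typing import Any, Callable


def wait_for_job(
    textract_client: Any,
    job_id: str,
    poll_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Block until the Textract job leaves IN_PROGRESS, then return its final status.

    Raises TimeoutError if the job is still running after max_polls checks.
    """
    for _ in range(max_polls):
        response = textract_client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        # Any status other than IN_PROGRESS (e.g. SUCCEEDED, FAILED) is terminal
        if status != "IN_PROGRESS":
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} still in progress after {max_polls} polls")
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting; a caller would normally leave it at the default.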
Right now they both output a blank successful result, which is wrong.