s3-ocr issues

Options to do table, form and query extraction using get_document_analysis

7

Using these (more expensive) APIs: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.get_document_analysis

simonw

enhancement
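
A minimal sketch of how a request for these analysis APIs might be assembled. The helper name `build_analysis_request` and its validation rules are hypothetical and not part of s3-ocr; the dictionary keys follow the documented parameters of boto3's `start_document_analysis`.

```python
# Feature types covered by this issue (tables, forms, queries).
VALID_FEATURES = {"TABLES", "FORMS", "QUERIES"}


def build_analysis_request(bucket, key, features, queries=None):
    """Build keyword arguments for Textract's start_document_analysis.

    Hypothetical helper: s3-ocr does not currently expose this.
    """
    features = [feature.upper() for feature in features]
    unknown = set(features) - VALID_FEATURES
    if unknown:
        raise ValueError(f"Unsupported feature types: {sorted(unknown)}")
    if "QUERIES" in features and not queries:
        # Textract requires a QueriesConfig when QUERIES is requested.
        raise ValueError("QUERIES requires at least one query")
    request = {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": features,
    }
    if "QUERIES" in features:
        request["QueriesConfig"] = {"Queries": [{"Text": q} for q in queries]}
    return request
```

The resulting dictionary could be passed as `boto3.client("textract").start_document_analysis(**request)`, and the returned `JobId` then used with `get_document_analysis`.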

Not all pages are ocr'd, but Textract claims otherwise

3

First of all, I love this, thanks for creating s3-ocr! Everything works as it should, following the instructions in your TIL https://simonwillison.net/2022/Jun/30/s3-ocr/ ...except that the very first PDF file I...

captnswing

help wanted

Expose difference between HANDWRITING and PRINTED and so on

1

I just noticed that once you get down to the `WORD` blocks in the Textract output you see stuff like this: ```json { "BlockType": "WORD", "ColumnIndex": null, "ColumnSpan": null, "Confidence": 99.53694915771484,...

simonw

enhancement
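
One way this distinction could be surfaced, sketched under the assumption that each `WORD` block carries Textract's `TextType` field (`HANDWRITING` or `PRINTED`). The function name `count_text_types` and the sample blocks are illustrative only.

```python
from collections import Counter


def count_text_types(blocks):
    """Count WORD blocks by their TextType value.

    Blocks without a TextType are grouped under "UNKNOWN".
    """
    return Counter(
        block.get("TextType") or "UNKNOWN"
        for block in blocks
        if block.get("BlockType") == "WORD"
    )


# Made-up sample data shaped like Textract blocks.
sample_blocks = [
    {"BlockType": "WORD", "Text": "Invoice", "TextType": "PRINTED"},
    {"BlockType": "WORD", "Text": "paid", "TextType": "HANDWRITING"},
    {"BlockType": "WORD", "Text": "Total", "TextType": "PRINTED"},
    {"BlockType": "LINE", "Text": "Invoice Total"},
]
counts = count_text_types(sample_blocks)
```

A summary like this could feed a per-document report of how much text was handwritten versus printed.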

Support files other than PDFs

1

The tool only handles PDFs right now, but AWS Textract can handle other formats (including regular images).

simonw

enhancement
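
A rough sketch of how bucket keys might be filtered by extension if other formats were supported. The suffix list assumes the formats Textract's asynchronous operations accept (PDF, PNG, JPEG, TIFF) and should be checked against the current Textract documentation; `is_supported_key` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Assumed set of file suffixes Textract can process asynchronously.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def is_supported_key(key):
    """Return True if the S3 key has a suffix Textract is assumed to accept."""
    return PurePosixPath(key).suffix.lower() in SUPPORTED_SUFFIXES
```

Matching on the lowercased suffix means keys such as `scans/page.PNG` are accepted, while sidecar files ending in `.s3-ocr.json` are skipped.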

status command should show if OCR has completed

2

This is actually quite difficult. It turns out the `textract-output/JOB_ID` folder is created, empty, early on in the process. Then files called `1` and `2` and so on are added to...

simonw

enhancement

Consider using /s3-ocr/key instead of key.s3-ocr.json

Might be less messy than scattering those `.s3-ocr.json` files all over the place. Would also let me fetch all of the files in one go with a prefix fetch against `/s3-ocr/`.

simonw

enhancement
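
The two layouts can be compared with a small sketch. The function names are hypothetical, and the prefix is written as `s3-ocr/` without a leading slash, which is an assumption about how the proposal would map onto S3 key names.

```python
SIDECAR_PREFIX = "s3-ocr/"  # assumed form of the proposed prefix


def sidecar_key_current(key):
    """Current layout: the sidecar sits next to the source file."""
    return f"{key}.s3-ocr.json"


def sidecar_key_proposed(key):
    """Proposed layout: every sidecar lives under one shared prefix."""
    return f"{SIDECAR_PREFIX}{key}"


def source_key_from_sidecar(sidecar_key):
    """Recover the source key from a sidecar key in the proposed layout."""
    if not sidecar_key.startswith(SIDECAR_PREFIX):
        raise ValueError(f"Not a sidecar key: {sidecar_key}")
    return sidecar_key[len(SIDECAR_PREFIX):]
```

With the proposed layout, a single `list_objects_v2(Bucket=..., Prefix="s3-ocr/")` call (paginated as needed) would return every sidecar without scanning the rest of the bucket.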

s3-ocr file command to process a single PDF

1

Would still require a bucket since PDFs through Textract need to go through a bucket. Maybe it has an option to block and poll for completion? Default operation can be to...

simonw

enhancement

Running fetch and text against jobs that have not yet completed should show an error

Right now they both output a blank successful result, which is wrong.

simonw

bug
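
A minimal sketch of the kind of check that could produce such an error, assuming the commands have access to a Textract response dictionary with a `JobStatus` field (as returned by `get_document_text_detection`). `JobNotCompleteError` and `ensure_job_succeeded` are hypothetical names, and how s3-ocr would actually detect an incomplete job is not specified here.

```python
class JobNotCompleteError(Exception):
    """Raised when results are requested for a job that has not succeeded."""


def ensure_job_succeeded(response):
    """Return the response unchanged if the job succeeded, else raise."""
    status = response.get("JobStatus")
    if status != "SUCCEEDED":
        raise JobNotCompleteError(
            f"Textract job is not complete (status: {status})"
        )
    return response
```

Calling this before writing any output would turn the blank successful result into an explicit failure for jobs still reporting `IN_PROGRESS` or `FAILED`.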
