paper-qa Any thoughts on OCR for older papers? (image-only)

Open sgbaird opened this issue 2 years ago • 7 comments

EDIT: a related OCR/NLP avenue

Feb 27 '23 23:02 sgbaird

Go for it - https://unstructured-io.github.io/unstructured/bricks.html#partition-pdf
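
For anyone who wants to try this, a minimal sketch of what calling `partition_pdf` might look like. It assumes a recent `unstructured` release in which `partition_pdf` accepts a `strategy` argument (with `"ocr_only"` forcing OCR, which also needs Tesseract installed). The helper names `elements_to_text` and `ocr_pdf` and the file name are made up for illustration.

```python
def elements_to_text(elements):
    """Join the text of partitioned elements, skipping empty ones."""
    texts = (str(getattr(el, "text", "")).strip() for el in elements)
    return "\n\n".join(t for t in texts if t)


def ocr_pdf(path, strategy="ocr_only"):
    """OCR an image-only PDF with unstructured (assumed to be installed)."""
    # Imported lazily so elements_to_text works without the package.
    from unstructured.partition.pdf import partition_pdf

    # strategy="ocr_only" is an assumption about the installed version.
    elements = partition_pdf(filename=path, strategy=strategy)
    return elements_to_text(elements)


# Usage (placeholder file name):
#   text = ocr_pdf("old_paper.pdf")
```

The returned string could then be handed to paper-qa like any other extracted text, though how well that works on old scans is untested here.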

Feb 28 '23 02:02 whitead

@sgbaird did you already try the one from unstructured.io?

I think the OCR Cognitive Service API is also quite strong: https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/concept-read?view=form-recog-3.0.0

Mar 05 '23 01:03 usuyama

@usuyama not yet. Thanks for the suggestion! @ramseyissa and @hasan-sayeed are taking point on the project - taking https://mpds.io data and training a model to learn to extract that data from the full texts.

Mar 05 '23 04:03 sgbaird

Not directly related to this discussion but I recently stumbled upon docTR, which might interest everyone here regarding OCR.

Mar 05 '23 12:03 thiswillbeyourgithub

We can also go for the AWS Textract detect-text API.
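
A rough sketch of that route, assuming `boto3` is installed and AWS credentials are configured. It calls Textract's `detect_document_text` on a single page image and keeps only the `LINE` blocks. The function names, the default region, and the idea of feeding it one page image at a time are illustrative assumptions, not something tested against paper-qa.

```python
def lines_from_textract(response):
    """Return the text of every LINE block in a DetectDocumentText response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def ocr_image_with_textract(image_path, region_name="us-east-1"):
    """Send one page image (PNG/JPEG) to Textract and return its lines."""
    import boto3  # lazy import; requires configured AWS credentials

    with open(image_path, "rb") as f:
        image_bytes = f.read()
    # The region is a placeholder; use whichever region your account uses.
    client = boto3.client("textract", region_name=region_name)
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return lines_from_textract(response)
```

Keeping the response parsing in its own function makes it easy to check against a saved response without calling AWS.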

Mar 30 '23 23:03 ghost

Btw, I have been successful at using tesseract (with the right parameters) and then sending the text to ChatGPT for cleanup. It cost very little and was actually great at correcting pretty much all spelling mistakes and even improving the formatting (fixing indentation, etc.).

On the other hand, docTR proved quite disappointing to me: it's probably great for everything that is NOT a screenshot (handwriting, pictures taken at an angle, etc.).
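
A minimal sketch of the tesseract-then-cleanup pipeline described above, assuming `pytesseract` and `Pillow` are installed. The `--psm 6` setting is only a guess at reasonable parameters, not the ones actually used, and `complete` is a hypothetical caller-supplied function that sends a prompt to a chat model and returns the reply.

```python
def build_cleanup_prompt(ocr_text):
    """Wrap raw OCR output in an instruction asking for conservative cleanup."""
    return (
        "The following text was produced by OCR. Fix spelling mistakes and "
        "formatting, but do not add, remove, or rephrase content.\n\n"
        + ocr_text.strip()
    )


def ocr_page(image_path, lang="eng", config="--psm 6"):
    """Run Tesseract on one page image via pytesseract."""
    # Lazy imports so build_cleanup_prompt works without these packages.
    import pytesseract
    from PIL import Image

    # "--psm 6" (treat the page as one uniform block of text) is an
    # assumption for illustration; tune it for your scans.
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang, config=config)


def ocr_and_clean(image_path, complete):
    """OCR a page, then ask a chat model to clean it up.

    `complete` is a hypothetical callable: prompt string in, reply string out.
    """
    return complete(build_cleanup_prompt(ocr_page(image_path)))
```

Passing the model call in as a function keeps the sketch independent of any particular chat API.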

Mar 31 '23 15:03 thiswillbeyourgithub

I use AWS Textract on a day-to-day basis. It works pretty well on handwritten data.

Mar 31 '23 19:03 ghost
