paper-qa icon indicating copy to clipboard operation
paper-qa copied to clipboard

Any thoughts on OCR for older papers? (image-only)

Open sgbaird opened this issue 2 years ago • 7 comments

EDIT: a related OCR/NLP avenue

sgbaird avatar Feb 27 '23 23:02 sgbaird

Go for it - https://unstructured-io.github.io/unstructured/bricks.html#partition-pdf

whitead avatar Feb 28 '23 02:02 whitead

@sgbaird did you already try the one from unstructured.io?

I think OCR Cognitive Service API is also quite strong https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/concept-read?view=form-recog-3.0.0

usuyama avatar Mar 05 '23 01:03 usuyama

@usuyama not yet. Thanks for the suggestion! @ramseyissa and @hasan-sayeed are taking point on the project - taking https://mpds.io data and training a model to learn to extract that data from the full texts.

sgbaird avatar Mar 05 '23 04:03 sgbaird

Not directly related to this discussion but I recently stumbled upon docTR, which might interest everyone here regarding OCR.

thiswillbeyourgithub avatar Mar 05 '23 12:03 thiswillbeyourgithub

we can also go for aws textract detect text api

ghost avatar Mar 30 '23 23:03 ghost

Btw, I have been successful at using tesseract (with the right parameters) and then sending the text to ChatGPT for cleanup. It cost very little and was actually great at correcting pretty much all spelling mistakes and even enhancing the formatting (fix indentation etc).

On the other hand docTR proved quite disappointing to me : it's probably great for everything that is NOT a screenshot (handwritten, picture with an angle etc)

thiswillbeyourgithub avatar Mar 31 '23 15:03 thiswillbeyourgithub

I uses aws textract on day to day bases .It work pretty well on handwritten data.

ghost avatar Mar 31 '23 19:03 ghost