Add OCR support

Open vinayak-mehta opened this issue 4 years ago • 6 comments

An experimental version exists in the history before commit 9753889. It uses Tesseract (through pyocr). ocropy looked promising the last time I checked. Opening this issue for discussion and experiments around OCR.

Jul 04 '19 21:07 vinayak-mehta

Hi, is there any update on OCR support?

Aug 14 '20 03:08 belisards

I hope to do an experiment soon with https://github.com/JaidedAI/EasyOCR.

Aug 14 '20 11:08 vinayak-mehta

You could check out OCRmyPDF. Apart from performing OCR, it can deskew/dewarp images (using leptonica). I've used it myself and the results are pretty good, but I don't know how it performs against EasyOCR. OCRmyPDF does have a dependency on Ghostscript, though.

Sep 21 '20 02:09 suyashb95

I was able to get nice results on some images with EasyOCR (https://vinayak.io/2020/09/20/day-29-easyocr-dabblements/). I might try working on a PR to integrate it with the code I mentioned in the first comment on this issue.

Sep 21 '20 08:09 vinayak-mehta

If camelot could offer an entry function that receives a list of words with their bounding box coordinates, it would make it easy to integrate any OCR tool that provides this information, such as Tesseract, EasyOCR, or others.

When pdfminer parses an OCR'd PDF, such as one produced by OCRmyPDF, it frequently merges columns, even when the column cells are visibly far apart in the PDF.
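
To make the idea concrete, here is a minimal sketch of what such an entry point might look like. Everything in it is hypothetical: `Word`, `group_rows`, `column_spans`, and `words_to_table` are illustrative names and not part of camelot, and the row tolerance and minimum column gap are assumed values that would need tuning per document. It assumes image-style coordinates (y grows downward) and splits columns wherever no word box covers a horizontal gap of at least `min_gap`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One OCR word with its bounding box (hypothetical input format)."""
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def group_rows(words: List[Word], row_tol: float = 5.0) -> List[List[Word]]:
    """Cluster words into rows by the vertical centre of their boxes."""
    rows = []
    for w in sorted(words, key=lambda w: (w.y0 + w.y1) / 2):
        cy = (w.y0 + w.y1) / 2
        if rows and abs(cy - rows[-1]["cy"]) <= row_tol:
            rows[-1]["words"].append(w)
        else:
            rows.append({"cy": cy, "words": [w]})
    # Within each row, order words left to right.
    return [sorted(r["words"], key=lambda w: w.x0) for r in rows]


def column_spans(words: List[Word], min_gap: float = 10.0) -> List[List[float]]:
    """Merge horizontal word extents; a gap of at least min_gap starts a new column."""
    spans = sorted((w.x0, w.x1) for w in words)
    merged = [list(spans[0])]
    for x0, x1 in spans[1:]:
        if x0 - merged[-1][1] < min_gap:
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])
    return merged


def words_to_table(words: List[Word], row_tol: float = 5.0,
                   min_gap: float = 10.0) -> List[List[str]]:
    """Build a list of rows, each a list of cell strings, from word boxes."""
    if not words:
        return []
    cols = column_spans(words, min_gap)
    table = []
    for row in group_rows(words, row_tol):
        cells = [[] for _ in cols]
        for w in row:
            cx = (w.x0 + w.x1) / 2
            # Every word extent lies inside exactly one merged column span.
            idx = next(i for i, (a, b) in enumerate(cols) if a <= cx <= b)
            cells[idx].append(w.text)
        table.append([" ".join(c) for c in cells])
    return table


# Made-up sample boxes, only to show the call shape.
sample = [
    Word("Item", 10, 10, 40, 20), Word("Price", 100, 10, 140, 20),
    Word("Green", 10, 30, 45, 40), Word("tea", 50, 30, 70, 40),
    Word("4.50", 100, 30, 130, 40),
]
table = words_to_table(sample)
```

Because the columns come straight from the word geometry rather than from pdfminer's text grouping, cells that are far apart on the page stay in separate columns. A real integration would still need to handle spanning cells and skewed scans, which this sketch ignores.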

Apr 29 '21 06:04 javiqm12

"If camelot can offer an entry function that receives a list of words with their bounding boxes coordinates"

@javiqm12 You can already specify table areas and regions with camelot. Are you referring to another way to provide bounding box coordinates?
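
For reference, a small sketch of the existing option mentioned here. `camelot.read_pdf` accepts `table_areas` as a list of `"x1,y1,x2,y2"` strings, where (x1, y1) is the top-left and (x2, y2) the bottom-right corner in PDF coordinate space. The helpers `area_string` and `read_area` are hypothetical conveniences, and the coordinates are illustrative only. Note that this restricts where camelot looks for a table; it does not pass per-word boxes.

```python
def area_string(x1, y1, x2, y2):
    """Format a box as the "x1,y1,x2,y2" string that table_areas expects."""
    return f"{x1},{y1},{x2},{y2}"


def read_area(path, box, flavor="stream"):
    """Hypothetical wrapper: parse only the given area of a PDF with camelot."""
    import camelot  # imported lazily so the helper above works without camelot installed

    return camelot.read_pdf(path, flavor=flavor, table_areas=[area_string(*box)])


# Illustrative coordinates, not taken from any real document.
area = area_string(316, 499, 566, 337)
```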

Jun 14 '21 20:06 vinayak-mehta
