Integrate with Tesseract OCR and allow image uploading

Open harish2704 opened this issue 5 years ago • 10 comments

Tesseract can OCR an image and convert it into a PDF that contains the image plus a text layer. So, if we integrate Tesseract with Excalibur, we will be able to extract tables directly from images.
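For readers unfamiliar with this Tesseract feature, here is a minimal sketch of the image-to-PDF step. It assumes the `tesseract` binary is installed and on the PATH; the function names are hypothetical and are not taken from the fork or from Excalibur.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_pdf_command(image_path, output_base, lang="eng"):
    """Return the argv list for `tesseract <image> <outputbase> -l <lang> pdf`.

    The trailing `pdf` selects Tesseract's PDF output, which writes
    <outputbase>.pdf containing the image plus an invisible text layer.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def image_to_searchable_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract (hypothetical helper) and return the path of the PDF."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    command = build_tesseract_pdf_command(image_path, output_base, lang)
    subprocess.run(command, check=True)
    # Tesseract appends the .pdf extension to the output base itself.
    return Path(f"{output_base}.pdf")
```

Several languages can be combined with `+`, for example `lang="hin+eng"`, provided the matching language data files are installed. The resulting PDF could then be handed to the existing PDF table-extraction path.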

I tried this idea and can confirm that it works. A live POC is available here: http://ocr.harishk.in/ . The fork branch for this POC is here: https://github.com/harish2704/excalibur/tree/feat-tessearct-ocr-integration

But this integration is kind of a hack: the Tesseract execution blocks the current server thread. I would like suggestions from the author on how to integrate this idea properly.
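One generic way to avoid blocking the request thread is to hand the external command to a worker pool and return immediately. The sketch below uses only the Python standard library; it is an illustration of the idea, not how Excalibur schedules its jobs, and `_ocr_pool` and `submit_ocr_job` are hypothetical names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical module-level pool; max_workers caps how many OCR
# processes may run at the same time.
_ocr_pool = ThreadPoolExecutor(max_workers=2)


def submit_ocr_job(command):
    """Queue an external command and return a Future right away.

    The caller (for example a request handler) is not blocked; it can
    store the Future and check `future.done()` or call
    `future.result()` later.
    """
    return _ocr_pool.submit(
        subprocess.run, command, capture_output=True, check=False
    )
```

A handler could call `submit_ocr_job([...])` with the Tesseract command, respond to the client at once, and let a later request poll the job status. `sys.executable` is imported only so that a stand-in command such as `[sys.executable, "-c", "print('done')"]` can be used to try the sketch without Tesseract installed.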

Once it is ready, I can send a PR

harish2704 avatar Jun 01 '20 19:06 harish2704

@harish2704 Looking forward to your PR. This is an important feature enhancement.

arky avatar Jul 25 '20 15:07 arky

@arky : The branch https://github.com/harish2704/excalibur/tree/feat-tessearct-ocr-integration mentioned in my first comment is already in a "working" condition. I could have sent a PR at that point, but I didn't because I believe my implementation doesn't meet the necessary quality, as it doesn't comply with the architecture of this project. Thus, I need help / suggestions from the author of this project to integrate it properly.

harish2704 avatar Jul 25 '20 18:07 harish2704

@vinayak-mehta I hope you would review and guide @harish2704 when you have a free moment.

arky avatar Jul 26 '20 05:07 arky

Hey folks. Really sorry for the late reply.

@harish2704 Can you please add instructions on how to run your fork? Do I need to install tesseract and have it available on the PATH? It would also help if you can show us the extraction output on some image-based pdfs. You can use imagemagick / ghostscript on a pdf from the camelot tests dir to convert it to an image-based one.
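As a rough sketch of the Ghostscript route, the command below rasterizes each page of a PDF to a PNG, which discards the text layer. It assumes the `gs` binary is on the PATH; the helper name, the output pattern, and the default resolution are illustrative choices.

```python
import subprocess


def build_rasterize_command(pdf_path, output_pattern="page-%03d.png", dpi=300):
    """Return a Ghostscript argv list that renders each PDF page to a PNG.

    %03d in the output pattern is expanded by Ghostscript to the page number.
    """
    return [
        "gs",
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit after the last page
        "-dSAFER",            # restrict file system access
        "-sDEVICE=png16m",    # 24-bit colour PNG output
        f"-r{dpi}",           # rendering resolution in DPI
        f"-sOutputFile={output_pattern}",
        str(pdf_path),
    ]


def rasterize_pdf(pdf_path, output_pattern="page-%03d.png", dpi=300):
    """Run Ghostscript; raises CalledProcessError if it fails."""
    subprocess.run(build_rasterize_command(pdf_path, output_pattern, dpi), check=True)
```

The page images can be used directly as OCR input, or combined back into an image-only PDF with a tool such as ImageMagick.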

I'm also planning to experiment with https://github.com/JaidedAI/EasyOCR soon.

vinayak-mehta avatar Aug 13 '20 20:08 vinayak-mehta

Hi @vinayak-mehta , Thanks for your time and response. Regarding your questions:

@harish2704 Can you please add instructions on how to run your fork? Do I need to install tesseract and have it available on the PATH?

I just uploaded the Dockerfile that I was using to deploy the demo site. Please see it here: https://github.com/harish2704/excalibur/blob/feat-tessearct-ocr-integration/Dockerfile

It would also help if you can show us the extraction output on some image-based pdfs.

I just started the demo container again. You can see it in action here: https://ocr.harishk.in/ . Feel free to test any files. (I actually made this when the https://www.covid19india.org/ team had difficulty processing the daily press releases of certain states, so I only enabled Hindi+English on the demo site. NB: this work didn't help them much.)

I'm also planning to experiment with https://github.com/JaidedAI/EasyOCR soon.

I have some experience with implementing OCRs ( https://github.com/harish2704/pottan-ocr ) and based on that I would say:

  1. Tesseract is the result of years of work and it has many powerful features.
    • Tesseract 4.x and later use very powerful convolutional-recurrent networks (the same kind that most modern ML-based OCRs use) in a very CPU-efficient way; I doubt a similar TensorFlow / PyTorch model can beat Tesseract on CPU performance. In short, it will be hard to find an ML-based OCR that can reliably run as a web service on CPU alone (it would be OK if we have a GPU).
  2. In the current implementation, we use Tesseract's PDF output feature, which converts images into OCR'ed PDFs.

harish2704 avatar Aug 14 '20 18:08 harish2704

@harish2704 @vinayak-mehta OCRmyPDF seems to have implemented Tesseract integration. It's worth taking a look: https://github.com/jbarlow83/OCRmyPDF

arky avatar Aug 24 '20 16:08 arky

Hi @arky : Thanks for your message.

I checked the project you mentioned. The description of the project says

"OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched"

In my understanding, this is a built-in feature of Tesseract OCR. That is, if we set the output format to PDF, we get a PDF with an additional text layer.

Thus, without clearly understanding the real benefit of using "OCRmyPDF", I would not suggest integrating it.

harish2704 avatar Aug 26 '20 19:08 harish2704

I've opened a PR on camelot which will add OCR support using EasyOCR: https://github.com/camelot-dev/camelot/pull/209 . I'm working towards a release; it's been slow because life is getting in the way.

vinayak-mehta avatar Nov 09 '20 14:11 vinayak-mehta

@vinayak-mehta Could you pull this into an experimental branch? Perhaps this would allow @harish2704 to work on it.

Just a thought!

arky avatar Apr 03 '21 17:04 arky

@harish2704 You should be able to use camelot+easyocr using the https://github.com/camelot-dev/camelot/pull/209 branch.

@arky Is that what you meant?

vinayak-mehta avatar Apr 04 '21 20:04 vinayak-mehta