pdfsearch enhancing this package with "OCR" and "translation"

We have experimented with two commercial Azure APIs to OCR scanned PDFs and translate them into English when they are not already in English. In my opinion they work "well enough" for doing keyword search in the results.

I will try to integrate this functionality into this package for our purposes and work on it in this fork: https://github.com/openefsa/pdfsearch

I could potentially contribute this as open source here, if you are interested. Of course, a potential user would need to bring their own Azure API token in order to use the functionality.

In case you do not want to couple the package too closely to a commercial provider such as Azure, it might be useful to have two extension points in this package that allow one to "plug in":

- an OCR provider
- a translation provider

Nov 20 '20 09:11 behrica

Thanks for looking at this. If this is integrated, I'd prefer the latter approach, to be more agnostic. For example, one could use Azure or Tesseract OCR (which is open source). I have not followed open-source translation options as closely, but if there are options, being flexible would be useful, I think.

If you want to submit a PR that implements the Azure piece, that would be incredibly helpful. I could add the Tesseract approach and generalize the implementation to use whichever the user wishes.

Nov 20 '20 17:11 lebebr01

I have a working implementation.

There is one piece of code that could be made agnostic, based on these concepts:

- one function, ocr_pdf, which takes a PDF path as input and outputs a character vector
- one function, translate_text, which takes a character vector as input and returns a character vector

I added the concept of "language detection" for the text and of a "target language"; translation is only called if the text language and the target language do not match. A more general case might need to handle several target languages.

As there are two Azure APIs (one for OCR, one for translation), I need to pass in credentials in some form.

I have the credentials "hardcoded" as function parameters, but we should do this differently.
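
One common alternative to hardcoded credentials is to read them from environment variables. The following is a minimal sketch of that idea, not the code in the fork; the helper `get_credential` and the variable name `AZURE_OCR_KEY` are hypothetical names chosen for illustration.

```r
# Hypothetical helper: read a secret from an environment variable and
# fail early with a clear message if it is missing.
get_credential <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable '", name, "' is not set.", call. = FALSE)
  }
  value
}

# Example call (the variable name is a placeholder, not an Azure convention):
# ocr_key <- get_credential("AZURE_OCR_KEY")
```

The value could be set per user in `~/.Renviron`, so that no key ends up in the package code or in function calls.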

As my implementation calls slow / expensive APIs, I also implemented caching via memoization (but this is an implementation detail of the Azure implementation).
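
To illustrate the caching idea, here is a minimal base R sketch of memoization keyed by the PDF path. It is not the code in the fork; `memoize_by_key`, `slow_ocr`, and `cached_ocr` are hypothetical names, and the cache only lives for the current R session.

```r
# Wrap a function of one string argument so that repeated calls with the
# same key reuse the stored result instead of calling the function again.
memoize_by_key <- function(fun) {
  cache <- new.env(parent = emptyenv())
  function(key) {
    stopifnot(is.character(key), length(key) == 1, nzchar(key))
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, fun(key), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# Stand-in for a slow or expensive OCR call; it counts how often it runs.
calls <- 0
slow_ocr <- function(path) {
  calls <<- calls + 1
  paste("text of", path)
}

cached_ocr <- memoize_by_key(slow_ocr)
first <- cached_ocr("scan.pdf")   # calls slow_ocr
second <- cached_ocr("scan.pdf")  # served from the cache
```

A cache that should survive across sessions would need to be written to disk; the sketch above leaves that out.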

Nov 20 '20 21:11 behrica

I am not an expert in R. Is there a standard concept of "extension points" in R? Is it just a matter of passing a function into another function?
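
For illustration, passing a function as an argument is indeed a widely used idiom in R, since functions are ordinary values. A minimal sketch, in which `extract_words`, `ocr_fun`, and `fake_ocr` are hypothetical names and not part of pdfsearch:

```r
# The caller can plug in any function that maps a path to a character vector.
extract_words <- function(path, ocr_fun = NULL) {
  # Fall back to a trivial default when no provider is plugged in.
  if (is.null(ocr_fun)) {
    ocr_fun <- function(path) character(0)
  }
  stopifnot(is.function(ocr_fun))
  text <- ocr_fun(path)
  as.character(unlist(strsplit(text, "[[:space:]]+")))
}

# A stand-in provider; a real one would call an OCR engine or a web API.
fake_ocr <- function(path) c("scanned page one", "scanned page two")

words <- extract_words("report.pdf", ocr_fun = fake_ocr)
```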

Nov 20 '20 21:11 behrica

I am now "ready" for our internal usage. My colleagues (non-technicians) now have a very easy way to search in PDFs, regardless of whether they have extractable text, are scanned, and/or are non-English.

I changed the existing code of keyword_search slightly, in three ways:

1) an extension point to plug in an "OCR" function
2) an extension point to plug in a "translate" function
3) some simple logic to decide whether 1) and 2) should be called, depending on:
- whether pdf_text returns "empty" text (fewer than 100 characters)
- whether a language detector (franc package) decides that the current text is already in the target language
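
The decision logic above could be sketched roughly as follows. This is an illustration under assumptions, not the code in the fork: `process_pages` and all argument names are hypothetical, the already extracted pages are passed in directly, and the language detector is a plain function argument (the franc package could fill that role).

```r
# pages: character vector as returned by a PDF text extractor.
# ocr_fun, translate_fun, detect_lang_fun: the pluggable functions.
process_pages <- function(pages, path, ocr_fun, translate_fun, detect_lang_fun,
                          target_lang = "eng", min_chars = 100) {
  # Step 1: treat the PDF as scanned if the text layer is (nearly) empty.
  if (sum(nchar(pages)) < min_chars) {
    pages <- ocr_fun(path)
  }
  # Step 2: translate only when the detected language differs from the target.
  lang <- detect_lang_fun(paste(pages, collapse = " "))
  if (!identical(lang, target_lang)) {
    pages <- translate_fun(pages, from = lang, to = target_lang)
  }
  pages
}

# Stub providers, standing in for real OCR, detection, and translation.
stub_ocr <- function(path) "Dies ist ein gescannter Text."
stub_detect <- function(text) {
  if (grepl("Dies", text, fixed = TRUE)) "deu" else "eng"
}
stub_translate <- function(text, from, to) {
  rep("This is a scanned text.", length(text))
}

# An empty text layer triggers OCR, then translation of the German result.
result <- process_pages(character(0), "scan.pdf",
                        stub_ocr, stub_translate, stub_detect)
```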

The code now also has an "Azure-based" implementation of the two extension points. This is "quick and dirty", but very useful for us. Its "biggest task" is to chunk the text into pieces small enough that the Azure API accepts them. The OCR API is a push-task-and-poll-status type of API, so I also implemented the "waiting for a result".
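
Both tasks can be sketched in base R. The following is an illustration only, not the Azure implementation in the fork: `chunk_text` and `poll_until_done` are hypothetical names, the size limit is a parameter rather than a documented Azure value, and the status check is a stub instead of a real HTTP call.

```r
# Split text into chunks of at most max_chars characters, breaking on
# whitespace. A single word longer than max_chars is kept as its own
# oversized chunk rather than being cut.
chunk_text <- function(text, max_chars) {
  words <- unlist(strsplit(text, "[[:space:]]+"))
  words <- words[nzchar(words)]
  chunks <- character(0)
  current <- ""
  for (word in words) {
    candidate <- if (nzchar(current)) paste(current, word) else word
    if (nchar(candidate) > max_chars && nzchar(current)) {
      chunks <- c(chunks, current)
      current <- word
    } else {
      current <- candidate
    }
  }
  if (nzchar(current)) chunks <- c(chunks, current)
  chunks
}

# Call check_fun repeatedly until it reports completion or tries run out.
# check_fun must return a list with elements `done` and `result`.
poll_until_done <- function(check_fun, interval = 1, max_tries = 30) {
  for (i in seq_len(max_tries)) {
    status <- check_fun()
    if (isTRUE(status$done)) return(status$result)
    Sys.sleep(interval)
  }
  stop("Timed out waiting for the result.", call. = FALSE)
}

chunks <- chunk_text("aa bb cc dd", max_chars = 5)

# Stub status check that reports completion on the third call.
tries <- 0
fake_status <- function() {
  tries <<- tries + 1
  list(done = tries >= 3, result = "recognized text")
}
ocr_text <- poll_until_done(fake_status, interval = 0)
```

The `max_tries` limit keeps the loop from waiting forever if the remote task never finishes.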

I would be happy to collaborate with you further to move this into the upstream version of the package.

Dec 09 '20 16:12 behrica

@lebebr01 please let me know if you want me to do anything on #23

Dec 09 '20 16:12 behrica

Thanks, @behrica. I'll take a look more closely soon. Likely won't be for at least a week or so, I need to get through the end of the semester here first.

Dec 09 '20 20:12 lebebr01
