OCRmyPDF
Azure ocr with ocrmypdf
ocrmypdf works great on PDFs of scanned images. However, with handwritten letters the Tesseract OCR engine often struggles. How do I use the Azure OCR API as the OCR engine while keeping everything else the same?
OCRmyPDF has a plugin interface that would allow you to replace Tesseract with a different OCR engine such as Azure. To the best of my knowledge no one has published a plugin that does this (or for that matter, any plugin, since the plugin interface is quite new).
OCRmyPDF can only interpret the hOCR format or a text only PDF, so you'd have to convert Azure's output to one of those two as well, since unfortunately it does not support either standard (last time I looked, anyway).
The Azure output looks something like this:
{"status": "Succeeded", "recognitionResult": {"lines": [{"boundingBox": [292, 146, 780, 144, 781, 218, 293, 220], "text": "string1", "words": [{"boundingBox": [297, 150, 774, 145, 775, 218, 300, 218], "text": "string2"}]}, {"boundingBox": [327, 215, 748, 219, 747, 255, 326, 252], "text": "string3 string4", "words": [{"boundingBox": [330, 219, 496, 219, 498, 253, 332, 251], "text": "string3"}, "text": "string4"}]}]}}
Is it possible to convert this into one of the formats that you mentioned?
If you look at the hOCR format example given on Wikipedia, I would say yes. Besides, have a look here: https://stackoverflow.com/questions/62074677/generate-hocr-from-microsoft-computer-vision-ocr
Another alternative could be https://github.com/JaidedAI/EasyOCR, but it only outputs a simple list. I think that could be converted to hOCR easily.
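For what it's worth, here is a minimal sketch of that conversion in Python. It assumes the `recognitionResult` / `lines` / `words` layout quoted above, with each `boundingBox` being four corner points in pixels. The function names and the `page_width` / `page_height` parameters are my own (that response does not carry the page size), and I have not checked which hOCR elements a particular consumer insists on, so treat the `ocr_par` wrapper as a guess.

```python
from html import escape


def quad_to_bbox(quad):
    """Collapse an 8-number quadrilateral (x1, y1, ..., x4, y4) into the
    axis-aligned 'x0 y0 x1 y1' string used by hOCR bbox properties."""
    xs, ys = quad[0::2], quad[1::2]
    return f"{round(min(xs))} {round(min(ys))} {round(max(xs))} {round(max(ys))}"


def azure_to_hocr(result, page_width, page_height):
    """Build a single-page hOCR document from a parsed Azure response.

    ASSUMPTION: `result` has the 'recognitionResult' -> 'lines' -> 'words'
    shape shown in this thread; page size is supplied by the caller.
    """
    parts = [
        "<!DOCTYPE html>",
        '<html xmlns="http://www.w3.org/1999/xhtml"><head>',
        '<meta charset="utf-8"/>',
        '<meta name="ocr-capabilities" content="ocr_page ocr_par ocr_line ocrx_word"/>',
        "</head><body>",
        f'<div class="ocr_page" id="page_1" title="bbox 0 0 {page_width} {page_height}">',
    ]
    lines = result["recognitionResult"]["lines"]
    for line_no, line in enumerate(lines, start=1):
        line_box = quad_to_bbox(line["boundingBox"])
        # One paragraph per line: Azure gives no paragraph grouping here.
        parts.append(f'<p class="ocr_par" title="bbox {line_box}">')
        parts.append(f'<span class="ocr_line" id="line_{line_no}" title="bbox {line_box}">')
        for word_no, word in enumerate(line.get("words", []), start=1):
            word_box = quad_to_bbox(word["boundingBox"])
            text = escape(word["text"])  # keep '<', '&' etc. valid in XHTML
            parts.append(
                f'<span class="ocrx_word" id="word_{line_no}_{word_no}" '
                f'title="bbox {word_box}">{text}</span>'
            )
        parts.append("</span></p>")
    parts.append("</div></body></html>")
    return "\n".join(parts)
```

You would call it as `azure_to_hocr(json.loads(response_text), width, height)` and write the returned string to the `.hocr` file. Rotated text loses its angle in this approach, since the quadrilateral is flattened to an upright box.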
I asked about the Google API here: https://github.com/ocrmypdf/OCRmyPDF/issues/915, but my question is the same as yours. The ideal would be to grab the information from these services' APIs and then "paste" it into the PDF as invisible text, like OCRmyPDF already does.
It's not that hard to implement. I have some very basic Python code that uses the Google Vision API to get better results. For the orientation of the page I use Tesseract, because it's the easiest way. But for text recognition the image is sent to the Vision API, the JSON response gets converted to hOCR, and you have your text layer.
@kkrell2016 May I ask how the conversion from JSON to hOCR happens? Have you written your own script for that purpose?
@isspid I found a project called gcv2hocr and combined it with some custom python code. The custom python script can be run as a plugin in ocrmypdf. I also uploaded it to my github, should be publicly available. I had to modify gcv2hocr a bit to make it work with the current Google Vision API.
If you have any questions please contact me
I think the above plugin interface (e.g. generate_pdf(input_file, output_pdf, output_text, options)) will be called for each page in the PDF instead of for the whole PDF? Is there an interface that hands over the entire PDF, so that we can return a list of hOCR files generated from an Azure OCR result, or a single hOCR file for many pages (if that is possible in the hOCR format; I have to learn)?
Then I can call:
# Something like this for multiple pages?
from ocrmypdf import hocrtransform

helper = hocrtransform.HocrTransform(
    hocr_filename=hocr_file,  # or list_of_hocr_files
    dpi=150,
)
helper.to_pdf(out_filename=output_pdf)  # a multi-page PDF
Our use case is that we send a batch of pages to Azure OCR (otherwise it would be very slow for us to process many pages) and it returns an Azure OCR result object for all the pages. I can loop through each page object of the Azure OCR result and generate either a list of hOCR objects (where one hOCR object corresponds to one page) or a single hOCR object (if that's possible).
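To make the batch idea concrete, here is a rough Python sketch that splits one multi-page result into per-page JSON files, so that a later per-page call can pick up its own page. The `analyzeResult` / `readResults` / `page` keys are an assumption about the batch response layout, and `cache_pages` / `load_page` are hypothetical helper names. I went with files rather than an in-memory dictionary because per-page work may run in separate worker processes that would not share module-level state.

```python
import json
from pathlib import Path


def cache_pages(batch_result, cache_dir):
    """Write each page of a batch OCR result to '<cache_dir>/NNNNNN.json'.

    ASSUMPTION: the response looks like
    {"analyzeResult": {"readResults": [{"page": 1, "lines": [...]}, ...]}}.
    Adjust the keys if your response differs.
    Returns {page_number: path_to_cached_file}.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for page in batch_result["analyzeResult"]["readResults"]:
        page_no = page["page"]  # assumed to be 1-based
        target = cache_dir / f"{page_no:06d}.json"
        target.write_text(json.dumps(page), encoding="utf-8")
        written[page_no] = target
    return written


def load_page(cache_dir, page_no):
    """Read back the cached result for one page."""
    path = Path(cache_dir) / f"{page_no:06d}.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

The batch call would run once up front, and each per-page step would then only need `load_page(cache_dir, page_no)` plus a JSON-to-hOCR conversion.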
I guess I can call the Azure engine at module level, cache the result, and then, when generate_pdf is called, just pick the page up from there. But how do I know which key in the cache to pick? For example, I'd key the cache by page number, but I won't know from generate_pdf which page it is being called for, as it does not provide a page number, IIUC?
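One possible workaround, sketched in Python below, is to derive the page number from the name of the per-page file passed in as input_file. This assumes those working files are named with a zero-padded page number prefix (something like `000003_ocr.png`); that is an assumption about temporary file naming, not a documented guarantee, so check it against your installed version. `page_number_from_path` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Leading run of digits in the file name, e.g. "000003" in "000003_ocr.png".
_PAGE_PREFIX = re.compile(r"^(\d+)")


def page_number_from_path(input_file):
    """Return the page number encoded at the start of a per-page file name.

    ASSUMPTION: the file name starts with the (zero-padded) page number.
    Raises ValueError when the name does not follow that pattern, so a
    naming change fails loudly instead of silently picking the wrong page.
    """
    name = Path(input_file).name
    match = _PAGE_PREFIX.match(name)
    if match is None:
        raise ValueError(f"no page number prefix in {name!r}")
    return int(match.group(1))
```

Inside generate_pdf you could then use `page_number_from_path(input_file)` as the key into a cache keyed by page number.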
I also noticed generate_pdfa. Would that be useful here? I would then call helper (as above) in a loop and merge the single-page files produced by helper.to_pdf?
Curious about all this myself. Anyone have a working example of converting e.g. the Azure output to hOCR?
@shamoon Looks like we found the same thread ^^ I'm currently trying to make EasyOCR compatible with paperless-ngx, see https://github.com/paperless-ngx/paperless-ngx/discussions/6056#discussioncomment-8801102. I found an Azure-to-hOCR script, and will probably write my own for EasyOCR.