Pass existing OCR-Data in ALTO-Format

Open M3ssman opened this issue 4 years ago • 3 comments

I'd like to use OCRmyPDF with existing OCR data, since I already have thousands of files in ALTO v3. Is it possible to decouple creating the OCR from rendering the data?

M3ssman, Jul 06 '20 04:07

I don't quite understand what you're trying to do. Other than ALTO, what inputs do you have (do you have the original PDFs the ALTO was derived from), and what outputs are you trying to get?

Using an ocrmypdf plugin, you could create an "OCR engine" that just supplies ALTO data at the appropriate time. The test suite does something like this: it provides OCR results from a cache to speed up tests.

jbarlow83, Jul 06 '20 07:07

I already have ALTO OCR output from Tesseract, together with PDF files that currently lack text layers, although they have a table of contents and some metadata (which I'd also like to modify; as far as I know, this could be done with pikepdf, pyMuPDF or PyPDF4). Therefore, I'd like to integrate what is already there, using OCRmyPDF only to render the text layers. Unfortunately, I couldn't figure out how to use OCRmyPDF this way. I thought there might be a CLI flag or parameter to skip Tesseract and use existing ALTO from an external location/path/folder.

But thanks for your hint with the test-suite, I'll take a closer look!

M3ssman, Jul 06 '20 10:07

Okay, now I see what you're trying to do. There isn't any way to do that without some new feature development.

You could write a fake OCR engine plugin that converts ALTO to HOCR and provides the HOCR as its output. You could use a program like https://github.com/UB-Mannheim/ocr-fileformat to convert ALTO to HOCR, which would become an input to ocrmypdf's existing hocrtransform.py. That would keep the benefit of ocrmypdf's ability to add an OCR layer to the original document rather than replace it.

The main drawback of hocrtransform right now is that it does not handle full Unicode properly, but it should be fine for Latin scripts. Another possible issue with this approach is that ALTO may not have enough detail for pixel-perfect alignment between the image and text layer. Also, it's possible that the flavor of HOCR generated by ocr-fileformat may differ from the flavor generated by Tesseract, so hocrtransform may need adjustment. Still, it seems like a viable route to me.
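
To make the ALTO-to-HOCR step more concrete, here is a minimal sketch using only the Python standard library. It is not ocr-fileformat and not part of ocrmypdf; the function names `bbox` and `alto_to_hocr` are made up for illustration. It assumes ALTO v3 (namespace `http://www.loc.gov/standards/alto/ns-v3#`), coordinates already in pixels (ALTO files with `mm10` or `inch1200` as the MeasurementUnit would need scaling first), and one page per file. It emits only page, line and word boxes, so hocrtransform may still need adjustment for it.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumption: ALTO v3 namespace; other ALTO versions use a different URI.
ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v3#"}


def bbox(el):
    """Build an hOCR 'bbox x0 y0 x1 y1' string from ALTO HPOS/VPOS/WIDTH/HEIGHT.

    Assumes the ALTO coordinates are pixels; missing attributes default to 0.
    """
    x0 = float(el.get("HPOS", 0))
    y0 = float(el.get("VPOS", 0))
    x1 = x0 + float(el.get("WIDTH", 0))
    y1 = y0 + float(el.get("HEIGHT", 0))
    return "bbox {} {} {} {}".format(*(round(v) for v in (x0, y0, x1, y1)))


def alto_to_hocr(alto_xml):
    """Convert a single-page ALTO v3 document (str or bytes) to a bare hOCR string."""
    root = ET.fromstring(alto_xml)
    page = root.find(".//a:Page", ALTO_NS)
    if page is None:
        raise ValueError("no ALTO v3 <Page> element found")

    out = [
        "<html><head>",
        '<meta name="ocr-capabilities" content="ocr_page ocr_line ocrx_word"/>',
        "</head><body>",
        '<div class="ocr_page" title="{}">'.format(bbox(page)),
    ]
    # Each ALTO TextLine becomes an ocr_line, each String an ocrx_word.
    for line in page.iterfind(".//a:TextLine", ALTO_NS):
        out.append('<span class="ocr_line" title="{}">'.format(bbox(line)))
        for word in line.iterfind("a:String", ALTO_NS):
            out.append(
                '<span class="ocrx_word" title="{}">{}</span>'.format(
                    bbox(word), escape(word.get("CONTENT", ""))
                )
            )
        out.append("</span>")
    out.append("</div></body></html>")
    return "\n".join(out)
```

A fake OCR engine plugin could call something like `alto_to_hocr(path.read_bytes())` for the ALTO file that matches the current page and write the result where ocrmypdf expects the HOCR output. How the page-to-ALTO-file lookup works depends on how the files are named, so it is left out here.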

(Now if it were me, I'd absolutely keep those ALTO files if they are manually generated or corrected, but if those files are just the output of an OCR engine, it's probably a lot less effort to discard them and just use ocrmypdf to get the OCR directly.)

pikepdf can edit both metadata types in a PDF and keeps them consistent. I don't know if the others bother.

jbarlow83, Jul 06 '20 21:07