Generating OCR documents seems very complex

Open lava opened this issue 11 months ago • 0 comments

I'm trying to use Document AI to generate a searchable PDF from an input document. Given the marketing around Document AI and the availability of a pretrained "Document OCR" processor, I'm assuming this is one of the intended use cases.

So I'm uploading the file to GCS, running a Document OCR batch job, and getting back a document. So far, so good.

However, the subsequent workflow then becomes messy very quickly:

As far as I understand, Document AI may internally deskew, convert and/or downscale the submitted document before performing OCR. So in order to have the overlay text match the displayed image, I'm writing out the data stored in document["pages"][i]["image"]["content"]. However, the document.pages helper only yields wrapped pages, which don't expose the content as far as I can tell, so I'm forced to first export the whole document as JSON (in order to handle shards, which aren't documented anywhere btw!), and then re-parse that JSON in order to get the content:

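    import base64
    import json

    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document

    # Load every output shard of the batch job from GCS.
    wrapped_document = document.Document.from_gcs(
        gcs_bucket_name=output_bucket, gcs_prefix=output_prefix
    )
    # Merge the shards into one Document proto, then round-trip through JSON.
    merged_document = wrapped_document.to_merged_documentai_document()
    document_json_string = documentai.Document.to_json(merged_document)
    document_json = json.loads(document_json_string)

    for i, page in enumerate(document_json["pages"]):
        raw_content = base64.b64decode(page["image"]["content"])
        # ...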
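
For illustration, the per-page image export can be sketched with the standard library alone. This is a rough sketch, not part of the toolbox: `write_page_images` and the extension table are hypothetical names, and it assumes the exported JSON stores each page image as base64 under `pages[i].image.content`, with the MIME type under `mimeType` (or `mime_type` if field names were preserved during export).

```python
import base64
from pathlib import Path

# Hypothetical mapping from MIME type to file extension; extend as needed.
_EXTENSIONS = {"image/png": ".png", "image/jpeg": ".jpg", "image/tiff": ".tif"}


def write_page_images(document_json: dict, out_dir: str) -> list[Path]:
    """Decode each page image from the exported JSON and write it to out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for i, page in enumerate(document_json.get("pages", [])):
        image = page.get("image") or {}
        content = image.get("content")
        if not content:
            continue  # no rendered image stored for this page
        # Assumption: camelCase key by default, snake_case if names were preserved.
        mime = image.get("mimeType") or image.get("mime_type") or ""
        path = target / f"page_{i:04d}{_EXTENSIONS.get(mime, '.bin')}"
        path.write_bytes(base64.b64decode(content))
        written.append(path)
    return written
```

Pages without image content are skipped rather than written as empty files, so the returned list can be shorter than the page count.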

In order to get the OCR layer, I'm using document.export_hocr_str(). However, that returns a multi-page .hocr file, and all downstream tooling I could find expects one .hocr and one image as input to produce one PDF page. So I have to split the returned .hocr by page, generate a lot of individual PDFs, and finally use pikepdf to merge them together into a single output document.
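
The splitting step can be sketched as follows. This is a minimal sketch, not the toolbox's behavior: `split_hocr_pages` is a hypothetical helper, and it assumes every page is a `<div class="ocr_page">` that sits directly inside `<body>`, with only whitespace between consecutive pages. Output that nests pages differently would need a real HTML parser.

```python
import re

# Matches the opening tag of a page container, e.g. <div class='ocr_page' ...>.
_PAGE_OPEN = re.compile(
    r"<div\b[^>]*\bclass=(['\"])[^'\"]*\bocr_page\b[^'\"]*\1[^>]*>"
)


def split_hocr_pages(hocr: str) -> list[str]:
    """Split a multi-page hOCR string into one standalone hOCR string per page.

    Each result reuses the original header (everything before the first page
    div) and footer (the closing </body> tag onward).
    """
    starts = [m.start() for m in _PAGE_OPEN.finditer(hocr)]
    if not starts:
        return []
    body_end = hocr.rfind("</body>")
    if body_end == -1:
        raise ValueError("no closing </body> tag found")
    header = hocr[: starts[0]]
    footer = hocr[body_end:]
    # A page runs from its opening tag to the next page's opening tag,
    # or to </body> for the last page.
    ends = starts[1:] + [body_end]
    return [
        header + hocr[start:end].rstrip() + "\n" + footer
        for start, end in zip(starts, ends)
    ]
```

Slicing the original string instead of re-serializing a parsed tree keeps the markup of each page byte-for-byte identical, which matters because downstream tools read the bounding boxes from the `title` attributes.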

Both of these feel very cumbersome, given that I'm already using the toolbox library that is supposed to make working with the API painless.

Am I missing some companion library that would make this workflow easier? I assume Google has some internal libraries that handle these steps. Would it make sense to include them in the toolbox?

Jan 19 '25 13:01 lava