Generating OCR documents seems very complex
I'm trying to use Document AI to generate a searchable PDF from an input document. Given the marketing around Document AI and the availability of a pretrained "Document OCR" processor, I'm assuming this is one of the intended use cases.
So I'm uploading the file to GCS, running a Document OCR batch job, and getting back a document. So far so good.
However, the subsequent workflow then becomes messy very quickly:
- As far as I understand, Document AI may internally deskew, convert and/or downscale the submitted document before performing OCR. So in order to have the overlay text match the displayed image, I'm writing out the data stored in document["pages"][i]["image"]["content"]. However, the document.pages helper only yields wrapped pages, which don't expose the content as far as I can tell, so I'm forced to first export the whole document as JSON (in order to handle shards, which aren't documented anywhere, btw!) and then re-parse the JSON to get the content:
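    import base64
    import json

    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document

    # Load all output shards from GCS and merge them into one Document proto.
    wrapped_document = document.Document.from_gcs(
        gcs_bucket_name=output_bucket, gcs_prefix=output_prefix
    )
    merged_document = wrapped_document.to_merged_documentai_document()
    # Round-trip through JSON to reach the base64-encoded page images.
    document_json_string = documentai.Document.to_json(merged_document)
    document_json = json.loads(document_json_string)
    for i, page in enumerate(document_json["pages"]):
        raw_content = base64.b64decode(page["image"]["content"])
        # ...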
- In order to get the OCR layer, I'm using document.export_hocr_str(). However, that returns a multi-page .hocr file, and all downstream tooling I could find expects one .hocr and one image as input to produce one PDF page. So I have to split the returned .hocr by pages, generate a lot of individual PDFs, and finally use pikepdf to merge them together into a single output document.
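To make the hOCR splitting step concrete, here is a minimal sketch using only the Python standard library. It assumes the exported string is well-formed XHTML in the standard XHTML namespace and that each page is an element with class "ocr_page", as the hOCR format describes; I have not verified either against the toolbox's actual output. split_hocr_pages is a hypothetical helper name, not part of any Google library.

```python
import copy
import xml.etree.ElementTree as ET
from typing import List

XHTML_NS = "http://www.w3.org/1999/xhtml"


def split_hocr_pages(hocr: str) -> List[str]:
    """Split a multi-page hOCR string into one hOCR string per page.

    Hypothetical helper: assumes well-formed XHTML and that every page
    is an element carrying class="ocr_page".
    """
    # Serialize XHTML elements without an "ns0:" prefix.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(hocr.encode("utf-8"))

    # Reuse whatever namespace the root element actually has (may be none).
    prefix = root.tag.rsplit("}", 1)[0] + "}" if root.tag.startswith("{") else ""
    head = root.find(prefix + "head")
    pages = [el for el in root.iter() if el.get("class") == "ocr_page"]

    results = []
    for page in pages:
        # Build a fresh <html> with the original head and a single page.
        html = ET.Element(root.tag, root.attrib)
        if head is not None:
            html.append(copy.deepcopy(head))
        body = ET.SubElement(html, prefix + "body")
        body.append(copy.deepcopy(page))
        results.append(ET.tostring(html, encoding="unicode"))
    return results
```

Note that ElementTree drops the XML declaration and DOCTYPE on serialization, so tools that insist on them would need those lines prepended to each result.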
Both of these feel very cumbersome, given that I'm already using the toolbox library that is supposed to make working with the API painless.
Am I missing some companion library that would make this workflow easier? I assume Google has some internal libraries that handle these steps; would it make sense to include these in the toolbox?