DocumentAI (OCR Processor) failing with 400. I have processed 200+ PDFs successfully and have figured out the problem, but I don't know of an automated solution.
Thank you for your time.
Error message: google.api_core.exceptions.InvalidArgument: 400 Unsupported input file format.
Processor details: OCR Version: pretrained-ocr-v2.0-2023-06-02
Problem statement: I have been using the long-running operation, i.e. client.batch_process_documents(request), for 200+ PDFs with as many as 350 pages per PDF without any issue, and then suddenly things started failing.
Mitigation & root cause: I suspected an inadvertent code change. I stripped the code down to the basics (console testing, etc.), rebuilt it, and got it working, with the conclusion that the original PDF has something off in it: the same PDF, when opened in macOS Preview and re-saved via Print -> Save as PDF, works fine.
Process of elimination: The PDF in question has 218 pages. When stripped down to its first 10 pages, it runs fine in the console as well as locally in Python using ProcessRequest:
request = documentai.ProcessRequest(name=processor.name, gcs_document=gcs_document) # when running on gcs
result = client.process_document(request=request)
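For context, here is the fuller shape of that synchronous call as a self-contained sketch; the project, location, processor ID, and GCS path below are placeholders, not the real values:

from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# Placeholder region endpoint and processor identifiers
opts = ClientOptions(api_endpoint="us-documentai.googleapis.com")
client = documentai.DocumentProcessorServiceClient(client_options=opts)
processor_name = client.processor_path("my-project", "us", "my-processor-id")

gcs_document = documentai.GcsDocument(
    gcs_uri="gs://my-bucket/p10.pdf",  # placeholder path to the 10-page slice
    mime_type="application/pdf",
)
request = documentai.ProcessRequest(name=processor_name, gcs_document=gcs_document)
result = client.process_document(request=request)
print(len(result.document.text), "characters of OCR text")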
Breaking the problem down further, I split the PDF into its first 10 pages (p10.pdf), first 20 pages (p20.pdf), first 30 pages (p30.pdf), ... up to all 218 pages (pAll.pdf), and ran each in batch mode:
long_ops = client.batch_process_documents(request)
metadata = documentai.BatchProcessMetadata(long_ops.metadata)
for process in list(metadata.individual_process_statuses):
    ...
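For reference, the fuller shape of that batch flow as a sketch, with placeholder project, processor, and bucket names; the per-document status loop at the end is where the 400 shows up:

from google.api_core.client_options import ClientOptions
from google.cloud import documentai

opts = ClientOptions(api_endpoint="us-documentai.googleapis.com")  # placeholder region
client = documentai.DocumentProcessorServiceClient(client_options=opts)
processor_name = client.processor_path("my-project", "us", "my-processor-id")  # placeholders

request = documentai.BatchProcessRequest(
    name=processor_name,
    input_documents=documentai.BatchDocumentsInputConfig(
        gcs_documents=documentai.GcsDocuments(
            documents=[
                documentai.GcsDocument(
                    gcs_uri="gs://my-bucket/pAll.pdf",  # placeholder input
                    mime_type="application/pdf",
                )
            ]
        )
    ),
    document_output_config=documentai.DocumentOutputConfig(
        gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
            gcs_uri="gs://my-bucket/output/"  # placeholder output prefix
        )
    ),
)

long_ops = client.batch_process_documents(request)
long_ops.result(timeout=1800)  # wait for the long-running operation to finish

metadata = documentai.BatchProcessMetadata(long_ops.metadata)
for process in metadata.individual_process_statuses:
    # A non-OK status here carries the per-document error, e.g. the 400.
    print(process.input_gcs_source, process.status)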
The WEIRD thing is that all of the partitioned files ran successfully, including the one containing all pages (pAll.pdf). The only delta between the original PDF and the revised one is that I opened the original in macOS Preview and used "Print -> Save as PDF" to produce pAll.pdf.
What could be going on? I don't want to manually run "Print -> Save as PDF" on the remaining workload (400+ files). Is there a programmatic option in Python to work around this? I tried a plain open/save round trip with pymupdf, but that didn't work either.
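One avenue would be a more aggressive programmatic rewrite rather than a plain open/save round trip, e.g. pymupdf's garbage collection and content-stream cleanup options. This is only a sketch under the assumption that such a rewrite normalizes whatever Preview's print-to-PDF fixes; the output file name is a placeholder:

import fitz  # pymupdf

def resave_pdf(in_path: str, out_path: str) -> None:
    with fitz.open(in_path) as doc:
        # garbage=4 deduplicates and drops unreferenced objects,
        # clean=True sanitizes content streams, deflate=True recompresses.
        doc.save(out_path, garbage=4, deflate=True, clean=True)

resave_pdf("Chahekati_Chetana_34.pdf", "Chahekati_Chetana_34_rewritten.pdf")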
Further investigation:
- Using pymupdf, the batch process did run and generated the multiple JSON outputs.
- However, converting those JSONs to text produced zero-byte files:
# Parse the Document AI output JSON into a Document object
document = documentai.Document.from_json(
    json_blob.download_as_bytes(), ignore_unknown_fields=True
)

doc_blob_name = f"txt/{operation_id}/{input_file_no}/{base_name}.txt"
doc_blob = bucket.blob(doc_blob_name)
doc_blob.content_type = "text/plain"

# Write the extracted document text back to GCS
with doc_blob.open("w") as f:
    f.write(document.text)
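To separate "the OCR text really is empty" from "the wrong blob was parsed" (e.g. a non-Document file under the same output prefix), here is a small self-contained check over the output shards; the bucket and prefix names are placeholders:

from google.cloud import documentai, storage

storage_client = storage.Client()
bucket = storage_client.bucket("my-bucket")  # placeholder bucket

for json_blob in bucket.list_blobs(prefix="output/"):  # placeholder prefix
    if not json_blob.name.endswith(".json"):
        continue  # skip anything that is not a Document shard
    document = documentai.Document.from_json(
        json_blob.download_as_bytes(), ignore_unknown_fields=True
    )
    print(json_blob.name, len(document.pages), "pages,", len(document.text), "chars")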
Attached: the original file (Chahekati_Chetana_34.pdf) and the re-saved pAll.pdf (pAll_Chahekati_Chetana_34.pdf).
Hi @mdeliw, this seems like an issue with the Document AI API and not with the Python client. Please file an issue directly with the Document AI API team by reaching out to the support team here.
I'm going to close this issue, as the recommended next step is to file an issue in the API-specific issue tracker: https://issuetracker.google.com/issues/new?component=1132231
If your GCS object has Content-Encoding: gzip metadata, Document AI will fail with 400.
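A sketch for detecting (and optionally clearing) that metadata with the Cloud Storage client; the bucket and prefix names are placeholders, and clearing the header only helps if the object bytes are not actually gzip-compressed:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder bucket

for blob in bucket.list_blobs(prefix="input/"):  # placeholder prefix
    if blob.content_encoding == "gzip":
        print("gzip-encoded object:", blob.name)
        # blob.content_encoding = None
        # blob.patch()  # uncomment to clear the metadata in place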