PaddleOCR returning only the first page when performing ocr on a PDF
Problem Description
When running PaddleOCR().ocr() on a multi-page PDF, only 1 page is returned.
Runtime Environment
- Paddle: 2.6.1
- PaddleOCR: 2.7.3
- Python: 3.10
- Docker container:
# Use a base image with Python 3.10
FROM python:3.10

# Install libGL.so.1
RUN apt-get update \
    && apt-get install -y libgl1-mesa-glx \
    && rm -rf /var/lib/apt/lists/*
Reproduction Code
This code is the Custom Prediction Routine (CPR) wrapper around PaddleOCR. The CPR performs these steps:
- loads the model into memory
- obtains files from a Google Cloud Storage bucket and copies them to a temporary location
- runs PaddleOCR.ocr() with the temporary file
- returns the results
As far as I can tell there is no bug in this code. I have tested it locally, calling the routines in the order the CPR performs them:
p = PaddleOcrPredictor()
p.load(artifacts_uri=os.path.join(src_dir, "model"))
p.postprocess(p.predict(p.preprocess(prediction_input)))
What is very weird is that, when testing right after the CPR has been initially deployed, we have had calls that return more than 1 page of the PDF. On subsequent calls for the exact same document, we get only 1 page.
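One way to make this intermittent behaviour visible is to call the OCR function repeatedly on the same file and record how many pages each call returns. The following is a minimal sketch, not part of the CPR above. It assumes the OCR result is a list with one top-level entry per page; `pages_per_call`, `is_stable`, and `fake_ocr` are hypothetical names used only for illustration, and `fake_ocr` stands in for the real model call.

```python
from typing import Any, Callable, List, Optional


def pages_per_call(
    ocr_fn: Callable[[str], Optional[List[Any]]], path: str, calls: int = 3
) -> List[int]:
    """Run ocr_fn on the same file several times and return the page count per call."""
    counts = []
    for _ in range(calls):
        result = ocr_fn(path)
        # Assumption: one top-level entry per page; None is counted as zero pages.
        counts.append(len(result) if result is not None else 0)
    return counts


def is_stable(counts: List[int]) -> bool:
    """Return True when every call returned the same number of pages."""
    return len(set(counts)) <= 1


if __name__ == "__main__":
    # Hypothetical stand-in for the real OCR call: 3 pages on the first call,
    # 1 page afterwards, mimicking the behaviour described in this report.
    state = {"calls": 0}

    def fake_ocr(path: str) -> List[Any]:
        state["calls"] += 1
        return [[], [], []] if state["calls"] == 1 else [[]]

    counts = pages_per_call(fake_ocr, "sample.pdf")
    print(counts, is_stable(counts))
```

In the real service, `ocr_fn` could be the loaded model's `ocr` method, so the same check runs against a deployed instance without restarting it between calls.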
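import os

from google.cloud import storage
from google.cloud.aiplatform.prediction.predictor import Predictor
from paddleocr import PaddleOCR

# download_files_from_gcs_folder is a project helper that is not shown in this report.


class PaddleOcrPredictor(Predictor):
    def __init__(self):
        self.model_det_name = "en_PP-OCRv3_det_infer"
        self.model_rec_name = "en_PP-OCRv3_rec_infer"
        self.model_cls_name = "ch_ppocr_mobile_v2.0_cls_infer"
        self._model = None  # loaded in self.load()
        self.model_kwargs = {"cls": False}
        # self.aip_storage_uri = os.getenv("AIP_STORAGE_URI", "")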

    def load(self, artifacts_uri: str = "/usr/app/model"):
        print(f"Current directory {os.getcwd()}")
        print(f"Artifacts uri {artifacts_uri}")
        # Use the first candidate directory that exists
        model_directories = [artifacts_uri, "/usr/app/model"]
        found_directory = next(
            (path for path in model_directories if os.path.exists(path)), None
        )
        if found_directory:
            artifacts_uri = found_directory
        else:
            raise Exception(
                f"Can't find model in any of the locations: {', '.join(model_directories)}"
            )
        self._model = PaddleOCR(
            det_model_dir=f"{artifacts_uri}/{self.model_det_name}/",
            rec_model_dir=f"{artifacts_uri}/{self.model_rec_name}/",
            cls_model_dir=f"{artifacts_uri}/{self.model_cls_name}",
            lang="en",
            show_log=False,
            use_gpu=True,
        )

    def preprocess(self, prediction_input: dict):
        """
        Processes the input request prior to model prediction.

        :param prediction_input: Input request (JSON format). It must contain an
            'instances' field and may contain an optional 'parameters' field if
            you're using a custom container. No other fields can be present.
            {"instances":
                [{
                    "input": file path [either to temp file in workspace, or static file] or PIL image
                    "model_kwargs": dict  # See for more info: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/ppstructure/docs/quickstart_en.md#24-parameter-description
                }]
            }
        """
        # Obtain the first instance from the request
        instances = prediction_input["instances"]
        instance = instances[0]
        file_name = instance["input"]
        temp_cwd = os.getcwd()

        # Set default parameters
        project_id_bucket = "default-bucket"
        bucket_name = f"{project_id_bucket}-image"
        local_file = False
        model_kwargs = {}
        print(f"Full prediction input: {prediction_input}")
        print(f"Instance value: {instance}")

        # Update parameters if they exist in prediction_input
        if "parameters" in prediction_input:
            parameters = prediction_input["parameters"]
            project_id_bucket = parameters.get("project_id_bucket", project_id_bucket)
            bucket_name = parameters.get("bucket_name", bucket_name)
            local_file = parameters.get("local_file", local_file)
            model_kwargs = parameters.get("model_kwargs", model_kwargs)
        print(f"Prediction input: {file_name}")
        print(f"Current working directory: {temp_cwd}")

        if isinstance(file_name, str):
            # For a local file, check that the path exists; otherwise download it from GCS
            if local_file:
                if not os.path.exists(file_name):
                    raise ValueError(
                        f"File name passed, but no file in directory: {file_name}"
                    )
                local_file_path = file_name
            else:
                gcs = storage.Client(project_id_bucket)
                bucket = gcs.get_bucket(bucket_name)
                local_file_path = download_files_from_gcs_folder(
                    bucket, temp_cwd, file_name
                )
            instance.update({"input": local_file_path})
        else:
            raise ValueError(f"File path passed is not a str: {file_name}")
        instance.update({"model_kwargs": model_kwargs})
        return instance
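
    def predict(self, instances: dict):
        """Runs OCR on the file path stored under instances["input"]."""
        print(instances)
        if "model_kwargs" in instances:
            # Note: this mutates state shared across requests, and
            # self.model_kwargs is never passed to ocr() below.
            self.model_kwargs.update(instances["model_kwargs"])
        prediction_results = self._model.ocr(instances["input"])
        return prediction_results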

    def postprocess(self, prediction_results):
        print("Post process stage reached")
        return {"predictions": [prediction_results]}
Complete Error Message
There is no error message, but PaddleOCR returns only 1 page from a multi-page PDF.
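Because the truncation is silent, a small guard could turn it into an explicit failure while the cause is investigated. This is only a sketch under stated assumptions: the result is assumed to hold one top-level entry per page, the expected page count is assumed to be known from elsewhere (for example, read from the PDF by a separate library), and `MissingPagesError` and `require_all_pages` are hypothetical names, not part of PaddleOCR.

```python
from typing import Any, List, Optional


class MissingPagesError(RuntimeError):
    """Raised when the OCR result has fewer pages than the document."""


def require_all_pages(result: Optional[List[Any]], expected_pages: int) -> List[Any]:
    """Return result unchanged, or raise if it covers fewer pages than expected."""
    # Assumption: one top-level entry per page; None is treated as zero pages.
    returned = len(result) if result is not None else 0
    if returned < expected_pages:
        raise MissingPagesError(
            f"OCR returned {returned} of {expected_pages} pages"
        )
    return result
```

Such a check could wrap the value returned by predict(), so that a truncated result shows up as a failed request in the logs instead of a short response.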
Possible Solutions
NA