
PaddleOCR returning only the first page when performing ocr on a PDF


Problem Description

When running PaddleOCR().ocr() on a PDF, only the first page is returned.
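
A minimal standalone check (outside the CPR wrapper, hypothetical file name) of what I would expect, counting one result entry per page:

    from paddleocr import PaddleOCR

    # Hypothetical standalone check: OCR a multi-page PDF directly and count
    # the page-level entries in the result. page_num is left at its default,
    # which as I understand it should process every page of the PDF.
    ocr = PaddleOCR(lang="en", show_log=False)
    result = ocr.ocr("sample_multipage.pdf", cls=False)
    print(f"Pages returned: {len(result)}")  # expected: one entry per page; observed: 1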

Runtime Environment

  • Paddle: 2.6.1
  • PaddleOCR: 2.7.3
  • Python: 3.10
  • Docker container:
    # Use a base image with Python 3.10
    FROM python:3.10
    
    # Install libGL.so.1
    RUN apt-get update \
        && apt-get install -y libgl1-mesa-glx \
        && rm -rf /var/lib/apt/lists/*
    

Reproduction Code

This code is the Custom Prediction Routine (CPR) wrapper around PaddleOCR. The CPR performs these steps:

  1. loads the model into memory
  2. obtains files from a Google Cloud Storage bucket and copies them to a temporary location
  3. runs PaddleOCR.ocr() on the temporary file
  4. returns the results

There is no bug in this code; I have tested it locally by calling the routines in the order the CPR performs them:

    p = PaddleOcrPredictor()
    p.load(artifacts_uri=os.path.join(src_dir, "model"))
    p.postprocess(p.predict(p.preprocess(prediction_input)))
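
Here prediction_input follows the schema documented in preprocess() below; a representative (hypothetical) payload looks like:

    prediction_input = {
        "instances": [
            {"input": "documents/sample_multipage.pdf"}  # hypothetical GCS object path
        ],
        "parameters": {
            "local_file": False,
            "model_kwargs": {"cls": False},
        },
    }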

What is very weird is that, when testing after the CPR was initially deployed, some calls returned more than one page of the PDF; on subsequent calls for the exact same document, we get only one page.
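
Locally, I check the repeat-call behaviour roughly like this (hypothetical local file path, counting one result entry per page):

    # Hypothetical repeat-call check with a local file, so GCS is not involved:
    # push the same document through the CPR routines several times and print
    # how many page-level entries come back on each call.
    p = PaddleOcrPredictor()
    p.load(artifacts_uri=os.path.join(src_dir, "model"))
    for i in range(5):
        payload = {
            "instances": [{"input": "/tmp/sample_multipage.pdf"}],
            "parameters": {"local_file": True},
        }
        out = p.postprocess(p.predict(p.preprocess(payload)))
        print(f"call {i}: {len(out['predictions'][0])} page(s) returned")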

class PaddleOcrPredictor(Predictor):

    def __init__(
        self,
    ):

        self.model_det_name = "en_PP-OCRv3_det_infer"
        self.model_rec_name = "en_PP-OCRv3_rec_infer"
        self.model_cls_name = "ch_ppocr_mobile_v2.0_cls_infer"

        self._model = None  # loaded in self.load()
        self.model_kwargs = {"cls": False}

        # self.aip_storage_uri = os.getenv("AIP_STORAGE_URI", "")

    def load(self, artifacts_uri: str = "/usr/app/model"):
        print(f"Current directory {os.getcwd()}")
        print(f"Artifacts uri {artifacts_uri}")

        model_directories = [artifacts_uri, "/usr/app/model"]
        found_directory = next(
            (dir for dir in model_directories if os.path.exists(dir)), None
        )

        if found_directory:
            artifacts_uri = found_directory
        else:
            raise Exception(
                f"Can't find model in any of the locations: {', '.join(model_directories)}"
            )

        self._model = PaddleOCR(
            det_model_dir=f"{artifacts_uri}/{self.model_det_name}/",
            rec_model_dir=f"{artifacts_uri}/{self.model_rec_name}/",
            cls_model_dir=f"{artifacts_uri}/{self.model_cls_name}",
            lang="en",
            show_log=False,
            use_gpu=True,
        )

    def preprocess(self, prediction_input: dict):
        """
        Processes input request prior to model prediction.

        :param prediction_input: Input request (JSON format) must be a json with 'instances'
        The JSON request must contain an instances field and an optional parameters field if you're using a custom container. No other fields can be present in the JSON request.

        {"instances":
            [{
            "input": file path [either to temp file in workspace, or static file] or PIL image
            "model_kwargs": dict # See for more info: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/ppstructure/docs/quickstart_en.md#24-parameter-description
            }]
        }

        """
        # Obtain prediction
        instances = prediction_input["instances"]
        instance = instances[0]
        file_name = instance["input"]

        temp_cwd = os.getcwd()

        # Set default parameters
        project_id_bucket = "default-bucket"
        bucket_name = f"{project_id_bucket}-image"
        local_file = False
        model_kwargs = {}

        print(f"Full prediction input: {prediction_input}")
        print(f"Instance value: {instance}")
        # Update parameters if they exist in prediction_input
        if "parameters" in prediction_input.keys():
            parameters = prediction_input["parameters"]
            project_id_bucket = parameters.get("project_id_bucket", project_id_bucket)
            bucket_name = parameters.get("bucket_name", bucket_name)
            local_file = parameters.get("local_file", local_file)
            model_kwargs = parameters.get("model_kwargs", model_kwargs)

        print(f"Prediction input: {file_name}")
        print(f"Current working directory: {temp_cwd}")

        if isinstance(file_name, str):

            # If it's a local file, check that the path exists; otherwise download it from GCS
            if local_file:
                if not os.path.exists(file_name):
                    raise ValueError(
                        f"File name passed, but no file in directory: {file_name}"
                    )
                local_file_path = file_name
            else:
                gcs = storage.Client(project_id_bucket)
                bucket = gcs.get_bucket(f"{bucket_name}")
                local_file_path = download_files_from_gcs_folder(
                    bucket, temp_cwd, file_name
                )
            instance.update({"input": local_file_path})

        else:
            raise ValueError(f"File path passed is not a str: {file_name}")

        instance.update({"model_kwargs": model_kwargs})

        return instance

    def predict(self, instances: dict):
        """Run OCR on the preprocessed instance and return the raw results."""
        print(instances)

        if "model_kwargs" in instances.keys():
            self.model_kwargs.update(instances["model_kwargs"])

        prediction_results = self._model.ocr(instances["input"])

        return prediction_results

    def postprocess(self, prediction_results):
        print("Post process stage reached")
        return {"predictions": [prediction_results]}
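
For reference, download_files_from_gcs_folder is a small helper not shown above; a sketch consistent with how preprocess() calls it (the real implementation may differ) is:

    import os

    def download_files_from_gcs_folder(bucket, dest_dir, file_name):
        """Hypothetical sketch of the GCS download helper; not the actual code.

        Downloads a single blob from the given google.cloud.storage bucket into
        dest_dir and returns the local path that is then handed to PaddleOCR.
        """
        blob = bucket.blob(file_name)
        local_path = os.path.join(dest_dir, os.path.basename(file_name))
        blob.download_to_filename(local_path)
        return local_path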

Complete Error Message

No error message, but PaddleOCR returns only the first page of a multi-page PDF.

Possible solutions

NA

Appendix

homiecal · Jun 17 '24