data-prep-kit
[Bug] pdf2parquet ray version erroring out when downloading models for the very first time
Search before asking
- [X] I searched the issues and found no similar issues.
Component
Tools/ingest2parquet
What happened + What you expected to happen
This happens when running the Ray version with NUM_WORKERS > 1, the first time the models are downloaded. It is reliably reproducible in Google Colab. Running the cell again works, but the initial failure is a negative user experience.
(orchestrate pid=1575) 05:41:45 ERROR - Failed to process request worker exception The actor died because of an error raised in its creation task, ray::RayTransformFileProcessor.__init__() (pid=1784, ip=172.28.0.12, actor_id=09c62ae6504057816b30599401000000, repr=<data_processing_ray.runtime.ray.transform_file_processor.RayTransformFileProcessor object at 0x7ee7e55fbc40>)
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/data_processing_ray/runtime/ray/transform_file_processor.py", line 46, in __init__
(orchestrate pid=1575) self.transform = params.get("transform_class", None)(self.transform_params)
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/pdf2parquet_transform_ray.py", line 40, in __init__
(orchestrate pid=1575) super().__init__(config)
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/pdf2parquet_transform.py", line 105, in __init__
(orchestrate pid=1575) self._converter = DocumentConverter(
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/docling/document_converter.py", line 54, in __init__
(orchestrate pid=1575) self.model_pipeline = pipeline_cls(
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/docling/pipeline/standard_model_pipeline.py", line 18, in __init__
(orchestrate pid=1575) EasyOcrModel(
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/docling/models/easyocr_model.py", line 21, in __init__
(orchestrate pid=1575) self.reader = easyocr.Reader(config["lang"])
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/easyocr/easyocr.py", line 92, in __init__
(orchestrate pid=1575) detector_path = self.getDetectorPath(detect_network)
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/easyocr/easyocr.py", line 253, in getDetectorPath
(orchestrate pid=1575) download_and_unzip(self.detection_models[self.detect_network]['url'], self.detection_models[self.detect_network]['filename'], self.model_storage_directory, self.verbose)
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/easyocr/utils.py", line 631, in download_and_unzip
(orchestrate pid=1575) os.remove(zip_path)
(orchestrate pid=1575) FileNotFoundError: [Errno 2] No such file or directory: '/root/.EasyOCR//model/temp.zip'
Reproduction script
https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_ray.ipynb
Use open-in-colab link : https://colab.research.google.com/github/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_ray.ipynb
Anything else
No response
OS
Other
Python
3.11.x
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
The error is quite obvious:
FileNotFoundError: [Errno 2] No such file or directory: '/root/.EasyOCR//model/temp.zip'
Either the file does not exist or the location is wrong.
Yes, the error is quite obvious 🤣 My suspicion is that it's caused by a race condition between workers trying to clean up downloaded artifacts.
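To illustrate the suspected race, here is a minimal sketch using only the standard library. It is not EasyOCR or docling code: threads stand in for Ray workers, a temporary directory stands in for the model directory, and `cleanup`, `tolerant_cleanup`, and `run_workers` are hypothetical helpers made up for this example. It assumes a POSIX-like filesystem, where removing an already-removed file raises FileNotFoundError.

```python
import os
import tempfile
import threading


def cleanup(zip_path):
    """Mimic an unguarded os.remove(zip_path), as seen in the traceback."""
    os.remove(zip_path)


def tolerant_cleanup(zip_path):
    """Same cleanup, but treat 'already removed by another worker' as success."""
    try:
        os.remove(zip_path)
    except FileNotFoundError:
        pass


def run_workers(cleanup_fn, num_workers=4):
    """Let num_workers threads clean up one shared temp.zip; return the errors raised."""
    errors = []
    with tempfile.TemporaryDirectory() as model_dir:
        # Hypothetical stand-in for the shared download artifact.
        zip_path = os.path.join(model_dir, "temp.zip")
        with open(zip_path, "wb") as f:
            f.write(b"placeholder")

        barrier = threading.Barrier(num_workers)

        def worker():
            barrier.wait()  # release all workers at the same moment
            try:
                cleanup_fn(zip_path)
            except FileNotFoundError as exc:
                errors.append(exc)

        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return errors


if __name__ == "__main__":
    print("unguarded failures:", len(run_workers(cleanup)))
    print("tolerant failures:", len(run_workers(tolerant_cleanup)))
```

Only one worker can win the removal, so with the unguarded version every other worker fails. Whether the actual fix should tolerate the missing file, serialize the download with a lock, or pre-download the models before the workers start is a separate design decision.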
Adding:
I see this consistently on Google Colab, because each notebook gets its own sandbox, so the downloaded models are never cached from a previous run.
To reproduce it locally, please delete the cache directory of downloaded artifacts (I am not sure where this is; it is probably created by docling?).
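For a local reproduction, the traceback above suggests the EasyOCR models are stored under `.EasyOCR` in the home directory of the user running the job. A small sketch for clearing that directory follows. The default path is an assumption inferred from the traceback, `clear_model_cache` is a hypothetical helper, and docling may keep other caches elsewhere that this does not touch.

```python
import shutil
from pathlib import Path


def clear_model_cache(cache_dir=None):
    """Delete a model cache directory; return True if something was removed."""
    # Assumption: the default location is inferred from the traceback path
    # '/root/.EasyOCR//model/temp.zip', i.e. ~/.EasyOCR for the current user.
    cache_dir = Path(cache_dir) if cache_dir is not None else Path.home() / ".EasyOCR"
    if not cache_dir.exists():
        return False
    shutil.rmtree(cache_dir)
    return True
```

Calling `clear_model_cache()` before rerunning the notebook with NUM_WORKERS > 1 should force a fresh download, which is the condition under which the error was reported.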
Related: #583
Yeah, we know exactly why. It's up to the guys to decide what to do.
Folks, is this issue fixed? cc @touma-I
Should be fixed in https://github.com/IBM/data-prep-kit/pull/756.
Closing, as the issue is fixed.