
MLflow model deployments error when deploying PyTorch model from GCS bucket - ModuleNotFoundError: No module named 'models'

Open akasantony opened this issue 4 years ago • 1 comment

Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • [ ] Yes. I can contribute a fix for this bug independently.
  • [x] Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • [ ] No. I cannot contribute a bug fix at this time.

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 18.04): Debian 4.19.194-3 (2021-07-18) x86_64 GNU/Linux
  • MLflow installed from (source or binary): binary
  • MLflow version (run mlflow --version): 1.19.0
  • MLflow TorchServe Deployment plugin installed from (source or binary): binary
  • MLflow TorchServe Deployment plugin version (run mlflow deployments --version): 0.1.0
  • TorchServe installed from (source or binary): binary
  • TorchServe version (run torchserve --version): 0.4.2
  • Python version: 3.9.6
  • Exact command to reproduce: mlflow deployments create -t torchserve -m gs://<model_bucket>/models/classnet/48d548cc841d4c2b9a06e975dec88c8e/artifacts/classnet_model --name classnet -C 'MODEL_FILE=models/classnet.py' -C 'HANDLER=model_handler.py' -C 'EXTRA_FILES=transforms.py,artifacts/models/desnse_depth.pt,models/dense_depth.py'

Describe the problem

I have trained a custom PyTorch model for an image classification problem. The model is logged to a Google Cloud Storage bucket. When I try to deploy the model to TorchServe I get a ModuleNotFoundError: No module named 'models' error. From what I understand, mlflow.pytorch.log_model() calls torch.save(model) internally, which creates a dependency on the directory structure the model class was defined in (see issue 18325).
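To illustrate the dependency, here is a minimal sketch (not MLflow or TorchServe code; the "models" package is simulated in-process): pickle serializes a class instance by storing the class's module path, not its definition, so unpickling in a process that cannot import that module fails exactly as in the log below.

```python
import pickle
import sys
import types

# Simulate a class defined in a package named "models"
# (as models/classnet.py is in the training repo).
models_mod = types.ModuleType("models")

class ClassNet:  # stand-in for the real nn.Module subclass
    pass

ClassNet.__module__ = "models"      # pretend it was defined in models/
models_mod.ClassNet = ClassNet
sys.modules["models"] = models_mod

payload = pickle.dumps(ClassNet())  # stores the reference "models.ClassNet"

# The TorchServe worker has no "models" package on its import path:
del sys.modules["models"]
try:
    pickle.loads(payload)
except ModuleNotFoundError as exc:
    print(exc)                      # No module named 'models'
```

This is why the class file passed via MODEL_FILE/EXTRA_FILES must be importable under the same module path at serving time.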

Code to reproduce issue

I have saved the MLflow model to a GCS bucket using the script below:

mlflow.pytorch.log_model(model, "{}_model".format('livenet'))

The model is deployed using the command below:

mlflow deployments create -t torchserve -m gs://<model_bucket>/models/classnet/48d548cc841d4c2b9a06e975dec88c8e/artifacts/classnet_model --name classnet -C 'MODEL_FILE=models/classnet.py' -C 'HANDLER=model_handler.py' -C 'EXTRA_FILES=transforms.py,artifacts/models/desnse_depth.pt,models/dense_depth.py'

Other info / logs

2021-08-30 10:13:37,521 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG - /tmp/models/555ed568ad5f4fb4a4ebe1b231e298fb/model.pth
2021-08-30 10:13:37,523 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG - <class 'livenet.LiveNet'>
2021-08-30 10:13:38,338 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG - Backend worker process died.
2021-08-30 10:13:38,339 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2021-08-30 10:13:38,339 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/vkyc/lib/python3.9/site-packages/ts/model_service_worker.py", line 183, in <module>
2021-08-30 10:13:38,339 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -     worker.run_server()
2021-08-30 10:13:38,339 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/vkyc/lib/python3.9/site-packages/ts/model_service_worker.py", line 155, in run_server
2021-08-30 10:13:38,339 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -     self.handle_connection(cl_socket)
2021-08-30 10:13:38,339 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/vkyc/lib/python3.9/site-packages/ts/model_service_worker.py", line 117, in handle_connection
2021-08-30 10:13:38,339 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -     service, result, code = self.load_model(msg)
2021-08-30 10:13:38,339 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/vkyc/lib/python3.9/site-packages/ts/model_service_worker.py", line 90, in load_model
2021-08-30 10:13:38,340 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -     service = model_loader.load(model_name, model_dir, handler, gpu, batch_size, envelope)
2021-08-30 10:13:38,340 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/vkyc/lib/python3.9/site-packages/ts/model_loader.py", line 110, in load
2021-08-30 10:13:38,340 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -     initialize_fn(service.context)
2021-08-30 10:13:38,340 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/vkyc/lib/python3.9/site-packages/ts/torch_handler/vision_handler.py", line 20, in initialize
2021-08-30 10:13:38,340 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -     super().initialize(context)
2021-08-30 10:13:38,340 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/vkyc/lib/python3.9/site-packages/ts/torch_handler/base_handler.py", line 69, in initialize
2021-08-30 10:13:38,340 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -     self.model = self._load_pickled_model(model_dir, model_file, model_pt_path)
2021-08-30 10:13:38,340 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/vkyc/lib/python3.9/site-packages/ts/torch_handler/base_handler.py", line 133, in _load_pickled_model
2021-08-30 10:13:38,340 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -     state_dict = torch.load(model_pt_path, map_location=self.device)
2021-08-30 10:13:38,340 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/vkyc/lib/python3.9/site-packages/torch/serialization.py", line 607, in load
2021-08-30 10:13:38,341 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -     return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
2021-08-30 10:13:38,341 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/vkyc/lib/python3.9/site-packages/torch/serialization.py", line 882, in _load
2021-08-30 10:13:38,341 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -     result = unpickler.load()
2021-08-30 10:13:38,341 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/vkyc/lib/python3.9/site-packages/torch/serialization.py", line 875, in find_class
2021-08-30 10:13:38,341 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG -     return super().find_class(mod_name, name)
2021-08-30 10:13:38,341 [INFO ] W-9000-spoofnet_1.0-stdout MODEL_LOG - ModuleNotFoundError: No module named 'models'

What component(s) does this bug affect?

Components

  • [x] area/deploy: Main deployment plugin logic
  • [ ] area/build: Build and test infrastructure for MLflow TorchServe Deployment Plugin
  • [ ] area/docs: MLflow TorchServe Deployment Plugin documentation pages
  • [ ] area/examples: Example code

akasantony avatar Aug 30 '21 10:08 akasantony

@akasantony

From the information you have shared, the model is saved using mlflow.pytorch.log_model, but while loading it, the base handler is trying to load it as a state dict.

mlflow.pytorch.log_model uses cloudpickle and saves the entire model structure using torch.save. To load the model, you can refer to the custom handler in the IrisClassification example. In that example, the model is saved using mlflow.pytorch and loaded using torch.load.
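In other words, there are two save conventions, and the default TorchServe base handler assumes the state_dict one while mlflow.pytorch.log_model writes a fully pickled module. A minimal sketch of the two conventions, using plain pickle as a stand-in for torch.save (TinyNet and its attributes are hypothetical):

```python
import pickle

class TinyNet:
    """Stand-in for an nn.Module with one parameter."""
    def __init__(self):
        self.weight = 0.5
    def state_dict(self):
        return {"weight": self.weight}
    def load_state_dict(self, sd):
        self.weight = sd["weight"]

# Convention A: full pickled model (what mlflow.pytorch.log_model does).
# Loading needs no separate model construction, but it does need the
# class's module to be importable at load time.
full_blob = pickle.dumps(TinyNet())
model = pickle.loads(full_blob)

# Convention B: state_dict only (what the default base handler expects).
# Loading needs the class to build a fresh instance first.
sd_blob = pickle.dumps(TinyNet().state_dict())
fresh = TinyNet()
fresh.load_state_dict(pickle.loads(sd_blob))

assert model.weight == fresh.weight == 0.5
```

A custom handler that calls torch.load on the full pickled model (convention A), as the IrisClassification example does, resolves the mismatch, provided the class file is shipped via EXTRA_FILES so its module can be imported.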

We are about to make another release; a lot of changes have gone in since the 0.1.0 release. Can you please install the mlflow-torchserve plugin from source? Reference: https://github.com/mlflow/mlflow-torchserve/blob/master/README.md#installation

shrinath-suresh avatar Sep 07 '21 18:09 shrinath-suresh