
Multi-model serving: IsADirectoryError: [Errno 21] Is a directory: '/mnt/models'

yc2984 opened this issue

Hi, I'm having this error (full log below) when serving two sklearn models on my local kind cluster, following this guide. I chatted with Alejandro on Slack about this issue and he was able to reproduce it. Link to the thread: https://seldondev.slack.com/archives/C03DQFTFXMX/p1659988373014049

  File "/usr/local/bin/mlserver", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/mlserver/cli/main.py", line 79, in main
    root()
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/mlserver/cli/main.py", line 20, in wrapper
    return asyncio.run(f(*args, **kwargs))
  File "/usr/local/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1501, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.8/site-packages/mlserver/cli/main.py", line 44, in start
    await server.start(models_settings)
  File "/usr/local/lib/python3.8/site-packages/mlserver/server.py", line 98, in start
    await asyncio.gather(
  File "/usr/local/lib/python3.8/site-packages/mlserver/registry.py", line 272, in load
    return await self._models[model_settings.name].load(model_settings)
  File "/usr/local/lib/python3.8/site-packages/mlserver/registry.py", line 143, in load
    await self._load_model(new_model)
  File "/usr/local/lib/python3.8/site-packages/mlserver/registry.py", line 151, in _load_model
    await model.load()
  File "/usr/local/lib/python3.8/site-packages/mlserver_sklearn/sklearn.py", line 36, in load
    self._model = joblib.load(model_uri)
  File "/usr/local/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 579, in load
    with open(filename, 'rb') as f:
IsADirectoryError: [Errno 21] Is a directory: '/mnt/models'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/local/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/local/lib/python3.8/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
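The root cause is visible in the last frames of the first traceback: joblib.load is being handed the repository folder itself (/mnt/models) rather than a model file. Since joblib ultimately calls open(filename, 'rb'), the same error can be reproduced in isolation (a minimal sketch using a temporary directory in place of /mnt/models):

```python
import tempfile

# Stand-in for /mnt/models: any directory will do.
model_dir = tempfile.mkdtemp()

try:
    # joblib.load eventually does open(filename, "rb"); passing a
    # directory raises the exact IsADirectoryError seen in the log.
    with open(model_dir, "rb") as f:
        f.read()
except IsADirectoryError as err:
    print("reproduced IsADirectoryError, errno", err.errno)
```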

My Seldon deployment definition file:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: multi-model
  namespace: seldon
spec:
  protocol: v2
  name: multi-model
  predictors:
  - graph:
      type: MODEL
      implementation: SKLEARN_SERVER
      modelUri: gs://<bucket_name>/MODELS/MultiModel
      name: multi
      parameters:
        - name: method
          type: STRING
          value: predict
      envSecretRefName: seldon-rclone-secret
    name: default
    replicas: 1

Inside MultiModel directory:

├── IrisModel
│   ├── model-settings.json
│   └── model.joblib
├── RandomForestModel
│   ├── model-settings.json
│   └── model.joblib
├── multi_model.yaml
└── settings.json

The two model-settings.json files:

{ "name": "IrisModel", "implementation": "mlserver_sklearn.SKLearnModel" }

and

{ "name": "RandomForestModel", "implementation": "mlserver_sklearn.SKLearnModel" }

yc2984 · Aug 10 '22 13:08

Thank you for reporting this issue. I have managed to replicate it; it seems the culprit is the model_uri functionality, which must have stopped working due to some changes for multi-model serving. It can be replicated by running:

MLSERVER_MODEL_URI=/model/path mlserver start /model/path

We'll have a deeper look and push a fix.

axsaucedo · Aug 10 '22 17:08

You should currently be able to work around this with an env var override:

kubectl apply -f - << END
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-model
spec:
  protocol: v2
  predictors:
  - name: default
    graph:
      name: model
      implementation: SKLEARN_SERVER
      modelUri: s3://seldon-models/mlserver/mms
      envSecretRefName: seldon-init-container-secret
    componentSpecs:
      - spec:
          containers:
          - name: model
            env:
            - name: MLSERVER_MODEL_URI
              value: ""
END


axsaucedo · Aug 10 '22 18:08

Hey @yc2984 ,

Following up from @axsaucedo's suggestion, have you been able to try out this workaround?

adriangonz · Aug 16 '22 15:08

@adriangonz No, I was not able to make it work with this workaround (seldon-core-operator-1.14.0). I had to specify an image, otherwise I got this error: Deployment.apps "multi-model-default-0-model-multi" is invalid: spec.template.spec.containers[0].image: Required value. And when I did, I saw an additional container created with the name model, with nothing happening in it. The container called multi still has the same error. File that I used:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: multi-model
  namespace: seldon
spec:
  protocol: v2
  name: multi-model
  predictors:
  - graph:
      type: MODEL
      implementation: SKLEARN_SERVER
      modelUri: gs://<bucket>/MODELS/MultiModel
      name: multi
      parameters:
        - name: method
          type: STRING
          value: predict
      envSecretRefName: seldon-rclone-secret
    componentSpecs:
      - spec:
          containers:
          - name: model
            image: local_repo/mlserver-sklearn:v1.0-no-setting
            env:
            - name: MLSERVER_MODEL_URI
              value: ""
    name: default
    replicas: 1

yc2984 · Aug 16 '22 20:08

Hey @yc2984 ,

Thanks for trying that out.

The errors you mention seem to be caused by syntax issues within the SeldonDeployment manifest itself, not by MLServer. I've amended the manifest based on what you shared.

Could you try the one below?

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: multi-model
  namespace: seldon
spec:
  protocol: v2
  name: multi-model
  predictors:
  - graph:
      type: MODEL
      implementation: SKLEARN_SERVER
      modelUri: gs://<bucket>/MODELS/MultiModel
      name: multi
      envSecretRefName: seldon-rclone-secret
    componentSpecs:
      - spec:
          containers:
          - name: multi
            env:
            - name: MLSERVER_MODEL_URI
              value: ""
    name: default
    replicas: 1

For extra context, the name field of the entries under componentSpecs[].spec.containers[] must match the entries under graph[]. That is, if the node of your inference graph is named multi, then the entry of your containers[] list must also be named multi (as it will override some parameters of the pod spec for that particular node).

adriangonz · Aug 17 '22 08:08

Hey @adriangonz, I tried this config and got the same error.

yc2984 · Aug 17 '22 17:08

Hey @yc2984 ,

Could you share the exact manifest you tried and the error you got?

adriangonz · Aug 23 '22 16:08

Hey @adriangonz The file I'm using:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: multi-model
  namespace: seldon
spec:
  protocol: v2
  name: multi-model
  predictors:
  - graph:
      type: MODEL
      implementation: SKLEARN_SERVER
      modelUri: gs://<bucket>/MODELS/MultiModel
      name: multi
      parameters:
        - name: method
          type: STRING
          value: predict
      envSecretRefName: seldon-rclone-secret
    componentSpecs:
      - spec:
          containers:
          - name: multi
#            image: local_repo/mlserver-sklearn:v1.0-no-setting
            env:
            - name: MLSERVER_MODEL_URI
              value: ""
    name: default
    replicas: 1

Error (exactly the same as the original post): https://gist.github.com/yc2984/70c6690555db0b352956d28a7160fcd2

│ ERROR:    Traceback (most recent call last):                                                                                                                                                      │
│   File "/usr/local/lib/python3.8/asyncio/runners.py", line 44, in run                                                                                                                             │
│     return loop.run_until_complete(main)                                                                                                                                                          │
│   File "uvloop/loop.pyx", line 1501, in uvloop.loop.Loop.run_until_complete                                                                                                                       │
│   File "/usr/local/lib/python3.8/site-packages/mlserver/cli/main.py", line 44, in start                                                                                                           │
│     await server.start(models_settings)                                                                                                                                                           │
│   File "/usr/local/lib/python3.8/site-packages/mlserver/server.py", line 98, in start                                                                                                             │
│     await asyncio.gather(                                                                                                                                                                         │
│   File "/usr/local/lib/python3.8/site-packages/mlserver/registry.py", line 272, in load                                                                                                           │
│     return await self._models[model_settings.name].load(model_settings)                                                                                                                           │
│   File "/usr/local/lib/python3.8/site-packages/mlserver/registry.py", line 143, in load                                                                                                           │
│     await self._load_model(new_model)                                                                                                                                                             │
│   File "/usr/local/lib/python3.8/site-packages/mlserver/registry.py", line 151, in _load_model                                                                                                    │
│     await model.load()                                                                                                                                                                            │
│   File "/usr/local/lib/python3.8/site-packages/mlserver_sklearn/sklearn.py", line 36, in load                                                                                                     │
│     self._model = joblib.load(model_uri)                                                                                                                                                          │
│   File "/usr/local/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 579, in load                                                                                                         │
│     with open(filename, 'rb') as f:                                                                                                                                                               │
│ IsADirectoryError: [Errno 21] Is a directory: '/mnt/models'                                                                                                                                       │
│                                                                                                                                                                                                   │
│ During handling of the above exception, another exception occurred:                                                                                                                               │
│                                                                                                                                                                                                   │
│ Traceback (most recent call last):                                                                                                                                                                │
│   File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 638, in lifespan                                                                                                       │
│     await receive()                                                                                                                                                                               │
│   File "/usr/local/lib/python3.8/site-packages/uvicorn/lifespan/on.py", line 135, in receive                                                                                                      │
│     return await self.receive_queue.get()                                                                                                                                                         │
│   File "/usr/local/lib/python3.8/asyncio/queues.py", line 163, in get                                                                                                                             │
│     await getter                                                                                                                                                                                  │
│ asyncio.exceptions.CancelledError                     

yc2984 · Aug 23 '22 17:08

Hey @yc2984 ,

Thanks for sharing that.

It seems that you're using a custom image. Would you be able to test with the standard SKLEARN_SERVER pre-packaged server that comes out of the box with Seldon Core? This will ensure there are no side effects coming from the custom image.

adriangonz · Aug 24 '22 08:08

@adriangonz After updating the configmap, I'm still able to reproduce this error with the default SKLEARN_SERVER.

"SKLEARN_SERVER":{"protocols":{"seldon":{"defaultImageVersion":"1.14.1","image":"seldonio/sklearnserver"},"v2":{"defaultImageVersion":"1.1.0-sklearn","image":"seldonio/mlserver"}}}

apiVersion: v1
data:
  credentials: '{"gcs":{"gcsCredentialFileName":"gcloud-application-credentials.json"},"s3":{"s3AccessKeyIDName":"awsAccessKeyID","s3SecretAccessKeyName":"awsSecretAccessKey"}}'
  explainer: '{"image":"seldonio/alibiexplainer:1.14.0","image_v2":"seldonio/mlserver:1.1.0-alibi-explain"}'
  predictor_servers: '{"SKLEARN_V_1_SERVER": {"protocols": {"seldon": {"defaultImageVersion":
    "1.14.0", "image": "seldonio/sklearnserver"}, "v2": {"defaultImageVersion": "v1.0-no-setting",
    "image": "gcr.io/ewx-tryout/mlserver-sklearn"}}} ,"HUGGINGFACE_SERVER":{"protocols":{"v2":{"defaultImageVersion":"1.1.0-huggingface","image":"seldonio/mlserver"}}},"MLFLOW_SERVER":{"protocols":{"seldon":{"defaultImageVersion":"1.14.1","image":"seldonio/mlflowserver"},"v2":{"defaultImageVersion":"1.1.0-mlflow","image":"seldonio/mlserver"}}},"SKLEARN_SERVER":{"protocols":{"seldon":{"defaultImageVersion":"1.14.1","image":"seldonio/sklearnserver"},"v2":{"defaultImageVersion":"1.1.0-sklearn","image":"seldonio/mlserver"}}},"TEMPO_SERVER":{"protocols":{"v2":{"defaultImageVersion":"1.1.0-slim","image":"seldonio/mlserver"}}},"TENSORFLOW_SERVER":{"protocols":{"seldon":{"defaultImageVersion":"1.14.1","image":"seldonio/tfserving-proxy"},"tensorflow":{"defaultImageVersion":"2.1.0","image":"tensorflow/serving"}}},"TRITON_SERVER":{"protocols":{"v2":{"defaultImageVersion":"21.08-py3","image":"nvcr.io/nvidia/tritonserver"}}},"XGBOOST_SERVER":{"protocols":{"seldon":{"defaultImageVersion":"1.14.1","image":"seldonio/xgboostserver"},"v2":{"defaultImageVersion":"1.1.0-xgboost","image":"seldonio/mlserver"}}}}'
  storageInitializer: '{"cpuLimit":"1","cpuRequest":"100m","image":"seldonio/rclone-storage-initializer:1.14.0","memoryLimit":"1Gi","memoryRequest":"100Mi"}'
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"credentials":"{\"gcs\":{\"gcsCredentialFileName\":\"gcloud-application-credentials.json\"},\"s3\":{\"s3AccessKeyIDName\":\"awsAccessKeyID\",\"s3SecretAccessKeyName\":\"awsSecretAccessKey\"}}","explainer":"{\"image\":\"seldonio/alibiexplainer:1.14.0\",\"image_v2\":\"seldonio/mlserver:1.1.0-alibi-explain\"}","predictor_servers":"{\"SKLEARN_V_1_SERVER\": {\"protocols\": {\"seldon\": {\"defaultImageVersion\": \"1.14.0\", \"image\": \"seldonio/sklearnserver\"}, \"v2\": {\"defaultImageVersion\": \"v1.0-no-setting\", \"image\": \"gcr.io/ewx-tryout/mlserver-sklearn\"}}} ,\"HUGGINGFACE_SERVER\":{\"protocols\":{\"v2\":{\"defaultImageVersion\":\"1.1.0-huggingface\",\"image\":\"seldonio/mlserver\"}}},\"MLFLOW_SERVER\":{\"protocols\":{\"seldon\":{\"defaultImageVersion\":\"1.14.1\",\"image\":\"seldonio/mlflowserver\"},\"v2\":{\"defaultImageVersion\":\"1.1.0-mlflow\",\"image\":\"seldonio/mlserver\"}}},\"SKLEARN_SERVER\":{\"protocols\":{\"seldon\":{\"defaultImageVersion\":\"1.14.1\",\"image\":\"seldonio/sklearnserver\"},\"v2\":{\"defaultImageVersion\":\"1.1.0-sklearn\",\"image\":\"seldonio/mlserver\"}}},\"TEMPO_SERVER\":{\"protocols\":{\"v2\":{\"defaultImageVersion\":\"1.1.0-slim\",\"image\":\"seldonio/mlserver\"}}},\"TENSORFLOW_SERVER\":{\"protocols\":{\"seldon\":{\"defaultImageVersion\":\"1.14.1\",\"image\":\"seldonio/tfserving-proxy\"},\"tensorflow\":{\"defaultImageVersion\":\"2.1.0\",\"image\":\"tensorflow/serving\"}}},\"TRITON_SERVER\":{\"protocols\":{\"v2\":{\"defaultImageVersion\":\"21.08-py3\",\"image\":\"nvcr.io/nvidia/tritonserver\"}}},\"XGBOOST_SERVER\":{\"protocols\":{\"seldon\":{\"defaultImageVersion\":\"1.14.1\",\"image\":\"seldonio/xgboostserver\"},\"v2\":{\"defaultImageVersion\":\"1.1.0-xgboost\",\"image\":\"seldonio/mlserver\"}}}}","storageInitializer":"{\"cpuLimit\":\"1\",\"cpuRequest\":\"100m\",\"image\":\"seldonio/rclone-storage-initializer:1.14.0\",\"memoryLimit\":\"1Gi\",\"memoryRequest\":\"100Mi\"}"},"kind":"ConfigMap","metad
ata":{"annotations":{"meta.helm.sh/release-name":"seldon-core","meta.helm.sh/release-namespace":"seldon-system"},"creationTimestamp":"2022-08-05T08:29:43Z","labels":{"app":"seldon","app.kubernetes.io/instance":"seldon-core","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"seldon-core-operator","app.kubernetes.io/version":"1.14.0","control-plane":"seldon-controller-manager"},"name":"seldon-config","namespace":"seldon-system","resourceVersion":"1364227","uid":"48da8e56-d2ba-4a61-bd5e-eb4d8b6baaea"}}
    meta.helm.sh/release-name: seldon-core
    meta.helm.sh/release-namespace: seldon-system
  creationTimestamp: "2022-08-05T08:29:43Z"
  labels:
    app: seldon
    app.kubernetes.io/instance: seldon-core
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: seldon-core-operator
    app.kubernetes.io/version: 1.14.0
    control-plane: seldon-controller-manager
  name: seldon-config
  namespace: seldon-system
  resourceVersion: "1784725"
  uid: 48da8e56-d2ba-4a61-bd5e-eb4d8b6baaea

yc2984 · Aug 24 '22 09:08

Hey @yc2984 ,

After having a deeper look, we've managed to confirm that Seldon Core doesn't currently let you override the MLSERVER_MODEL_URI env var through the SeldonDeployment manifest (it's always forced to be /mnt/models). This means that all the different models that you load from your model repository will get that URI by default, and thus will fail to load since their model artefacts are not in the /mnt/models folder.

We have opened an issue on the Seldon Core repo to address this, which can be seen below:

https://github.com/SeldonIO/seldon-core/issues/4298

In the meantime, as a temporary workaround, you can explicitly set each model's URI through the parameters.uri field of its model-settings.json file. For example:

{
  "name": "mnist-svm",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parameters": {
    "version": "v0.1.0",
    "uri": "model.joblib"
  }
}

This will ensure that the models ignore the default value provided through the MLSERVER_MODEL_URI env var and can get loaded correctly.
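The precedence can be sketched roughly as follows (resolve_model_uri is a hypothetical helper for illustration, not MLServer's actual implementation): a relative parameters.uri is resolved against the folder holding that model's model-settings.json, while a model without one falls back to the shared MLSERVER_MODEL_URI default, which under Seldon Core is always the /mnt/models directory itself.

```python
import os

def resolve_model_uri(settings_dir, parameters_uri=None, env_default=None):
    # Hypothetical sketch of the precedence, NOT MLServer's real code:
    # parameters.uri (relative to model-settings.json's folder) wins;
    # otherwise the shared env default (e.g. /mnt/models) is used.
    if parameters_uri:
        return os.path.join(settings_dir, parameters_uri)
    return env_default

# With parameters.uri set, each model gets its own artefact path:
print(resolve_model_uri("/mnt/models/IrisModel", "model.joblib"))
# Without it, every model points at the directory itself and fails:
print(resolve_model_uri("/mnt/models/IrisModel", None, "/mnt/models"))
```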

Note that, since this is not an MLServer issue, we will close it for now. However, feel free to continue the conversation on the new https://github.com/SeldonIO/seldon-core/issues/4298 ticket.

adriangonz · Aug 24 '22 16:08

Hi @adriangonz, I tried what you suggested but it still doesn't work; I'm getting the same issue.

yc2984 · Aug 28 '22 20:08

Hey @yc2984,

Could you provide more details of what you tried and how it failed?

Did you try updating all the model-settings.json files in your model repo to include the parameters.uri parameter?

adriangonz · Aug 31 '22 07:08

Directory structure:

├── IrisModel
│   ├── model-settings.json
│   └── model.joblib
├── RandomForestModel
│   ├── model-settings.json
│   └── model.joblib
└── settings.json

Content of the model-settings.json files:

{
  "name": "IrisModel",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parallel_workers": 1,
  "parameters": {
    "uri": "IrisModel/model.joblib"
  }
}
{
  "name": "RandomForestModel",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parallel_workers": 1,
  "parameters": {
    "uri": "RandomForestModel/model.joblib"
  }
}

settings.json:

{
  "debug": "true"
}

seldon deployment definition:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: multi-model
  namespace: seldon
spec:
  protocol: v2
  name: multi-model
  predictors:
  - graph:
      type: MODEL
      implementation: SKLEARN_SERVER
      modelUri: gs://bucket/MODELS/MultiModel
      name: multi
      parameters:
        - name: method
          type: STRING
          value: predict
      envSecretRefName: seldon-rclone-secret
    name: default
    replicas: 1

yc2984 · Sep 05 '22 08:09

Hey @yc2984 ,

Thanks for sharing those.

A couple points from my side based on that:

  1. The parameters.uri field is relative to where the model-settings.json file lives. Therefore, it should be just model.joblib in both cases.
  2. For each new iteration, it would be really useful if you could also share the output from your server (i.e. "how" it fails), as it may be different from previous attempts.
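That relative-resolution rule can be checked mechanically before deploying. A hedged sketch (check_repo is a hypothetical helper, not part of MLServer) that walks a multi-model repository and flags any model-settings.json whose parameters.uri does not resolve, relative to its own folder, to an existing file:

```python
import json
import os

def check_repo(repo_root):
    # Walk the repository; for every model-settings.json found, resolve
    # parameters.uri against that file's own directory and report any
    # (model name, resolved path) pair that is not a real file.
    problems = []
    for dirpath, _dirs, files in os.walk(repo_root):
        if "model-settings.json" not in files:
            continue
        with open(os.path.join(dirpath, "model-settings.json")) as f:
            settings = json.load(f)
        uri = settings.get("parameters", {}).get("uri", "")
        resolved = os.path.join(dirpath, uri)
        if not os.path.isfile(resolved):
            problems.append((settings.get("name"), resolved))
    return problems
```

Running this against the layout above would catch the IrisModel/model.joblib-style paths, since those resolve to IrisModel/IrisModel/model.joblib.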

adriangonz · Sep 05 '22 13:09