MLServer
multi model serving IsADirectoryError: [Errno 21] Is a directory: '/mnt/models'
Hi, I'm having this error (full log below) when serving two sklearn models on my local kind cluster, following this guide. I chatted with Alejandro about this issue on Slack and he was able to reproduce it; link to the thread: https://seldondev.slack.com/archives/C03DQFTFXMX/p1659988373014049
File "/usr/local/bin/mlserver", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/mlserver/cli/main.py", line 79, in main
root()
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/mlserver/cli/main.py", line 20, in wrapper
return asyncio.run(f(*args, **kwargs))
File "/usr/local/lib/python3.8/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "uvloop/loop.pyx", line 1501, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.8/site-packages/mlserver/cli/main.py", line 44, in start
await server.start(models_settings)
File "/usr/local/lib/python3.8/site-packages/mlserver/server.py", line 98, in start
await asyncio.gather(
File "/usr/local/lib/python3.8/site-packages/mlserver/registry.py", line 272, in load
return await self._models[model_settings.name].load(model_settings)
File "/usr/local/lib/python3.8/site-packages/mlserver/registry.py", line 143, in load
await self._load_model(new_model)
File "/usr/local/lib/python3.8/site-packages/mlserver/registry.py", line 151, in _load_model
await model.load()
File "/usr/local/lib/python3.8/site-packages/mlserver_sklearn/sklearn.py", line 36, in load
self._model = joblib.load(model_uri)
File "/usr/local/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 579, in load
with open(filename, 'rb') as f:
IsADirectoryError: [Errno 21] Is a directory: '/mnt/models'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/local/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/local/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/local/lib/python3.8/multiprocessing/synchronize.py", line 110, in __setstate__
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
My Seldon deployment definition file:
metadata:
  name: multi-model
  namespace: seldon
spec:
  protocol: v2
  name: multi-model
  predictors:
  - graph:
      type: MODEL
      implementation: SKLEARN_SERVER
      modelUri: gs://<bucket_name>/MODELS/MultiModel
      name: multi
      parameters:
      - name: method
        type: STRING
        value: predict
      envSecretRefName: seldon-rclone-secret
    name: default
    replicas: 1
Inside the MultiModel directory:
├── IrisModel
│ ├── model-settings.json
│ └── model.joblib
├── RandomForestModel
│ ├── model-settings.json
│ └── model.joblib
├── multi_model.yaml
└── settings.json
model-settings.json files:
{ "name": "IrisModel", "implementation": "mlserver_sklearn.SKLearnModel" }
and
{ "name": "RandomForestModel", "implementation": "mlserver_sklearn.SKLearnModel" }
Thank you for reporting this issue, I have managed to replicate it. The culprit seems to be the model_uri functionality, which must have stopped working due to some changes for multi-model serving. It can be replicated by running:
MLSERVER_MODEL_URI=/model/path mlserver start /model/path
We'll have a deeper look and push a fix.
In the meantime, you should be able to work around this by overriding the env var:
kubectl apply -f - << END
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-model
spec:
  protocol: v2
  predictors:
  - name: default
    graph:
      name: model
      implementation: SKLEARN_SERVER
      modelUri: s3://seldon-models/mlserver/mms
      envSecretRefName: seldon-init-container-secret
    componentSpecs:
    - spec:
        containers:
        - name: model
          env:
          - name: MLSERVER_MODEL_URI
            value: ""
END
Hey @yc2984,
Following up on @axsaucedo's suggestion, have you been able to try out this workaround?
@adriangonz No, I was not able to make it work with this workaround (seldon-core-operator-1.14.0). I had to specify an image, otherwise I got this error: Deployment.apps "multi-model-default-0-model-multi" is invalid: spec.template.spec.containers[0].image: Required value. And when I did, I saw an additional container created with the name model, but nothing was happening in that container. The container called multi still has the same error.
File that I used:
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: multi-model
  namespace: seldon
spec:
  protocol: v2
  name: multi-model
  predictors:
  - graph:
      type: MODEL
      implementation: SKLEARN_SERVER
      modelUri: gs://<bucket>/MODELS/MultiModel
      name: multi
      parameters:
      - name: method
        type: STRING
        value: predict
      envSecretRefName: seldon-rclone-secret
    componentSpecs:
    - spec:
        containers:
        - name: model
          image: local_repo/mlserver-sklearn:v1.0-no-setting
          env:
          - name: MLSERVER_MODEL_URI
            value: ""
    name: default
    replicas: 1
Hey @yc2984,
Thanks for trying that out.
The errors you mention seem to be caused by syntax issues within the SeldonDeployment manifest itself, not by MLServer. I've amended the manifest based on what you shared.
Could you try the one below?
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: multi-model
  namespace: seldon
spec:
  protocol: v2
  name: multi-model
  predictors:
  - graph:
      type: MODEL
      implementation: SKLEARN_SERVER
      modelUri: gs://<bucket>/MODELS/MultiModel
      name: multi
      envSecretRefName: seldon-rclone-secret
    componentSpecs:
    - spec:
        containers:
        - name: multi
          env:
          - name: MLSERVER_MODEL_URI
            value: ""
    name: default
    replicas: 1
For extra context, the name field of the entries under componentSpecs[].spec.containers[] must match the entries under graph[]. That is, if the node of your inference graph is named multi, then the entry of your containers[] list must also be named multi (as it will override some parameters of the pod spec for that particular node).
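In other words, stripped down to just that pairing, the relevant part of the manifest looks roughly like this (the bucket path is a placeholder):
predictors:
- name: default
  graph:
    name: multi                  # inference graph node
    implementation: SKLEARN_SERVER
    modelUri: gs://<bucket>/MODELS/MultiModel
  componentSpecs:
  - spec:
      containers:
      - name: multi              # must match the graph node name above
        env:
        - name: MLSERVER_MODEL_URI
          value: ""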
Hey @adriangonz, I tried this config and got the same error.
Hey @yc2984,
Could you share the exact manifest you tried and the error you got?
Hey @adriangonz, the file I'm using:
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: multi-model
  namespace: seldon
spec:
  protocol: v2
  name: multi-model
  predictors:
  - graph:
      type: MODEL
      implementation: SKLEARN_SERVER
      modelUri: gs://<bucket>/MODELS/MultiModel
      name: multi
      parameters:
      - name: method
        type: STRING
        value: predict
      envSecretRefName: seldon-rclone-secret
    componentSpecs:
    - spec:
        containers:
        - name: multi
          # image: local_repo/mlserver-sklearn:v1.0-no-setting
          env:
          - name: MLSERVER_MODEL_URI
            value: ""
    name: default
    replicas: 1
Error (exactly the same as in the original post): https://gist.github.com/yc2984/70c6690555db0b352956d28a7160fcd2
ERROR: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1501, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.8/site-packages/mlserver/cli/main.py", line 44, in start
    await server.start(models_settings)
  File "/usr/local/lib/python3.8/site-packages/mlserver/server.py", line 98, in start
    await asyncio.gather(
  File "/usr/local/lib/python3.8/site-packages/mlserver/registry.py", line 272, in load
    return await self._models[model_settings.name].load(model_settings)
  File "/usr/local/lib/python3.8/site-packages/mlserver/registry.py", line 143, in load
    await self._load_model(new_model)
  File "/usr/local/lib/python3.8/site-packages/mlserver/registry.py", line 151, in _load_model
    await model.load()
  File "/usr/local/lib/python3.8/site-packages/mlserver_sklearn/sklearn.py", line 36, in load
    self._model = joblib.load(model_uri)
  File "/usr/local/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 579, in load
    with open(filename, 'rb') as f:
IsADirectoryError: [Errno 21] Is a directory: '/mnt/models'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 638, in lifespan
    await receive()
  File "/usr/local/lib/python3.8/site-packages/uvicorn/lifespan/on.py", line 135, in receive
    return await self.receive_queue.get()
  File "/usr/local/lib/python3.8/asyncio/queues.py", line 163, in get
    await getter
asyncio.exceptions.CancelledError
Hey @yc2984,
Thanks for sharing that.
It seems that you're using a custom image. Would you be able to test with the standard SKLEARN_SERVER pre-packaged server that comes out of the box with Seldon Core? This will ensure there are no side effects coming from the custom image.
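One way to double-check which image the predictor container actually ended up running (the pod name below is a placeholder) is to inspect the pod spec, for example:
# Placeholder pod name; lists the images of all containers in the pod.
kubectl -n seldon get pod <multi-model-pod> -o jsonpath='{.spec.containers[*].image}'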
@adriangonz After updating the configmap, I'm still able to reproduce this error with the default SKLEARN_SERVER. The relevant entry, followed by the full seldon-config ConfigMap:
"SKLEARN_SERVER":{"protocols":{"seldon":{"defaultImageVersion":"1.14.1","image":"seldonio/sklearnserver"},"v2":{"defaultImageVersion":"1.1.0-sklearn","image":"seldonio/mlserver"}}}
apiVersion: v1
data:
  credentials: '{"gcs":{"gcsCredentialFileName":"gcloud-application-credentials.json"},"s3":{"s3AccessKeyIDName":"awsAccessKeyID","s3SecretAccessKeyName":"awsSecretAccessKey"}}'
  explainer: '{"image":"seldonio/alibiexplainer:1.14.0","image_v2":"seldonio/mlserver:1.1.0-alibi-explain"}'
  predictor_servers: '{"SKLEARN_V_1_SERVER": {"protocols": {"seldon": {"defaultImageVersion":
    "1.14.0", "image": "seldonio/sklearnserver"}, "v2": {"defaultImageVersion": "v1.0-no-setting",
    "image": "gcr.io/ewx-tryout/mlserver-sklearn"}}} ,"HUGGINGFACE_SERVER":{"protocols":{"v2":{"defaultImageVersion":"1.1.0-huggingface","image":"seldonio/mlserver"}}},"MLFLOW_SERVER":{"protocols":{"seldon":{"defaultImageVersion":"1.14.1","image":"seldonio/mlflowserver"},"v2":{"defaultImageVersion":"1.1.0-mlflow","image":"seldonio/mlserver"}}},"SKLEARN_SERVER":{"protocols":{"seldon":{"defaultImageVersion":"1.14.1","image":"seldonio/sklearnserver"},"v2":{"defaultImageVersion":"1.1.0-sklearn","image":"seldonio/mlserver"}}},"TEMPO_SERVER":{"protocols":{"v2":{"defaultImageVersion":"1.1.0-slim","image":"seldonio/mlserver"}}},"TENSORFLOW_SERVER":{"protocols":{"seldon":{"defaultImageVersion":"1.14.1","image":"seldonio/tfserving-proxy"},"tensorflow":{"defaultImageVersion":"2.1.0","image":"tensorflow/serving"}}},"TRITON_SERVER":{"protocols":{"v2":{"defaultImageVersion":"21.08-py3","image":"nvcr.io/nvidia/tritonserver"}}},"XGBOOST_SERVER":{"protocols":{"seldon":{"defaultImageVersion":"1.14.1","image":"seldonio/xgboostserver"},"v2":{"defaultImageVersion":"1.1.0-xgboost","image":"seldonio/mlserver"}}}}'
  storageInitializer: '{"cpuLimit":"1","cpuRequest":"100m","image":"seldonio/rclone-storage-initializer:1.14.0","memoryLimit":"1Gi","memoryRequest":"100Mi"}'
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: seldon-core
    meta.helm.sh/release-namespace: seldon-system
  creationTimestamp: "2022-08-05T08:29:43Z"
  labels:
    app: seldon
    app.kubernetes.io/instance: seldon-core
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: seldon-core-operator
    app.kubernetes.io/version: 1.14.0
    control-plane: seldon-controller-manager
  name: seldon-config
  namespace: seldon-system
  resourceVersion: "1784725"
  uid: 48da8e56-d2ba-4a61-bd5e-eb4d8b6baaea
Hey @yc2984,
After having a deeper look, we've managed to confirm that Seldon Core doesn't currently let you override the MLSERVER_MODEL_URI env var through the SeldonDeployment manifest (it's always forced to be /mnt/models). This means that all the different models that you load from your model repository will get that URI by default, and thus will fail to load, since their model artefacts are not in the /mnt/models folder.
We have opened an issue on the Seldon Core repo to address this, which can be seen below:
https://github.com/SeldonIO/seldon-core/issues/4298
In the meantime though, as a temporary workaround, you can explicitly set each model's URI through the parameters.uri field of their model-settings.json file. For example, doing something like:
{
  "name": "mnist-svm",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parameters": {
    "version": "v0.1.0",
    "uri": "model.joblib"
  }
}
This will ensure that the models ignore the default value provided through the MLSERVER_MODEL_URI env var and can get loaded correctly.
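As an additional sanity check (the path below is a placeholder for a local copy of your model repository), you could also confirm that the repository loads correctly outside of Kubernetes by pointing mlserver start at it:
# Placeholder path; MLServer discovers each model-settings.json under the repository folder.
mlserver start ./MultiModel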
Note that, since this is not an MLServer issue, we will close it for now. However, feel free to continue the conversation on the new https://github.com/SeldonIO/seldon-core/issues/4298 ticket.
Hi @adriangonz, I tried what you said but it still doesn't work; I'm having the same issue.
Hey @yc2984,
Could you provide more details of what you tried and how it failed?
Did you try updating all the model-settings.json files in your model repo to include the parameters.uri parameter?
Directory structure:
├── IrisModel
│ ├── model-settings.json
│ └── model.joblib
├── RandomForestModel
│ ├── model-settings.json
│ └── model.joblib
└── settings.json
Content of the model-settings.json files:
{
  "name": "IrisModel",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parallel_workers": 1,
  "parameters": {
    "uri": "IrisModel/model.joblib"
  }
}

{
  "name": "RandomForestModel",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parallel_workers": 1,
  "parameters": {
    "uri": "RandomForestModel/model.joblib"
  }
}
settings.json:
{
  "debug": "true"
}
seldon deployment definition:
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: multi-model
  namespace: seldon
spec:
  protocol: v2
  name: multi-model
  predictors:
  - graph:
      type: MODEL
      implementation: SKLEARN_SERVER
      modelUri: gs://bucket/MODELS/MultiModel
      name: multi
      parameters:
      - name: method
        type: STRING
        value: predict
      envSecretRefName: seldon-rclone-secret
    name: default
    replicas: 1
Hey @yc2984,
Thanks for sharing those.
A couple of points from my side based on that:
- The parameters.uri field is relative to where the model-settings.json file lives. Therefore, it should be just model.joblib in both cases (see the corrected files below).
- For each new iteration, it would be really useful if you could also share the output from your server (i.e. "how" it fails), as it may be different from previous attempts.
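In other words, keeping everything else as it is, the two files would look like this (based on the ones you shared, with only the uri values changed):

IrisModel/model-settings.json:

{
  "name": "IrisModel",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parallel_workers": 1,
  "parameters": {
    "uri": "model.joblib"
  }
}

RandomForestModel/model-settings.json:

{
  "name": "RandomForestModel",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parallel_workers": 1,
  "parameters": {
    "uri": "model.joblib"
  }
}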