modelmesh-serving
"Specified runtime name not recognised" error in Inference Service
Describe the bug
When I create one custom serving runtime (SR1) and one inference service (IS1) that is to be loaded onto SR1, I observe the expected behaviour: IS1 shows the failure message `Waiting for runtime Pod to become available` while the serving runtime is scaling up, and once the serving runtime pod is fully running, IS1 is immediately loaded and goes into `Ready` state.
However, if I create a second serving runtime (SR2) and a second inference service (IS2) that is to be loaded onto SR2, IS2 instead shows the failure message `Specified runtime name not recognized` while SR2 is scaling up. Once the SR2 pod has all containers in Running state, IS2 is not loaded immediately; it only gets loaded a few minutes later, when the ModelMesh controller's predictor reconciler picks it up again. This is unexpected, as I would think SR2 and IS2 should follow the same behaviour as SR1 and IS1.
To Reproduce
Serving Runtime and Inference Service templates used when observing this error:
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-1
spec:
  builtInAdapter:
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8001
    serverType: mlserver
  containers:
    - env:
        - name: MLSERVER_MODELS_DIR
          value: /models/
        - name: MLSERVER_GRPC_PORT
          value: "8001"
        - name: MLSERVER_HTTP_PORT
          value: "8002"
        - name: MLSERVER_LOAD_MODELS_AT_STARTUP
          value: "false"
        - name: MLSERVER_MODEL_NAME
          value: dummy-model
        - name: MLSERVER_HOST
          value: 127.0.0.1
        - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
          value: "16777216"
        - name: MLSERVER_DEBUG
          value: "true"
        - name: MLSERVER_MODEL_PARALLEL_WORKERS
          value: "0"
        - name: MLSERVER_MODEL_IMPLEMENTATION
          value: mlserver_sklearn.SKLearnModel
      image: seldonio/mlserver:1.2.3
      name: mlserver
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: "2"
          memory: 1Gi
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  multiModel: true
  replicas: 1
  storageHelper:
    disabled: false
  supportedModelFormats:
    - autoSelect: true
      name: custom
      version: "1"
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
    serving.kserve.io/secretKey: default
  name: sklearn-1
spec:
  predictor:
    model:
      modelFormat:
        name: custom
        version: "1"
      name: ""
      resources: {}
      runtime: mlserver-1
      storageUri: <s3 uri>
```
The model used for the inference service is the SKLearn model trained on the MNIST dataset that is documented here.
Steps to reproduce the behavior:
- Within the same namespace, create one serving runtime with the above template. After the serving runtime deployment has been created, create one inference service with the above template.
- Wait for the serving runtime pod to have all containers in Running state and for the inference service to reach `Ready` state.
- Within the same namespace, create another serving runtime with the same template. Update the `spec.predictor.model.runtime` value in the inference service template to point at the new serving runtime and create a new inference service.
- The `Specified runtime name not recognized` error should appear in the status field of the new inference service (a rough command-line sketch of these steps follows below).
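A rough command-line sketch of the steps above (the file names, resource names, and namespace are illustrative, not part of the templates):
```sh
# First runtime/service pair: templates above (mlserver-1 / sklearn-1).
kubectl -n modelmesh-serving apply -f servingruntime-1.yaml
kubectl -n modelmesh-serving apply -f inferenceservice-1.yaml
kubectl -n modelmesh-serving get pods -w                          # wait for the SR1 pod to be Running
kubectl -n modelmesh-serving get inferenceservice sklearn-1 -w    # IS1 reaches Ready

# Second pair: same templates with the names and spec.predictor.model.runtime
# changed (e.g. mlserver-2 / sklearn-2).
kubectl -n modelmesh-serving apply -f servingruntime-2.yaml
kubectl -n modelmesh-serving apply -f inferenceservice-2.yaml
kubectl -n modelmesh-serving get inferenceservice sklearn-2 -o yaml   # status shows RuntimeNotRecognized
```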
Expected behavior
While the serving runtime is scaling up, the new inference service should show the failure message `Waiting for runtime Pod to become available` with reason `RuntimeUnhealthy`, rather than the message `Specified runtime name not recognized` with reason `RuntimeNotRecognized`. The inference service should also be updated and loaded immediately after the serving runtime has scaled up, rather than a couple of minutes after the serving runtime pod is ready.
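For completeness, this is roughly how the failure info can be checked (the `sklearn-2` name and namespace are placeholders, and the field path follows my understanding of the v1beta1 `modelStatus` layout, so verify against your cluster):
```sh
# Inspect the failure info reported on the new inference service (names are illustrative).
kubectl -n modelmesh-serving get inferenceservice sklearn-2 \
  -o jsonpath='{.status.modelStatus.lastFailureInfo}'
# Observed while SR2 scales up:
#   {"message":"Specified runtime name not recognized","reason":"RuntimeNotRecognized"}
# Expected (what SR1/IS1 showed):
#   {"message":"Waiting for runtime Pod to become available","reason":"RuntimeUnhealthy"}
```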
Screenshots
Nil
Environment (please complete the following information):
- OS: macOS
- Browser: Nil
- Version: KServe v0.9.0, Kubernetes v1.24.10

ModelMesh (v0.9.0) was installed on a fresh cluster using the `--quickstart` option.
Additional context
Some observations that might be related to the problem:
- When I monitor the contents of etcd during serving runtime and inference service creation (see the inspection sketch after this list), I noticed that for SR1 and IS1 the `instances` record was created before the `vmodels` and `registry` records. For SR2 and IS2, however, the `vmodels` and `registry` records were created the moment the inference service was created, while the `instances` record only appeared later, once the SR2 pod was reaching Running state. Could this be why the controller could not find any running instances of SR2 for IS2, leading to the `Specified runtime name not recognized` error?
- When the serving runtimes and inference services are deleted from the cluster, I noticed that if I delete the inference service before the serving runtime, the corresponding etcd records are fully deleted. However, if I delete the serving runtime first and then the inference service, the `vmodels` and `registry` records for the inference service are not deleted from etcd. Is this behaviour expected?
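For reference, this is roughly how I inspected the etcd records (the service name, namespace, and key layout are assumptions based on a `--quickstart` install; the actual connection details and key prefix are in the `model-serving-etcd` secret):
```sh
# Port-forward the quickstart etcd (service name assumed to be "etcd") and list keys.
kubectl -n modelmesh-serving port-forward svc/etcd 2379:2379 &
ETCDCTL_API=3 etcdctl --endpoints=localhost:2379 get --prefix "" --keys-only \
  | grep -E 'vmodels|registry|instances'
```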
Thanks again for opening up a very detailed issue!
In general, I don't think the controller is hardened for a use-case where many SRs are created dynamically alongside ISVCs. The typical usage is to (1) define all of the SRs that you plan to use and (2) create ISVCs as needed. I even consider `scaleToZero` to be mainly useful for dev/test in a resource-constrained environment.
The boot-up differences between SR1+ISVC1 and SR2+ISVC2 can be explained by the behavior of `scaleToZero` (which is on by default). After an initial install without any ISVCs, there are no Runtime pods up, which means there are no ModelMesh containers running; effectively, the ModelMesh cluster does not yet exist. Creation of the first ISVC causes the ServingRuntime to be scaled up and pods to spin up. Until one of the pods comes up, the ISVC cannot be registered because there is no MM container to connect to. This forces an ordering: ServingRuntime first, ISVC registration second.
When SR2+ISVC2 are created, ISVC2 can be registered using the existing MM containers in SR1. From your observations, the registration of SR2 happens only after one of its pods is able to spin up, so there is a time period when the ISVC is registered to use a Runtime that does not yet exist, which results in the `Specified runtime name not recognized` error.
So the fix would be to have the controller recognize that a ServingRuntime exists without having to wait for the pods to come up. I think the current behavior is that the controller just registers the ISVC with ModelMesh and grabs the error from ModelMesh, and ModelMesh doesn't know about the new Runtime until the pods come up.
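As an interim workaround, scale-to-zero can be disabled so that runtime pods are already up before any ISVC is created. A rough sketch using the `model-serving-config` user ConfigMap override (namespace assumed to be `modelmesh-serving`):
```sh
# Disable scale-to-zero via the user config override ConfigMap (a sketch, not tested here).
kubectl -n modelmesh-serving apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
data:
  config.yaml: |
    scaleToZero:
      enabled: false
EOF
```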
The second behavior that you observed is more interesting:
> However, if I delete the serving runtime first, then delete the inference service after, the `vmodels` and `registry` records for the inference service are not deleted from etcd. Is this behaviour expected?
I'll have to look into that. I think it is expected, since the registration of the ISVC is somewhat independent of the existence of a Runtime. It seems a bit strange that deleting the ISVC wouldn't cause the cleanup, but I think all of the data in etcd has an expiry, so the entries will eventually be removed. It might lead to weirdness if you recreated the Runtime before the entries expire 🤔. Actually, does this happen only when deleting the last ServingRuntime? Or can you delete SR2 while SR1 is still up and still observe this behavior for ISVC2? If it only happens when removing both ServingRuntimes, I think the problem is that the entries in etcd are only removed by a request to a ModelMesh container, so the controller can't clean up an existing ISVC correctly unless there is an SR pod still up.
@tjohnson31415 Thanks for the explanation! It was really helpful for understanding how the controller handles the registration of serving runtimes and inference services.
Unfortunately, my use case for ModelMesh requires that SRs can be created dynamically along with ISVCs, and that the ISVCs are loaded onto the SRs as quickly as possible. From my observations so far, it can take approximately 5 minutes after the SR pod has fully come up before the ISVC gets reconciled and goes into `Ready` state, and this lag between the SR pod being up and the ISVC being ready is the main issue for me. If it would be too much trouble to change the controller's behaviour to recognise that an SR exists without having to wait for the pods to come up, do you know if there is a way to shorten this lag and allow the ISVC to reach `Ready` state sooner after the SR pod is up?
Otherwise, to enable the controller to recognise that an SR exists without having to wait for the pods to come up, do you think the `tc-config` ConfigMap could be used to determine whether an SR exists? I noticed that the `type_constraints` within this ConfigMap gets updated immediately after the SR resource is created, without waiting for the SR pod to scale up, even when `scaleToZero` is enabled, so I was wondering if it could be of use.
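For what it's worth, this is how I was watching the `type_constraints` update (namespace assumed to be `modelmesh-serving`):
```sh
# Watch the tc-config ConfigMap to see type_constraints change as SRs are created.
kubectl -n modelmesh-serving get configmap tc-config -o yaml -w
```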
As for the second issue about the deletion of the etcd records: I have tried deleting SR2 while SR1 is still up, then deleting ISVC2 after that, and the `vmodels` and `registry` records for ISVC2 are indeed removed successfully. So it seems you are most likely right about the cause of this issue.
Hi, may I know if there are any intentions to address this issue (i.e. modify the controller's behaviour so that it recognises that a ServingRuntime exists without having to wait for the pods to come up)?