
Model doesn't get redeployed after server and scheduler are killed together

Open SDJustus opened this issue 2 years ago • 4 comments

Describe the bug

Release 2.6.0 introduced the option to install the seldon-controller clusterwide. Since that change there is a bug (also present when the seldon-controller is still installed namespaced): when the scheduler and a server are killed together, the models that were on the killed server are not loaded again. Unlike in earlier releases, the controller no longer restarts itself when the scheduler gets killed, so no redeployment of the models happens.

To reproduce

  1. Install release 2.6.0 with the controller installed namespaced (not clusterwide), following the helm install doc.
  2. Deploy a server.
  3. Deploy models on the server.
  4. Kill the scheduler and the server together -> no models are redeployed, and if you run kubectl describe on the models, they are still marked as being ready.
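
To check step 4 without reading through kubectl describe for every model, something like the following might help. It is only a sketch: it assumes the Model objects are served under the resource name models.mlops.seldon.io and that each one carries a condition of type "Ready" in status.conditions; the function names are made up for illustration.

```python
import json
import subprocess


def ready_models(model_list: dict) -> list[str]:
    """Return names of Model objects whose "Ready" condition is "True".

    Assumes the standard Kubernetes layout: status.conditions is a list of
    {"type": ..., "status": ...} entries (the "Ready" type is an assumption).
    """
    names = []
    for item in model_list.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            names.append(item["metadata"]["name"])
    return names


def fetch_models(namespace: str) -> dict:
    """Fetch all Model objects in a namespace as parsed JSON (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "models.mlops.seldon.io", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling ready_models(fetch_models("<namespace>")) right after killing the scheduler and the server lists the models that still report ready; with this bug, the models from the killed server would be expected to show up there even though they are no longer loaded.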

Expected behaviour

For all models to be redeployed.

Environment

  • Cloud Provider: AWS EKS
  • Kubernetes Cluster Version: 1.24
  • Deployed Seldon System Images: Helm installation of release 2.6.0

Model Details

Not relevant.

SDJustus avatar Jul 18 '23 08:07 SDJustus

We have this issue as well. The model is not loaded again when a Server is restarted for any reason and the Model object still has status Ready.

It does help to remove the Model (after you remove finalizers manually per bug #5043), restart the scheduler statefulset, restart the Server once more and then redeploy the Model object again.
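
Roughly, the sequence above could be written down like this. Treat it as a sketch, not a verified procedure: the scheduler statefulset name seldon-scheduler, the assumption that the Server runs as a statefulset with the same name as the Server object, and the manifest path are all placeholders to adapt to your install. The function only builds the kubectl commands, it does not run them.

```python
import shlex


def workaround_commands(model: str, server: str, namespace: str,
                        manifest: str = "model.yaml",
                        scheduler: str = "seldon-scheduler") -> list[list[str]]:
    """Build the kubectl commands for the workaround, in the order described.

    All resource names and the manifest path are assumptions for illustration.
    """
    ns = ["-n", namespace]
    return [
        # 1. Drop the finalizers so the Model can actually be deleted.
        ["kubectl", "patch", "models.mlops.seldon.io", model, *ns,
         "--type", "merge", "-p", '{"metadata":{"finalizers":null}}'],
        # 2. Remove the Model object.
        ["kubectl", "delete", "models.mlops.seldon.io", model, *ns],
        # 3. Restart the scheduler statefulset.
        ["kubectl", "rollout", "restart", f"statefulset/{scheduler}", *ns],
        # 4. Restart the Server once more (assumed to be a statefulset).
        ["kubectl", "rollout", "restart", f"statefulset/{server}", *ns],
        # 5. Redeploy the Model object.
        ["kubectl", "apply", "-f", manifest, *ns],
    ]


if __name__ == "__main__":
    for cmd in workaround_commands("my-model", "mlserver", "seldon-mesh"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before pasting them into a shell one by one.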

I believe there is something wrong in the scheduler. Maybe it keeps an incorrect state of the Model or does not reflect the Server restart at all?

Kolajik avatar Nov 10 '23 16:11 Kolajik

I think this particular issue came with the controller being able to be installed clusterwide... Before that, the controller got restarted every time something happened with the scheduler. I don't know if this is already fixed in some of the pre-releases though.

SDJustus avatar Nov 10 '23 17:11 SDJustus

I am using the latest images of the scheduler, controller and envoy. That version should be higher than 2.6.0, so I don't think it was fixed.

Kolajik avatar Nov 10 '23 18:11 Kolajik

Could you let us know whether you are aware of this and can perhaps reproduce it? This seems like a critical scenario. Maybe @sakoush?

SDJustus avatar Dec 21 '23 09:12 SDJustus

A fix is now merged: https://github.com/SeldonIO/seldon-core/pull/5411 @SDJustus

sakoush avatar Mar 11 '24 12:03 sakoush