seldon-core icon indicating copy to clipboard operation
seldon-core copied to clipboard

fix(operator): Add Status.AvailableReplicas to Model CRD

Open lc525 opened this issue 1 year ago • 1 comments

The Model.Status.Replicas field always reflects the total number of Model replicas in existence, irrespective of their status. This is because it is used by the scale subresource (enabling the configuration of HPA on Model CRs).

However, we need a way of showing how many of those Replicas are available in order to distinguish between various partial availability scenarios, when the ModelReady condition is set to false.

This PR introduces a new Status.AvailableReplicas field, showing exactly that.

The context of this change is the following:

  • Say we start with a cluster where Model A has 2 replicas, and the Server on which it is scheduled also has 2 replicas
  • For whatever reason (manual change to the Server CR, failure) the number of Server replicas goes down to 1
  • As a result, Model A's ModelReady status condition transitions to false, with a message of ScheduleFailed: Failed to schedule model as no matching server had enough suitable replicas
  • However, in reality there is still 1 replica of Model A, which remains available and can serve requests. Without additional data in the Model CR Status, there's no way of knowing that.

Out-of-scope

An additional, related issue is the interaction of this type of Model status transition and Pipelines/Experiments. For example, a pipeline that has Model A as a step, will listen to Model A's Status transitions and update the PipelineReady and ModelsReady conditions to false whenever one of the models is no longer "available" (as decided by the condition here). Deciding on a way of showing partial availability of models which are part of pipelines will be tracked as a separate issue.

In general, we should consider the ergonomics of clients interacting with the k8s "API" as defined by our CR status fields.

TODOs

  • [ ] smoke test in Kind cluster

Notes to reviewers:

This PR contains files that were automatically generated as part of the process of building the operator and the CRDs/helm-charts. I've left a comment where that's the case, describing how the file was generated.

lc525 avatar Sep 04 '24 17:09 lc525

There is a bug where the expected number of model replicas does not update after change in the number of server replicas. Will fix that part first before merging

lc525 avatar Sep 17 '24 10:09 lc525

There is a bug where the expected number of model replicas does not update after change in the number of server replicas. Will fix that part first before merging

I think this is now fixed, will revisit in a follow up PR if it resurfaced.

sakoush avatar Jan 29 '25 13:01 sakoush