seldon-core
seldon-core copied to clipboard
fix(operator): Add Status.AvailableReplicas to Model CRD
The Model.Status.Replicas field always reflects the total number of Model replicas in existence, irrespective of their status. This is because it is used by the scale subresource (enabling the configuration of HPA on Model CRs).
However, we need a way of showing how many of those Replicas are available in order to distinguish between various partial availability scenarios, when the ModelReady condition is set to false.
This PR introduces a new Status.AvailableReplicas field, showing exactly that.
The context of this change is the following:
- Say we start with a cluster where Model A has 2 replicas, and the Server on which it is scheduled also has 2 replicas
- For whatever reason (manual change to the Server CR, failure) the number of Server replicas goes down to 1
- As a result, Model A's
ModelReadystatus condition transitions tofalse, with a message ofScheduleFailed: Failed to schedule model as no matching server had enough suitable replicas - However, in reality there is still 1 replica of Model A, which remains available and can serve requests. Without additional data in the Model CR Status, there's no way of knowing that.
Out-of-scope
An additional, related issue is the interaction of this type of Model status transition and Pipelines/Experiments. For example, a pipeline that has Model A as a step, will listen to Model A's Status transitions and update the PipelineReady and ModelsReady conditions to false whenever one of the models is no longer "available" (as decided by the condition here). Deciding on a way of showing partial availability of models which are part of pipelines will be tracked as a separate issue.
In general, we should consider the ergonomics of clients interacting with the k8s "API" as defined by our CR status fields.
TODOs
- [ ] smoke test in Kind cluster
Notes to reviewers:
This PR contains files that were automatically generated as part of the process of building the operator and the CRDs/helm-charts. I've left a comment where that's the case, describing how the file was generated.
There is a bug where the expected number of model replicas does not update after change in the number of server replicas. Will fix that part first before merging
There is a bug where the expected number of model replicas does not update after change in the number of server replicas. Will fix that part first before merging
I think this is now fixed, will revisit in a follow up PR if it resurfaced.