KEP-3335: StatefulSet Slice
- One-line PR description: Adding new KEP to support StatefulSet slice
- Issue link: https://github.com/kubernetes/enhancements/issues/3335
- Other comments:
Welcome @pwschuurman!
It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.
You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.
You can also check if kubernetes/enhancements has its own contribution guidelines.
You may want to refer to our testing guide if you run into trouble with your tests not passing.
If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!
Thank you, and welcome to Kubernetes. :smiley:
Hi @pwschuurman. Thanks for your PR.
I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the `ok-to-test` label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Another possible user story: I want to use kubefed to schedule a StatefulSet to multiple clusters, but doing so would result in a StatefulSet of [0, n) being dispatched as e.g. [0, m) on member cluster 1 and [0, n - m) on member cluster 2. The ability to set the start ordinal to `m` on member cluster 2 makes it possible to correctly dispatch the slice [m, n) to that cluster.
Thanks @SOF3 for the use case. What is the benefit of having unique ordinals per cluster when using KubeFed? Does this feature help with fungibility of replicas across clusters, if KubeFed were to leverage this feature? (e.g. ordinal `k` could be moved from member cluster 1 to member cluster 2 without needing to change the underlying PVC name of the replica.)
@pwschuurman Yes, it allows a special scheduler that can assign a different starting ordinal for each member cluster.
Currently a Deployment of 20 replicas may be distributed as:

```yaml
- cluster: a
  overrides:
  - {path: "/spec/replicas", value: 5}
- cluster: b
  overrides:
  - {path: "/spec/replicas", value: 15}
```
With the proposed changes, a StatefulSet may be distributed as:

```yaml
- cluster: a
  overrides:
  - {path: "/spec/replicas", value: 5}
  - {path: "/spec/replicaStartOrdinal", value: 0}
- cluster: b
  overrides:
  - {path: "/spec/replicas", value: 15}
  - {path: "/spec/replicaStartOrdinal", value: 5}
```
Then this would also generate a total of 20 pods with ordinals [0, 20) across all clusters.
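The invariant behind these overrides can be sketched in a few lines of Go: each cluster owns the contiguous slice [start, start+replicas), and the slices must exactly partition [0, total). The type and helper names below are hypothetical illustrations, not part of the KEP API; only the `replicas`/`replicaStartOrdinal` field semantics come from the proposal above.

```go
package main

import "fmt"

// memberSlice models one member cluster's portion of a StatefulSet:
// pods with ordinals in the half-open range [Start, Start+Replicas).
type memberSlice struct {
	Cluster  string
	Start    int // proposed .spec.replicaStartOrdinal
	Replicas int // .spec.replicas
}

// podOrdinals returns the ordinals owned by one member cluster's slice.
func podOrdinals(s memberSlice) []int {
	out := make([]int, 0, s.Replicas)
	for i := s.Start; i < s.Start+s.Replicas; i++ {
		out = append(out, i)
	}
	return out
}

// contiguous reports whether the slices, in order, exactly cover
// [0, total) with no gaps and no overlapping ordinals.
func contiguous(slices []memberSlice, total int) bool {
	next := 0
	for _, s := range slices {
		if s.Start != next {
			return false
		}
		next += s.Replicas
	}
	return next == total
}

func main() {
	dist := []memberSlice{
		{Cluster: "a", Start: 0, Replicas: 5},
		{Cluster: "b", Start: 5, Replicas: 15},
	}
	// The distribution from the overrides above covers [0, 20) exactly.
	fmt.Println(contiguous(dist, 20))
	// Cluster b's first pod carries ordinal 5, i.e. <name>-5.
	fmt.Println(podOrdinals(dist[1])[0])
}
```

A scheduler assigning overrides per cluster could use a check like `contiguous` to guarantee no ordinal is duplicated or skipped across member clusters.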
@soltysh @kow3ns
Is there future compatibility for disjoint slices with the same name?
The use case is like this: initially, we schedule `foo` to cluster A with 5 replicas and cluster B with 8 replicas, so cluster A has [0, 5) and cluster B has [5, 13). Later on, cluster B is running out of spare resources (still sufficient for the 8 replicas), but cluster A has gained more resources. When we want to scale `foo` up from 13 replicas to 16 replicas, we want cluster A to run replicas [0, 5) ∪ [13, 16) and cluster B to run replicas [5, 13). It is not possible to create two StatefulSets, because we cannot have two StatefulSets with the same name, such that `foo-4` and `foo-13` get created but `foo-8` doesn't.
A single `replicaStartIndex` would be confusing if this feature is ever added in the future.
Disjoint sets did come up in design, but they weren't needed for the two use cases described in the original KEP. In addition to your use case, they could be useful in testing (e.g. omitting a particular ordinal of a StatefulSet). The challenge with integrating an API for disjoint sets is that the `replicas` ordinal may no longer be useful (either the disjoint-set API is used, or the single set with `replicas` is used). Consider a scenario where the StatefulSet spec is augmented with an array of slices ([0, 5), [13, 16)) to allow for a disjoint set. In that scenario, `replicas` may carry redundant information (e.g. does it refer to the total number of replicas, which is 8?).
If we do want disjoint sets, I think introducing a new CRD that represents the slice would make more sense (e.g. similar to how EndpointSlices came to be: https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/), and introducing an API into StatefulSet to reference a group of slices.
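The pod-name arithmetic in the disjoint-set scenario above can be sketched with a small model. Everything here is a hypothetical illustration of the discussion, not a proposed API: `ordinalRange` and `podNames` are invented names, and only the `<name>-<ordinal>` naming convention comes from how StatefulSets actually name pods.

```go
package main

import "fmt"

// ordinalRange is a half-open interval [Start, End) of pod ordinals.
type ordinalRange struct{ Start, End int }

// podNames expands a set of (possibly disjoint) ordinal ranges into
// pod names, following the StatefulSet convention <name>-<ordinal>.
func podNames(name string, ranges []ordinalRange) []string {
	var names []string
	for _, r := range ranges {
		for i := r.Start; i < r.End; i++ {
			names = append(names, fmt.Sprintf("%s-%d", name, i))
		}
	}
	return names
}

func main() {
	// Cluster A holds the disjoint set [0, 5) U [13, 16);
	// cluster B holds the contiguous slice [5, 13).
	clusterA := podNames("foo", []ordinalRange{{0, 5}, {13, 16}})
	clusterB := podNames("foo", []ordinalRange{{5, 13}})
	fmt.Println(clusterA) // foo-0..foo-4 and foo-13..foo-15; foo-8 is absent
	fmt.Println(len(clusterA) + len(clusterB))
}
```

This is exactly the shape a single `replicaStartOrdinal` cannot express: cluster A needs two ranges, which is why the comment above points toward a separate slice CRD rather than overloading one start-ordinal field.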
@JeremyOT @pmorie for sig-multicluster visibility
The goal of this KEP from the use cases described is to isolate slices of replicas across two StatefulSets. For an application orchestrator to migrate replicas, it can scale down replicas in StatefulSet A, and scale replicas up in StatefulSet B. For a StatefulSet using OrderedReady PodManagement, pods are scaled up starting from pod "0" to pod "N-1" (where there are N replicas), and scaled down from pod "N-1" to pod "0".
@soltysh raised the concern that this KEP introduces a significant change to the StatefulSet APIs. To reduce the complexity of this API change and implementation, rather than introducing a `.spec.replicaStartOrdinal`, a new PodManagement policy could be introduced, named `ReverseOrderedReady`. This would have the same semantics as `OrderedReady`, but scaling pods from "N-1" down to "0". This would allow an orchestrator to complement the scale-down of an `OrderedReady` StatefulSet with the scale-up of a `ReverseOrderedReady` StatefulSet. This would narrow the API scope of this KEP compared to the proposed `.spec.replicaStartOrdinal`.
Once migrated, if an orchestrator that brings up a StatefulSet with `ReverseOrderedReady` wants to change the StatefulSet to match `OrderedReady` (as in the original cluster), the StatefulSet would need to be deleted (with `--cascade=orphan`) and re-created with the updated PodManagement policy.
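The two scale-up orders being contrasted can be sketched as follows. Note that `ReverseOrderedReady` is only the alternative floated in this thread, not a shipped Kubernetes policy, and `scaleUpOrder` is an invented helper for illustration.

```go
package main

import "fmt"

// scaleUpOrder returns the order in which pods would be created for n
// replicas under OrderedReady (0 .. n-1) or under the hypothetical
// ReverseOrderedReady policy discussed in this thread (n-1 .. 0).
func scaleUpOrder(policy string, n int) []int {
	order := make([]int, n)
	for i := range order {
		if policy == "ReverseOrderedReady" {
			order[i] = n - 1 - i
		} else { // OrderedReady
			order[i] = i
		}
	}
	return order
}

func main() {
	// Migration sketch: the OrderedReady set scales down N-1 -> 0 while
	// the ReverseOrderedReady set scales up N-1 -> 0, so at every step
	// each ordinal exists in at most one of the two StatefulSets.
	fmt.Println(scaleUpOrder("OrderedReady", 3))        // [0 1 2]
	fmt.Println(scaleUpOrder("ReverseOrderedReady", 3)) // [2 1 0]
}
```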
/ok-to-test
> @soltysh raised the concern that this KEP introduces a significant change to the StatefulSet APIs. To reduce the complexity of this API change and implementation, rather than introducing a `.spec.replicaStartOrdinal`, a new PodManagement policy could be introduced, named `ReverseOrderedReady`
I think I might argue that, if the goal is to enable orchestration, a start ordinal is more useful than an opaque policy. An opaque policy is less flexible, and if we don't have a lot of users of `ReverseOrderedReady`, we'd probably make more people happy by having 1-based start ordinals AND allowing slice subdivision. Also, if you have parallel mode, you still need a way to orchestrate those pods and preserve non-duplication of ordinals. So a start ordinal is more useful.
Re: alternates considered, let's make sure to list the alternate designs and describe the key arguments for each. I'll add a few if you miss any.
Thanks. I captured the reverse PodManagementPolicy alternative.
I'm fine with the KEP from a technical pov from sig-apps side, but you need to update the metadata so that it matches reality (approvers, reviewers, owning sig, participating sig, PRR approver) so folks can approve this.
Thanks @soltysh, updated the metadata to match the current state of reviewers and owning sig.
although the verify failures are still to be fixed
Thanks @soltysh, updated the README.md TOC.
/approve PRR
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: pwschuurman, soltysh, wojtek-t
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~keps/prod-readiness/OWNERS~~ [wojtek-t]
- ~~keps/sig-apps/OWNERS~~ [soltysh]
Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.
/lgtm
/hold for @smarterclayton - to confirm
/lgtm
as well, thank you for bearing with me as we threaded the needle on future ordinal growth.
/hold cancel