KEP-3335: StatefulSet Slice
- One-line PR description: Adding new KEP to support StatefulSet slice
- Issue link: https://github.com/kubernetes/enhancements/issues/3335
- Other comments:
Welcome @pwschuurman!
It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.
You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.
You can also check if kubernetes/enhancements has its own contribution guidelines.
You may want to refer to our testing guide if you run into trouble with your tests not passing.
If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!
Thank you, and welcome to Kubernetes. :smiley:
Hi @pwschuurman. Thanks for your PR.
I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the `ok-to-test` label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Another possible user story: I want to use kubefed to schedule a StatefulSet to multiple clusters, but doing so would result in a StatefulSet of [0, n) being dispatched as e.g. [0, m) on member cluster 1 and [0, n - m) on member cluster 2. The ability to set the start ordinal to `m` on member cluster 2 makes it possible to correctly dispatch the slice [m, n) to that cluster.
Thanks @SOF3 for the use case. What is the benefit of having unique ordinals per cluster when using KubeFed? Does this feature help with fungibility of replicas across clusters, if KubeFed were to leverage this feature? (e.g. ordinal `k` could be moved from member cluster 1 to member cluster 2 without needing to change the underlying PVC name of the replica.)
@pwschuurman Yes, it allows a special scheduler that can assign a different starting ordinal for each member cluster.
Currently a Deployment of 20 replicas may be distributed as:

```yaml
- cluster: a
  overrides:
  - {path: "/spec/replicas", value: 5}
- cluster: b
  overrides:
  - {path: "/spec/replicas", value: 15}
```
With the proposed changes, a StatefulSet may be distributed as:

```yaml
- cluster: a
  overrides:
  - {path: "/spec/replicas", value: 5}
  - {path: "/spec/replicaStartOrdinal", value: 0}
- cluster: b
  overrides:
  - {path: "/spec/replicas", value: 15}
  - {path: "/spec/replicaStartOrdinal", value: 5}
```
Then this would also generate a total of 20 pods with ordinals [0, 20) across all clusters.
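The invariant behind these overrides can be sketched in a few lines of Go: each cluster owns the contiguous slice [start, start+replicas), and the slices must exactly partition [0, total). The type and helper names below are hypothetical illustrations, not part of the KEP API; only the `replicas`/`replicaStartOrdinal` field semantics come from the proposal above.

```go
package main

import "fmt"

// memberSlice models one member cluster's portion of a StatefulSet:
// pods with ordinals in the half-open range [Start, Start+Replicas).
type memberSlice struct {
	Cluster  string
	Start    int // proposed .spec.replicaStartOrdinal
	Replicas int // .spec.replicas
}

// podOrdinals returns the ordinals owned by one member cluster's slice.
func podOrdinals(s memberSlice) []int {
	out := make([]int, 0, s.Replicas)
	for i := s.Start; i < s.Start+s.Replicas; i++ {
		out = append(out, i)
	}
	return out
}

// contiguous reports whether the slices, in order, exactly cover
// [0, total) with no gaps and no overlapping ordinals.
func contiguous(slices []memberSlice, total int) bool {
	next := 0
	for _, s := range slices {
		if s.Start != next {
			return false
		}
		next += s.Replicas
	}
	return next == total
}

func main() {
	dist := []memberSlice{
		{Cluster: "a", Start: 0, Replicas: 5},
		{Cluster: "b", Start: 5, Replicas: 15},
	}
	// The distribution from the overrides above covers [0, 20) exactly.
	fmt.Println(contiguous(dist, 20))
	// Cluster b's first pod carries ordinal 5, i.e. <name>-5.
	fmt.Println(podOrdinals(dist[1])[0])
}
```

A scheduler assigning overrides per cluster could use a check like `contiguous` to guarantee no ordinal is duplicated or skipped across member clusters.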
@soltysh @kow3ns
Is there future compatibility for disjoint slices with the same name?
The use case is like this: initially, we schedule `foo` to cluster A with 5 replicas and cluster B with 8 replicas, so cluster A has [0, 5) and cluster B has [5, 13). Later on, cluster B is running out of spare resources (still sufficient for the 8 replicas), but cluster A has gained more resources. When we want to scale `foo` up from 13 replicas to 16 replicas, we want cluster A to run replicas [0, 5) ∪ [13, 16) and cluster B to run replicas [5, 13). It is not possible to create two StatefulSets, because we cannot have two StatefulSets with the same name, such that `foo-4` and `foo-13` get created but `foo-8` doesn't.
A single `replicaStartIndex` would be confusing if this feature is ever added in the future.
Disjoint sets did come up in design, but they weren't needed for the two use cases described in the original KEP. In addition to your use case, they could be useful in testing (e.g. omitting a particular ordinal of a StatefulSet). The challenge with integrating an API for disjoint sets is that the `replicas` ordinal may no longer be useful (either the disjoint-set API is used, or the single set with `replicas` is used). Consider a scenario where the StatefulSet spec is augmented with an array of slices ([0, 5), [13, 16)) to allow for a disjoint set. In that scenario, `replicas` may carry redundant information (e.g. does it refer to the total number of replicas, which is 8?).
If we do want disjoint sets, I think introducing a new CRD that represents the slice would make more sense (e.g. similar to how EndpointSlices came to be: https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/), and introducing an API into StatefulSet to reference a group of slices.
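The pod-name arithmetic in the disjoint-set scenario above can be sketched with a small model. Everything here is a hypothetical illustration of the discussion, not a proposed API: `ordinalRange` and `podNames` are invented names, and only the `<name>-<ordinal>` naming convention comes from how StatefulSets actually name pods.

```go
package main

import "fmt"

// ordinalRange is a half-open interval [Start, End) of pod ordinals.
type ordinalRange struct{ Start, End int }

// podNames expands a set of (possibly disjoint) ordinal ranges into
// pod names, following the StatefulSet convention <name>-<ordinal>.
func podNames(name string, ranges []ordinalRange) []string {
	var names []string
	for _, r := range ranges {
		for i := r.Start; i < r.End; i++ {
			names = append(names, fmt.Sprintf("%s-%d", name, i))
		}
	}
	return names
}

func main() {
	// Cluster A holds the disjoint set [0, 5) U [13, 16);
	// cluster B holds the contiguous slice [5, 13).
	clusterA := podNames("foo", []ordinalRange{{0, 5}, {13, 16}})
	clusterB := podNames("foo", []ordinalRange{{5, 13}})
	fmt.Println(clusterA) // foo-0..foo-4 and foo-13..foo-15; foo-8 is absent
	fmt.Println(len(clusterA) + len(clusterB))
}
```

This is exactly the shape a single `replicaStartOrdinal` cannot express: cluster A needs two ranges, which is why the comment above points toward a separate slice CRD rather than overloading one start-ordinal field.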
@JeremyOT @pmorie for sig-multicluster visibility
The goal of this KEP from the use cases described is to isolate slices of replicas across two StatefulSets. For an application orchestrator to migrate replicas, it can scale down replicas in StatefulSet A, and scale replicas up in StatefulSet B. For a StatefulSet using OrderedReady PodManagement, pods are scaled up starting from pod "0" to pod "N-1" (where there are N replicas), and scaled down from pod "N-1" to pod "0".
@soltysh raised the concern that this KEP introduces a significant change to the StatefulSet APIs. To reduce the complexity of this API change and implementation, rather than introducing a `.spec.replicaStartOrdinal`, a new PodManagement policy could be introduced, named `ReverseOrderedReady`. This would have the same semantics as `OrderedReady`, but scaling pods from "N-1" down to "0". This would allow an orchestrator to complement the scale-down of an `OrderedReady` StatefulSet with the scale-up of a `ReverseOrderedReady` StatefulSet. This would narrow the API scope of this KEP compared to the proposed `.spec.replicaStartOrdinal`.
Once migrated, if an orchestrator that brings up a StatefulSet with `ReverseOrderedReady` wants to change the StatefulSet to match `OrderedReady` (as in the original cluster), the StatefulSet would need to be deleted (with `--cascade=orphan`) and re-created with the updated PodManagement policy.
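The two scale-up orders being contrasted can be sketched as follows. Note that `ReverseOrderedReady` is only the alternative floated in this thread, not a shipped Kubernetes policy, and `scaleUpOrder` is an invented helper for illustration.

```go
package main

import "fmt"

// scaleUpOrder returns the order in which pods would be created for n
// replicas under OrderedReady (0 .. n-1) or under the hypothetical
// ReverseOrderedReady policy discussed in this thread (n-1 .. 0).
func scaleUpOrder(policy string, n int) []int {
	order := make([]int, n)
	for i := range order {
		if policy == "ReverseOrderedReady" {
			order[i] = n - 1 - i
		} else { // OrderedReady
			order[i] = i
		}
	}
	return order
}

func main() {
	// Migration sketch: the OrderedReady set scales down N-1 -> 0 while
	// the ReverseOrderedReady set scales up N-1 -> 0, so at every step
	// each ordinal exists in at most one of the two StatefulSets.
	fmt.Println(scaleUpOrder("OrderedReady", 3))        // [0 1 2]
	fmt.Println(scaleUpOrder("ReverseOrderedReady", 3)) // [2 1 0]
}
```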
/ok-to-test
> @soltysh raised the concern that this KEP introduces a significant change to the StatefulSet APIs. To reduce the complexity of this API change and implementation, rather than introducing a `.spec.replicaStartOrdinal`, a new PodManagement policy could be introduced, named `ReverseOrderedReady`
I think I might argue that, if the goal is to enable orchestration, a start ordinal is more useful than an opaque policy. An opaque policy is less flexible, and if we don't have a lot of users of `ReverseOrderedReady`, we'd probably make more people happy by having 1-based start ordinals AND allowing slice subdivision. Also, if you have parallel mode, you still need a way to orchestrate those pods and preserve non-duplication of ordinals. So a start ordinal is more useful.
Re: alternates considered, let's make sure to list the alternate designs and describe the key arguments for each. I'll add a few if you miss any.
Thanks. I captured the reverse PodManagementPolicy alternative.
I'm fine with the KEP from a technical pov from sig-apps side, but you need to update the metadata so that it matches reality (approvers, reviewers, owning sig, participating sig, PRR approver) so folks can approve this.
Thanks @soltysh, updated the metadata to match the current state of reviewers and owning sig.
although the verify failures are still to be fixed
Thanks @soltysh, updated the README.md TOC.
/approve PRR
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: pwschuurman, soltysh, wojtek-t
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~keps/prod-readiness/OWNERS~~ [wojtek-t]
- ~~keps/sig-apps/OWNERS~~ [soltysh]
Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.
/lgtm
/hold for @smarterclayton - to confirm
/lgtm
as well, thank you for bearing with me as we threaded the needle on future ordinal growth.
/hold cancel