KEP-4650: StatefulSet Support for Updating Volume Claim Template
- One-line PR description: initial proposal of KEP-4650: StatefulSet Support for Updating Volume Claim Template
- Issue link: #4650
- Other comments: Inspired by and should replace #3412
This KEP is inspired by the previous proposal in #3412. However, there are several major differences, so I decided to create a new KEP for this.
Differences:
- No validation is performed at the StatefulSet level. We accept everything, do what we can, and report what we have done in the status. This should resolve the main concern of the previous attempt, namely that it is hard to recover from an expansion failure. Just `kubectl rollout undo` should work in this KEP (see the sketch below).
- We take extra care about backward compatibility. No immediate changes are expected when the feature is enabled.
- We accept changes to all fields, not just storage size, and explicitly consider what to do if the PVC does not match the template.
Please read more in the "Alternatives" section in the KEP.
This KEP also contains more details about how to coordinate the update of Pod and PVC, which is another main concern of the previous attempt.
/sig apps
/cc @smarterclayton @gnufied @soltysh Also cc @areller @maxin93 @jimethn: you joined the conversations in #3412, so you may also be interested in this KEP.
Updated "Production Readiness Review Questionnaire" as requested. /assign @wojtek-t For PRR.
During the last sig-apps meeting, Q1: @kow3ns asked why, when we want to modify PVCs (e.g. migrate between storage providers), we don't use application-level features and the sts ordinal feature to migrate between 2 StatefulSets.
First, this KEP is still mainly targeting the use cases where we can update the PVC in place, without interrupting the running Pods. Migration and other use cases are just by-products of enabling editing of the volumeClaimTemplates. But still, migration by editing the sts is simpler, requiring just a rolling restart, which should already be familiar to most k8s operators. Editing the sts does not require switching traffic between 2 sts, for example. And the name of the sts won't change after migration.
Yes, we cannot roll back easily with this procedure. The user would have to delete the PVC again to roll back, and the data in the PVC may not be recoverable if the retention policy is Delete. But rolling back a stateful application is inherently complex. Once a replica leaves the cluster, its state in the PV becomes stale. There is no guarantee that the old data can be used to roll back anyway. Whichever procedure is used, this complexity should be addressed at a higher level. Custom migrators built on the ordinal feature would face the same issue.
Q2: @soltysh suggested we could still go the way of KEP-0661, then do more after we gain more experience.
Here is why I don't want to proceed that way:
- We have VAC now, which is expected to go to beta soon. VAC is closely related to the storage class, and can be patched onto an existing PVC. This KEP also integrates with VAC. Only updating the VAC if the storage class matches is a very logical choice.
- By allowing editing of the whole template, we are forced to fully consider what to do if the PVC and the template are inconsistent, for example when the storage class differs. In KEP-0661, it is logical to just do the expansion anyway. But in this KEP, I think we should not expand it, because the person who writes the StatefulSet spec and the person who applies the change may not be the same. We should not surprise the operator. And this is consistent with the VAC operation model. This is the point of divergence; we cannot go through KEP-0661 and then to this KEP without breaking changes.
- KEP-0661 proposes not reverting the `volumeClaimTemplates` when rolling back the StatefulSet, which is very confusing. That is another potential breaking change if we go that way first.
Please read more in the Alternatives section in this KEP.
@soltysh said we don't want the STS to be stuck in a permanently broken state. Of course. With this KEP, since we are not validating the templates, it is actually very hard to get stuck. A larger PVC is compatible with a smaller template, so just rolling back the template unblocks future rollouts, leaving the PVCs in the expanding state, or the user can try to cancel the expansion if the RecoverVolumeExpansionFailure feature gate is enabled.
I think the only way we may get stuck is patching the VAC: one replica is successfully updated to the new VAC, another replica fails, and rolling back the first replica to the old VAC also fails. Even in this case, the user can just set volumeClaimUpdateStrategy to OnDelete to unblock Pod rollouts.
Q3: @kow3ns thinks it is not appropriate to delete and recreate the PVC to alter the performance characteristics of volumes.
VAC is the KEP that actually parameterizes the storage class and allows us to specify and update the performance characteristics of volumes without interrupting the running Pod, by patching the existing PVC. So this KEP should also integrate with VAC. An update to the VAC in the volumeClaimTemplates should not require re-creating the PVC, and is fully automated if everything goes well.
Q4: @kow3ns asked how we should handle each field of volumeClaimTemplates.
This is described in the KEP in the "How to update PVCs" section under "Updated Reconciliation Logic". Basically, patch what we can, skip the rest.
It seems the wording "update PVC in-place" causes many misunderstandings. I will replace it with "patch PVC".
We didn't actually decide anything during the last meeting. I think these core questions need to be decided to push this KEP forward:
- Allow editing all fields, or only storage size and VAC?
- What to do if storage class does not match between template and PVC? Skip, block, or proceed anyway.
- Should `kubectl rollout undo` affect `volumeClaimTemplates` or not?
Here is why I don't want to proceed that way:
The general sentiment from that sig-apps call (see https://docs.google.com/document/d/1LZLBGW2wRDwAfdBNHJjFfk9CFoyZPcIYGWU7R1PQ3ng/edit#heading=h.2utc2e8dj14) was that the smaller changes have a greater chance of moving forward. Also, it's worth noting that the smaller changes do not stand in opposition to the changes proposed here; they only take the gradual approach of focusing on a minimal subset of changes.
Okay, we agreed on focusing on the minimal subset of changes. @huww98 and @vie-serendipity will proceed with only the VolumeClaimTemplates.spec.resources.requests.storage and VolumeClaimTemplates.spec.volumeAttributesClassName fields. There are still two questions remaining to solve; maybe we can talk about them in the next sig-apps meetings.
- What to do if storage class does not match between template and PVC? Skip, block, or proceed anyway.
- Should kubectl rollout undo affect volumeClaimTemplates or not?
At the last sig-apps meeting, we decided that we should:
- Only allow editing of storage size in the template;
- If the storage class does not match between template and PVC, block the update process and wait for user interaction.
These changes should greatly shrink the scope of this KEP, and I think this is beneficial to moving the KEP forward.
But as for validation of the template, I think we still need more discussion. It can be a major blocking point of this KEP. @soltysh thinks that we should not allow decreasing the size in the template. He thinks we can remove the validation later if desired. But I think validation has many drawbacks which may block normal usage of this feature and should be resolved in the initial version:
- If we disallow decreasing, we make the edit a one-way road. If a user edits it and then finds it was a mistake, there is no way back. The StatefulSet will be broken forever. If this happens, updates to Pods will also be blocked. This is not acceptable IMO.
- To mitigate the above issue, we will want to prevent the user from going down this one-way road by mistake. We are forced to do many more validations in the API server, which is very complex and fragile (please see KEP-0661). For example: check the storage class `allowVolumeExpansion`, check each PVC's storage class and size, basically duplicating all the validations we already do on PVCs. And even if we do all the validations, there are still race conditions and async failures that are impossible to catch. I see this as a major drawback of KEP-0661 that I want to avoid in this KEP.
- Validation means we would have to disable rollback of storage size. If we enable it later, it can surprise users, if it is not outright called a breaking change.
- The validation conflicts with the RecoverVolumeExpansionFailure feature, although that is still alpha.
By contrast, if we just don't add the validation, we avoid all these issues and lose nothing: the user can already expand a PVC independently today. So the state where the template is smaller than the PVC is already very common and stable. The strategy in this state is simply not to try to shrink the PVC. I think this is well defined and easy to follow. If Kubernetes ever supports shrinking in the future, we will still need to support drivers that can't shrink; so even then we can only support shrinking with a new volumeClaimUpdateStrategy (maybe InPlaceShrinkable).
To take one step back, I think validating the template across resources violates the high-level design. The template describes a desired final state, rather than an immediate instruction. A lot of things can happen externally after we update the template. For example, I have an IaaS platform which kubectl-applies one updated StatefulSet plus one new StorageClass to the cluster to trigger the expansion of PVs. We don't want to reject it just because the StorageClass is applied after the StatefulSet, right?
To conclude, I don't want to add the validation; we shouldn't add it just to remove it in the future.
Agree. By the way, "request" means the minimal resource requirement, so an actual allocation larger than the request is perfectly reasonable. What we need is just to show it to users.
- If we disallow decreasing, we make the edit a one-way road. If a user edits it and then finds it was a mistake, there is no way back. The StatefulSet will be broken forever. If this happens, updates to Pods will also be blocked. This is not acceptable IMO.
That's one way of looking at it. Also, in those cases where a mistake happens (I consider that a rare occurrence), you can always use https://github.com/kubernetes/enhancements/issues/3335 and migrate to a new, smaller StatefulSet. Additionally, it's always easier to limit the scope of an enhancement and expand it once we confirm there are sufficient use cases supporting it. The other way around, i.e. removing functionality, is either hard or more often not possible at all. We've done that in the past, and given the available solutions around "fixing a mistake", I'll stand strong with only allowing increasing the size.
@huww98 and @liubog2008 are you planning to push this through for 1.32?
Users have been waiting for many years to be able to scale up StatefulSet volumes. I agree we shouldn't overcomplicate that use case by trying to solve other issues at the same time. Let's focus on the very common use case, and then we can reevaluate other features after that is completed.
@soltysh The Enhancement Freeze date is Oct 11th, right? We are on vacation right now and may not make that date. cc @huww98 @vie-serendipity
Roger that, let's try to work on the enhancement still within 1.32 time frame, so that it's ready in time for 1.33, in that case.
okay, that sounds great~
@huww98 see my question here
@huww98 I realize it has been a long slog but I hope you still have the motivation to carry this KEP through.
I've made a major revision of this KEP, taking in the experience I've gained these past months.
Apart from filling in more sections of the template, I've removed the concept of a compatible PVC. Instead, I propose to add the controller-revision-hash label to PVCs and introduce a ready concept for PVCs, making them more like Pods.
Instead of comparing the PVC status to the volumeClaimTemplates, I propose to only update each PVC once per rollout, using the controller-revision-hash label to record that we have done the update. Then we just wait for the PVC to become ready, which compares the PVC status to its own spec. This allows the user to edit the PVC after the StatefulSet controller applies the change, perhaps as a workaround to unblock a rollout.
I think this makes the behavior very intuitive to anyone who already knows how the StatefulSet controller rolls out Pods. It also greatly simplifies the implementation of the StatefulSet controller itself.
To be discussed:
- Should we reuse `minReadySeconds` to slow down the PVC change rollout? I've seen people set this field to a very large value to effectively pause the rollout for manual verification.
- If the feature gate is disabled while `volumeClaimUpdatePolicy` is set to `InPlace`, should we preserve the `volumeClaimTemplates` in the ControllerRevision?
  - If not, every Pod under an STS with `volumeClaimUpdatePolicy=InPlace` needs to update the label once.
@wojtek-t would you consider this enhancement from PRR perspective complete for alpha while we finalize some of the implementation details? I am asking this assuming today is PRR freeze.
PRR freeze is not about having it approved, but rather about having it in reviewable shape. It definitely meets this bar, so we have a week to figure it out. I will take another pass tomorrow.
/label tide/merge-method-squash
A new revision is uploaded: the order of PVC/Pod updates has changed. We now update the PVC and wait for it to be ready before deleting the old Pod:
- resolves the previous comments about quota and PVC update failures for other reasons
- avoids the possible race condition described in KEP-5381
- removes the need for a new status field to monitor the PVC update process
Currently unresolved discussions (correct me if I missed something):
- Offline expansion: not likely to be included in this KEP. We can continue the discussion in a future KEP.
- Undo: we may allow reducing the template size, or enhance kubectl to do a partial undo. We will see as we gain more experience.
The enhancement freeze is close. Can we merge this PR and continue to discuss and refine in future PRs, as suggested in the KEP template?
When editing KEPs, aim for tightly-scoped, single-topic PRs to keep discussions focused. If you disagree with what is already in a document, open a new PR with suggested changes.
Offline expansion: not likely to be included in this KEP. We can continue the discussion in a future KEP.
I would like to see a way this can fail fast for drivers that only support offline expansion. We probably do not want users to attempt a StatefulSet update that will get wedged because the volumes can't be expanded online. We may not have to block the KEP on that; it can be discussed during API review as well.
Undo: we may allow reducing the template size, or enhance kubectl to do a partial undo. We will see as we gain more experience.
I am curious why we decided to store the volumeClaimTemplate in the ControllerRevision? I guess this is the main reason we can't support undo, right?
I think it would be preferable to design the feature with undo in mind: https://github.com/kubernetes/enhancements/pull/4651#discussion_r2154753011
Given the lack of SIG agreement up until now, it's probably too late to get it into 1.34.
I would like to see a way this can fail fast for drivers that only support offline expansion.
@gnufied That can be hard. Kubernetes is currently not aware of whether a volume supports online resize.
I am curious why we decided to store the volumeClaimTemplate in the ControllerRevision?
As described in the other comment, I need to add the claim template into the revision so that I get a new revision when the claim templates are updated, which triggers the rollout, reusing the existing Pod rollout infrastructure for PVCs.
Besides the above reason, we can still undo other fields of the claim templates, and we may also support undoing size changes in the future.
And it is useful when creating new PVCs. Quoting the KEP:
When creating new PVCs, use the volumeClaimTemplates from the same revision that is used to create the Pod.
I reviewed the StatefulSet controller code recently and made these changes:
- We will not update PVCs when the `OnDelete` `updateStrategy` is used, as suggested by @atiratree.
- We will now update and wait for the PVC to be ready even if the corresponding Pod is already ready (update the PVC retrospectively, if I understand this word correctly), because I realized the Pod may be deleted externally, e.g. evicted. We cannot just skip the PVC update if the Pod is deleted. This will also change the behavior when the feature gate is enabled, disabled, then enabled again. Check the new KEP revision for details.
/remove-lifecycle stale