helm-spray
Inconsistency in the Deployment and StatefulSet "wait" sequences between Spray v3 and v4 + bugs
Spray v3 checks the completion of the upgrade process differently when "waiting" for a Deployment than when "waiting" for a StatefulSet, while Spray v4 uses a completely different algorithm that is the same for both Deployments and StatefulSets. Which one is right? Which one is wrong?
This issue analyses the counters available for Deployments and for StatefulSets, in order to find the right way to detect the end of the upgrade process.
It considers the RollingUpdate update strategy only. For StatefulSets deployed with the OnDelete strategy, no "wait" shall be done (refer to issue https://github.com/ThalesGroup/helm-spray/issues/58).
Tests were performed on Kubernetes 1.14. I do not know whether the results differ on more recent versions, sorry for that...
Spray v3
For Deployments
Here is a sequence of counter updates during a rolling update of a Deployment in the following situation:
- Number of replicas is unchanged at 2
- All Pods are restarted due to a change in their annotations
.spec.replicas: 2 .status.replicas:2 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:2
.spec.replicas: 2 .status.replicas:3 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:1 .status.unavailableReplicas:1
.spec.replicas: 2 .status.replicas:3 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:2 .status.unavailableReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:2
Here is another sequence of counter updates during a rolling update of a Deployment in the following situation:
- Number of replicas is changed from 2 to 3
- Nothing changed in Pods definition => existing Pods are not restarted
.spec.replicas: 2 .status.replicas:2 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:2
.spec.replicas: 3 .status.replicas:3 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:3 .status.unavailableReplicas:1
.spec.replicas: 3 .status.replicas:3 .status.availableReplicas:3 .status.readyReplicas:3 .status.updatedReplicas:3
Here is a final sequence of counter updates during a rolling update of a Deployment in the following situation:
- Number of replicas is changed from 3 to 2
- Change in the annotations => all existing Pods are also restarted
.spec.replicas: 3 .status.replicas:3 .status.availableReplicas:3 .status.readyReplicas:3 .status.updatedReplicas:3
.spec.replicas: 2 .status.replicas:3 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:1 .status.unavailableReplicas:1
.spec.replicas: 2 .status.replicas:3 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:2 .status.unavailableReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:2
The algorithm currently implemented in Spray v3 to check the end of the "wait" sequence for Deployments is:
- if .spec.replicas != .status.readyReplicas then continue to wait
- else
  - if (.spec.replicas == .status.updatedReplicas) and (.spec.replicas == .status.replicas) then STOP waiting
  - else continue to wait
It uses the .spec.replicas, .status.replicas, .status.updatedReplicas and .status.readyReplicas counters, but not the .status.availableReplicas and .status.unavailableReplicas counters.
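The check described above can be replayed against the observed counters with a small Python sketch. This is not Spray's actual implementation (Spray evaluates a go-template); the function names `deployment_done_v3` and `first_stop` are hypothetical, and the snapshots are the lines listed above with the initial, pre-upgrade line left out.

```python
from typing import Optional, Sequence, Tuple

Snapshot = Tuple[int, dict]  # (.spec.replicas, .status counters)


def deployment_done_v3(spec_replicas: int, status: dict) -> bool:
    """Return True when the v3-style Deployment check would stop waiting."""
    # First test: every desired replica must be ready.
    if spec_replicas != status.get("readyReplicas"):
        return False
    # Second test: all replicas updated and no surplus Pod left.
    return (spec_replicas == status.get("updatedReplicas")
            and spec_replicas == status.get("replicas"))


def first_stop(snapshots: Sequence[Snapshot]) -> Optional[int]:
    """Index of the first snapshot at which waiting stops, or None."""
    for index, (spec_replicas, status) in enumerate(snapshots):
        if deployment_done_v3(spec_replicas, status):
            return index
    return None


# Sequences 1 and 3 show the same counters once the upgrade has started.
SEQ_RESTART = [
    (2, {"replicas": 3, "availableReplicas": 2, "readyReplicas": 2,
         "updatedReplicas": 1, "unavailableReplicas": 1}),
    (2, {"replicas": 3, "availableReplicas": 2, "readyReplicas": 2,
         "updatedReplicas": 2, "unavailableReplicas": 1}),
    (2, {"replicas": 2, "availableReplicas": 2, "readyReplicas": 2,
         "updatedReplicas": 2}),
]
# Sequence 2: scale up from 2 to 3 without restarting existing Pods.
SEQ_SCALE_UP = [
    (3, {"replicas": 3, "availableReplicas": 2, "readyReplicas": 2,
         "updatedReplicas": 3, "unavailableReplicas": 1}),
    (3, {"replicas": 3, "availableReplicas": 3, "readyReplicas": 3,
         "updatedReplicas": 3}),
]

print(first_stop(SEQ_RESTART))   # 2, the last snapshot
print(first_stop(SEQ_SCALE_UP))  # 1, the last snapshot
```

In both cases the check only stops on the last observed line, which matches the expectation for these sequences.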
This algorithm seems to work fine, but a simpler one might be to just check the .status.unavailableReplicas counter and stop waiting when it is no longer present...
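A minimal Python sketch of that simpler idea, replayed on the in-progress lines of the three Deployment sequences above. The function name `deployment_done_simple` is hypothetical, and treating a value of 0 the same as an absent counter is an assumption of this sketch, not something verified in the tests above.

```python
def deployment_done_simple(status: dict) -> bool:
    """Stop waiting once .status.unavailableReplicas is no longer present."""
    # Assumption: a value of 0 is treated like an absent counter.
    return not status.get("unavailableReplicas")


# .status counters after the upgrade has started (sequences 1 and 3).
SEQ_RESTART = [
    {"replicas": 3, "readyReplicas": 2, "updatedReplicas": 1,
     "unavailableReplicas": 1},
    {"replicas": 3, "readyReplicas": 2, "updatedReplicas": 2,
     "unavailableReplicas": 1},
    {"replicas": 2, "readyReplicas": 2, "updatedReplicas": 2},
]
# Sequence 2: scale up from 2 to 3.
SEQ_SCALE_UP = [
    {"replicas": 3, "readyReplicas": 2, "updatedReplicas": 3,
     "unavailableReplicas": 1},
    {"replicas": 3, "readyReplicas": 3, "updatedReplicas": 3},
]

for name, seq in (("restart", SEQ_RESTART), ("scale up", SEQ_SCALE_UP)):
    decisions = [deployment_done_simple(status) for status in seq]
    print(name, decisions)
```

On these sequences the simpler check gives the same stopping point as the v3 check, but it has only been compared against the lines listed in this issue.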
For StatefulSets
Here is a sequence of counter updates during a rolling update of a StatefulSet in the following situation:
- Number of replicas is unchanged at 2
- All Pods are restarted due to a change in their annotations
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2 .status.currentReplicas:2 .status.updatedReplicas:2
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2 .status.currentReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1 .status.currentReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1 .status.currentReplicas:1 .status.updatedReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2 .status.updatedReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1 .status.updatedReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1 .status.updatedReplicas:2
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2 .status.currentReplicas:2 .status.updatedReplicas:2
Here is another sequence of counter updates during a rolling update of a StatefulSet in the following situation:
- Number of replicas is changed from 2 to 3
- Nothing changed in Pods definition => existing Pods are not restarted
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2 .status.currentReplicas:2 .status.updatedReplicas:2
.spec.replicas: 3 .status.replicas:3 .status.readyReplicas:2 .status.currentReplicas:3 .status.updatedReplicas:3
.spec.replicas: 3 .status.replicas:3 .status.readyReplicas:3 .status.currentReplicas:3 .status.updatedReplicas:3
Here is a final sequence of counter updates during a rolling update of a StatefulSet in the following situation:
- Number of replicas is changed from 3 to 2
- Change in the annotations => all existing Pods are also restarted
.spec.replicas: 3 .status.replicas:3 .status.readyReplicas:3 .status.currentReplicas:3 .status.updatedReplicas:3
.spec.replicas: 2 .status.replicas:3 .status.readyReplicas:3 .status.currentReplicas:2
.spec.replicas: 2 .status.replicas:3 .status.readyReplicas:2 .status.currentReplicas:2
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2 .status.currentReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1 .status.currentReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1 .status.currentReplicas:1 .status.updatedReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2 .status.updatedReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1 .status.updatedReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1 .status.updatedReplicas:2
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2 .status.currentReplicas:2 .status.updatedReplicas:2
The algorithm currently implemented in Spray v3 to check the end of the "wait" sequence for StatefulSets is:
- if .spec.replicas != .status.readyReplicas then continue to wait
- else
  - if .spec.replicas == .status.currentReplicas then STOP waiting
  - else continue to wait
It uses the .spec.replicas, .status.currentReplicas and .status.readyReplicas counters, but not the .status.replicas and .status.updatedReplicas counters.
Remark: this algorithm does not correctly handle the 3rd sequence mentioned above: the waiting period ends immediately (while .status.replicas is still 3) because .spec.replicas is not compared with .status.replicas (as is done for Deployments). A fix seems to be to change the second if statement to:
- if (.spec.replicas == .status.currentReplicas) and (.spec.replicas == .status.replicas) then STOP waiting
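The remark can be checked by replaying the 3rd StatefulSet sequence in a short Python sketch. The names `row`, `sts_done_v3`, `sts_done_fixed` and `first_stop` are hypothetical helpers for this illustration, not Spray code; the snapshots are the lines of the 3rd sequence without the initial, pre-upgrade line, and a counter missing from a line is treated as "not equal".

```python
def row(replicas, ready, current=None, updated=None):
    """Build a .status dict, leaving out counters that were not reported."""
    status = {"replicas": replicas, "readyReplicas": ready}
    if current is not None:
        status["currentReplicas"] = current
    if updated is not None:
        status["updatedReplicas"] = updated
    return status


def sts_done_v3(spec_replicas, status):
    """v3-style StatefulSet check as described in this issue."""
    if spec_replicas != status.get("readyReplicas"):
        return False
    return spec_replicas == status.get("currentReplicas")


def sts_done_fixed(spec_replicas, status):
    """Proposed fix: also require .status.replicas to match."""
    return (sts_done_v3(spec_replicas, status)
            and spec_replicas == status.get("replicas"))


def first_stop(check, spec_replicas, statuses):
    """Index of the first snapshot at which `check` stops, or None."""
    for index, status in enumerate(statuses):
        if check(spec_replicas, status):
            return index
    return None


# 3rd sequence: scale down from 3 to 2, all Pods restarted.
SEQ3 = [
    row(3, 3, current=2),
    row(3, 2, current=2),
    row(2, 2, current=1),
    row(2, 1, current=1),
    row(2, 1, current=1, updated=1),
    row(2, 2, updated=1),
    row(2, 1, updated=1),
    row(2, 1, updated=2),
    row(2, 2, current=2, updated=2),
]

print(first_stop(sts_done_v3, 2, SEQ3))     # 1: stops while 3 replicas remain
print(first_stop(sts_done_fixed, 2, SEQ3))  # 8: stops on the last snapshot
```

With the extra comparison, the check keeps waiting until the last line of the sequence.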
Can we homogenize the 2 algorithms?
The second if statement of the algorithm cannot be exactly the same for both Deployments and StatefulSets because:
- Deployments do not have a .status.currentReplicas counter => cannot use it
- StatefulSets have the .status.updatedReplicas counter, but its value becomes equal to .spec.replicas before all steps are completed (there is always a last line where .status.currentReplicas is set to the right value)
Spray v4
The sequences are supposed to be the same, as they depend only on Kubernetes and not on Spray itself. Note that I was unfortunately not able to test Spray v4 to confirm this.
In any case, the algorithm to check the end of the "wait" sequence differs from Spray v3, and both Deployments and StatefulSets use the same one:
- if .status.readyReplicas is defined (and not equal to 0 ?)
  - if .status.readyReplicas < .spec.replicas then continue to wait
  - else STOP waiting
- else continue to wait
(I am not sure about my analysis of the go-template => to be confirmed...)
Issue?
According to my analysis, this algorithm unfortunately does NOT work for sequences 1 and 3, for both Deployments and StatefulSets. This would need to be verified in practice... If confirmed, the algorithms would have to be updated accordingly. Following the same algorithms as implemented in Spray v3?
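The analysis can be replayed on the recorded counters with a short Python sketch. It follows the reading of the v4 go-template given above, which is itself unconfirmed; `done_v4` and `first_stop` are hypothetical names, a readyReplicas value of 0 is treated like an undefined one, and only the counters used by the check are kept in the snapshots (initial, pre-upgrade lines left out).

```python
def done_v4(spec_replicas, status):
    """v4-style check as understood in this issue (to be confirmed)."""
    ready = status.get("readyReplicas")
    if not ready:  # undefined, or 0 (assumption)
        return False
    return ready >= spec_replicas


def first_stop(spec_replicas, statuses):
    """Index of the first snapshot at which waiting stops, or None."""
    for index, status in enumerate(statuses):
        if done_v4(spec_replicas, status):
            return index
    return None


# Deployment, sequences 1 and 3: 2 replicas, all Pods restarted.
DEPLOY_RESTART = [
    {"replicas": 3, "readyReplicas": 2},
    {"replicas": 3, "readyReplicas": 2},
    {"replicas": 2, "readyReplicas": 2},
]
# Deployment, sequence 2: scale up from 2 to 3.
DEPLOY_SCALE_UP = [
    {"replicas": 3, "readyReplicas": 2},
    {"replicas": 3, "readyReplicas": 3},
]
# StatefulSet, sequence 3: scale down from 3 to 2, all Pods restarted.
STS_SCALE_DOWN = [
    {"replicas": 3, "readyReplicas": 3},
    {"replicas": 3, "readyReplicas": 2},
    {"replicas": 2, "readyReplicas": 2},
    {"replicas": 2, "readyReplicas": 1},
    {"replicas": 2, "readyReplicas": 1},
    {"replicas": 2, "readyReplicas": 2},
    {"replicas": 2, "readyReplicas": 1},
    {"replicas": 2, "readyReplicas": 1},
    {"replicas": 2, "readyReplicas": 2},
]

print(first_stop(2, DEPLOY_RESTART))   # 0: stops on the first snapshot
print(first_stop(3, DEPLOY_SCALE_UP))  # 1: stops on the last snapshot
print(first_stop(2, STS_SCALE_DOWN))   # 0: stops on the first snapshot
```

Under these assumptions, only the scale-up sequence waits until its last line; the others stop on the first snapshot, which is consistent with the analysis above but still needs to be verified against a real Spray v4 run.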