Inconsistency in Deployments and Statefulsets "wait" sequence between Spray v3 and v4 + bugs

Open pamiel opened this issue 4 years ago • 0 comments

Spray v3 checks the completion of the upgrade process differently when "waiting" on a Deployment than when waiting on a StatefulSet, while Spray v4 uses an entirely different algorithm, which is the same for both Deployments and StatefulSets. Which one is right, and which one is wrong?

This issue analyzes the counters available for Deployments and for StatefulSets, in order to find the right way to detect the end of the upgrade process. It considers the RollingUpdate update strategy only. For StatefulSets deployed with the OnDelete strategy, no "wait" should be done (refer to issue https://github.com/ThalesGroup/helm-spray/issues/58). Tests were performed on Kubernetes 1.14; I do not know whether the results change on more recent versions, sorry for that.

Spray v3

For Deployments

Here is a sequence of counter updates during a rolling update of a Deployment in the following situation:

Number of replicas is unchanged at 2
All Pods are restarted due to a change in their annotations

.spec.replicas: 2 .status.replicas:2 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:2
.spec.replicas: 2 .status.replicas:3 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:1 .status.unavailableReplicas:1
.spec.replicas: 2 .status.replicas:3 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:2 .status.unavailableReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:2

Here is another sequence of counter updates during a rolling update of a Deployment in the following situation:

Number of replicas is changed from 2 to 3
Nothing changed in Pods definition => existing Pods are not restarted

.spec.replicas: 2 .status.replicas:2 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:2
.spec.replicas: 3 .status.replicas:3 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:3 .status.unavailableReplicas:1
.spec.replicas: 3 .status.replicas:3 .status.availableReplicas:3 .status.readyReplicas:3 .status.updatedReplicas:3

Here is a final sequence of counter updates during a rolling update of a Deployment in the following situation:

Number of replicas is changed from 3 to 2
Change in the annotations => all existing Pods are also restarted

.spec.replicas: 3 .status.replicas:3 .status.availableReplicas:3 .status.readyReplicas:3 .status.updatedReplicas:3
.spec.replicas: 2 .status.replicas:3 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:1 .status.unavailableReplicas:1
.spec.replicas: 2 .status.replicas:3 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:2 .status.unavailableReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.availableReplicas:2 .status.readyReplicas:2 .status.updatedReplicas:2

The algorithm currently implemented in Spray v3 to check the end of the "wait" sequence for Deployments is:

If .spec.replicas != .status.readyReplicas then continue to wait
else
- if (.spec.replicas == .status.updatedReplicas) and (.spec.replicas == .status.replicas) then STOP waiting
- else continue to wait

It uses the .spec.replicas, .status.replicas, .status.updatedReplicas and .status.readyReplicas counters but not the .status.availableReplicas and .status.unavailableReplicas counters.
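
As a sanity check, the v3 Deployment test can be replayed against the sequences above. This is only a Python sketch, not Spray's actual code: the function name and the shortened dict keys (spec, replicas, ready, updated) are made up for the illustration, and a counter that is missing from the status is simply a missing key.

```python
def deployment_v3_done(s):
    """Return True when the v3 Deployment check would stop waiting."""
    spec = s["spec"]
    if s.get("ready") != spec:
        return False  # continue to wait
    return s.get("updated") == spec and s.get("replicas") == spec

# Sequence 1 (replicas unchanged at 2, Pods restarted), lines 2 to 4.
restart = [
    {"spec": 2, "replicas": 3, "ready": 2, "updated": 1},
    {"spec": 2, "replicas": 3, "ready": 2, "updated": 2},
    {"spec": 2, "replicas": 2, "ready": 2, "updated": 2},
]
# Sequence 2 (scale from 2 to 3, Pods not restarted), lines 2 to 3.
scale_up = [
    {"spec": 3, "replicas": 3, "ready": 2, "updated": 3},
    {"spec": 3, "replicas": 3, "ready": 3, "updated": 3},
]

restart_verdicts = [deployment_v3_done(s) for s in restart]
scale_up_verdicts = [deployment_v3_done(s) for s in scale_up]
print(restart_verdicts)   # [False, False, True]
print(scale_up_verdicts)  # [False, True]
```

In both replays the check only passes on the last line of the sequence. The third sequence has the same counters as the first one from its second line onward, so it gives the same verdicts.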

This algorithm appears to work fine, but a simpler one might be to check only the .status.unavailableReplicas counter and stop waiting once it is no longer present.
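
A minimal sketch of that simpler check, replayed on the first Deployment sequence (all four lines this time). The function name and the short key names are hypothetical; a counter that Kubernetes omits is a missing key.

```python
def deployment_done_by_unavailable(s):
    """Stop waiting as soon as unavailableReplicas is absent from the status."""
    return "unavailable" not in s

# Sequence 1 for Deployments, including the line before the update starts.
rows = [
    {"spec": 2, "replicas": 2, "ready": 2, "updated": 2},
    {"spec": 2, "replicas": 3, "ready": 2, "updated": 1, "unavailable": 1},
    {"spec": 2, "replicas": 3, "ready": 2, "updated": 2, "unavailable": 1},
    {"spec": 2, "replicas": 2, "ready": 2, "updated": 2},
]

unavailable_verdicts = [deployment_done_by_unavailable(s) for s in rows]
print(unavailable_verdicts)  # [True, False, False, True]
```

On these traces the check ends on the last line, as expected. Note that it also passes on the very first line, before any counter has moved, so it would only be safe if polling starts after the status has begun to change.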

For StatefulSets

Here is a sequence of counter updates during a rolling update of a StatefulSet in the following situation:

Number of replicas is unchanged at 2
All Pods are restarted due to a change in their annotations

.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2 .status.currentReplicas:2 .status.updatedReplicas:2
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2 .status.currentReplicas:1                          
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1 .status.currentReplicas:1                          
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1 .status.currentReplicas:1 .status.updatedReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2                           .status.updatedReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1                           .status.updatedReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1                           .status.updatedReplicas:2
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2 .status.currentReplicas:2 .status.updatedReplicas:2

Here is another sequence of counter updates during a rolling update of a StatefulSet in the following situation:

Number of replicas is changed from 2 to 3
Nothing changed in Pods definition => existing Pods are not restarted

.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2 .status.currentReplicas:2 .status.updatedReplicas:2
.spec.replicas: 3 .status.replicas:3 .status.readyReplicas:2 .status.currentReplicas:3 .status.updatedReplicas:3
.spec.replicas: 3 .status.replicas:3 .status.readyReplicas:3 .status.currentReplicas:3 .status.updatedReplicas:3

Here is a final sequence of counter updates during a rolling update of a StatefulSet in the following situation:

Number of replicas is changed from 3 to 2
Change in the annotations => all existing Pods are also restarted

.spec.replicas: 3 .status.replicas:3 .status.readyReplicas:3 .status.currentReplicas:3 .status.updatedReplicas:3
.spec.replicas: 2 .status.replicas:3 .status.readyReplicas:3 .status.currentReplicas:2 
.spec.replicas: 2 .status.replicas:3 .status.readyReplicas:2 .status.currentReplicas:2 
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2 .status.currentReplicas:1 
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1 .status.currentReplicas:1 
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1 .status.currentReplicas:1 .status.updatedReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2                           .status.updatedReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1                           .status.updatedReplicas:1
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:1                           .status.updatedReplicas:2
.spec.replicas: 2 .status.replicas:2 .status.readyReplicas:2 .status.currentReplicas:2 .status.updatedReplicas:2

The algorithm currently implemented in Spray v3 to check the end of the "wait" sequence for StatefulSets is:

If .spec.replicas != .status.readyReplicas then continue to wait
else
- if .spec.replicas == .status.currentReplicas then STOP waiting
- else continue to wait

It uses the .spec.replicas, .status.currentReplicas and .status.readyReplicas counters but not the .status.replicas and .status.updatedReplicas counters.
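
The same kind of replay can be done for the v3 StatefulSet check. Again this is a Python sketch with invented names (statefulset_v3_done, and the short keys spec, replicas, ready, current, updated), not the real Spray code. It runs the check on lines 2 to 8 of the first StatefulSet sequence.

```python
def statefulset_v3_done(s):
    """Return True when the v3 StatefulSet check would stop waiting."""
    spec = s["spec"]
    if s.get("ready") != spec:
        return False  # continue to wait
    # A missing currentReplicas counter never equals spec.replicas.
    return s.get("current") == spec

# Sequence 1 (replicas unchanged at 2, Pods restarted), lines 2 to 8.
restart = [
    {"spec": 2, "replicas": 2, "ready": 2, "current": 1},
    {"spec": 2, "replicas": 2, "ready": 1, "current": 1},
    {"spec": 2, "replicas": 2, "ready": 1, "current": 1, "updated": 1},
    {"spec": 2, "replicas": 2, "ready": 2, "updated": 1},
    {"spec": 2, "replicas": 2, "ready": 1, "updated": 1},
    {"spec": 2, "replicas": 2, "ready": 1, "updated": 2},
    {"spec": 2, "replicas": 2, "ready": 2, "current": 2, "updated": 2},
]

sts_restart_verdicts = [statefulset_v3_done(s) for s in restart]
print(sts_restart_verdicts)
# [False, False, False, False, False, False, True]
```

For this sequence the check holds until the last line, where .status.currentReplicas comes back with the right value.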

Remark: this algorithm does not handle the 3rd sequence above correctly. The wait ends almost immediately (on the third line of the sequence, while .status.replicas is still 3) because .spec.replicas is never compared with .status.replicas, as is done for Deployments. A possible fix is to change the second if statement to:

if (.spec.replicas == .status.currentReplicas) and (.spec.replicas == .status.replicas) then STOP waiting
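
To illustrate, here is a Python sketch that replays lines 2 to 10 of the 3rd StatefulSet sequence through both the current check and the proposed one. The function names and short key names are hypothetical, and missing counters are missing keys.

```python
def statefulset_v3_done(s):
    """Current v3 check: readyReplicas and currentReplicas only."""
    spec = s["spec"]
    return s.get("ready") == spec and s.get("current") == spec

def statefulset_fixed_done(s):
    """Proposed check: also require status.replicas == spec.replicas."""
    return statefulset_v3_done(s) and s.get("replicas") == s["spec"]

# Sequence 3 (scale from 3 to 2, Pods restarted), lines 2 to 10.
scale_down = [
    {"spec": 2, "replicas": 3, "ready": 3, "current": 2},
    {"spec": 2, "replicas": 3, "ready": 2, "current": 2},
    {"spec": 2, "replicas": 2, "ready": 2, "current": 1},
    {"spec": 2, "replicas": 2, "ready": 1, "current": 1},
    {"spec": 2, "replicas": 2, "ready": 1, "current": 1, "updated": 1},
    {"spec": 2, "replicas": 2, "ready": 2, "updated": 1},
    {"spec": 2, "replicas": 2, "ready": 1, "updated": 1},
    {"spec": 2, "replicas": 2, "ready": 1, "updated": 2},
    {"spec": 2, "replicas": 2, "ready": 2, "current": 2, "updated": 2},
]

v3_stop = [statefulset_v3_done(s) for s in scale_down].index(True)
fixed_stop = [statefulset_fixed_done(s) for s in scale_down].index(True)
print(v3_stop, fixed_stop)  # 1 8
```

The current check stops at index 1 of the replay, while the extra pod is still counted in .status.replicas. With the additional comparison it only stops at index 8, the last line.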

Can we homogenize the 2 algorithms?

The second if statement of the algorithm cannot be exactly the same for Deployments and StatefulSets because:

Deployments do not have a .status.currentReplicas => cannot use it
StatefulSets do have .status.updatedReplicas, but its value becomes equal to .spec.replicas before all steps are completed (there is always a last line where .status.currentReplicas is set to the right value)
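
One possible shape for a partly shared check, as a sketch only: the tests on .status.readyReplicas and .status.replicas (the latter being the fix proposed above for StatefulSets) are common, and only the last counter depends on the kind. The function name, the kind strings and the short key names are assumptions made for this example.

```python
def rollout_done(kind, s):
    """Shared wait check; only the final counter differs per kind."""
    spec = s["spec"]
    if s.get("ready") != spec or s.get("replicas") != spec:
        return False  # continue to wait
    # Deployments have no currentReplicas, so they use updatedReplicas.
    counter = "updated" if kind == "Deployment" else "current"
    return s.get(counter) == spec

# Last lines of the first Deployment and StatefulSet sequences.
print(rollout_done("Deployment",
                   {"spec": 2, "replicas": 2, "ready": 2, "updated": 2}))
print(rollout_done("StatefulSet",
                   {"spec": 2, "replicas": 2, "ready": 2,
                    "current": 2, "updated": 2}))
# Mid-rollout StatefulSet line where currentReplicas is absent.
print(rollout_done("StatefulSet",
                   {"spec": 2, "replicas": 2, "ready": 2, "updated": 1}))
```

The first two calls return True and the last one returns False, which matches what the two separate v3 checks (with the StatefulSet fix) would give on those lines.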

Spray v4

The sequences should be the same, since they depend only on Kubernetes and not on Spray itself. Note that I was unfortunately not able to test Spray v4 to confirm this.

In any case, the algorithms to check the end of the "wait" sequence are different from Spray v3: both Deployments and StatefulSets have the same algorithm:

if .status.readyReplicas is defined (and not equal to 0 ?)
- if .status.readyReplicas < .spec.replicas then continue to wait
- else STOP waiting
else continue to wait

(I am not sure I read the go-template correctly, so this needs to be confirmed.)

Issue?

According to my analysis, this algorithm unfortunately does NOT work for sequences 1 and 3, for both Deployments and StatefulSets: in those sequences .status.readyReplicas is already at least .spec.replicas while the update is still in progress. This needs to be verified in practice. If it is confirmed, the v4 checks would have to be updated accordingly, perhaps by following the algorithms implemented in Spray v3?
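
The v4 check, as I understand it, can be replayed on the v3 traces under the assumption that the counter sequences are the same in v4. This Python sketch uses made-up names (v4_done and the short keys spec, replicas, ready, current, updated) and treats a readyReplicas of 0 like a missing counter, which is my reading of the template and may be wrong.

```python
def v4_done(s):
    """v4 check as understood here: readyReplicas defined and >= spec."""
    ready = s.get("ready")
    if not ready:  # missing or 0: continue to wait
        return False
    return ready >= s["spec"]

# Deployment sequence 1 (Pods restarted), lines 2 to 4.
deployment_restart = [
    {"spec": 2, "replicas": 3, "ready": 2, "updated": 1},
    {"spec": 2, "replicas": 3, "ready": 2, "updated": 2},
    {"spec": 2, "replicas": 2, "ready": 2, "updated": 2},
]
# StatefulSet sequence 1 (Pods restarted), lines 2 to 8.
statefulset_restart = [
    {"spec": 2, "replicas": 2, "ready": 2, "current": 1},
    {"spec": 2, "replicas": 2, "ready": 1, "current": 1},
    {"spec": 2, "replicas": 2, "ready": 1, "current": 1, "updated": 1},
    {"spec": 2, "replicas": 2, "ready": 2, "updated": 1},
    {"spec": 2, "replicas": 2, "ready": 1, "updated": 1},
    {"spec": 2, "replicas": 2, "ready": 1, "updated": 2},
    {"spec": 2, "replicas": 2, "ready": 2, "current": 2, "updated": 2},
]

v4_deployment = [v4_done(s) for s in deployment_restart]
v4_statefulset = [v4_done(s) for s in statefulset_restart]
print(v4_deployment)   # [True, True, True]
print(v4_statefulset)  # [True, False, False, True, False, False, True]
```

In both replays the check already passes on the first line after the update starts, so the wait would end before any Pod has actually been replaced.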

Jul 16 '20 18:07 pamiel
