K8SSAND-1740 ⁃ Make decommission operations deterministic
Nodes should be decommissioned in a deterministic, balanced way.
We should always decommission from the rack that has the most nodes up and ready relative to its target size.
If multiple racks are tied for the most nodes, we must sort by rack name and decommission in the rack that comes first in that order.
There should never be a scenario where we decommission multiple nodes from a single rack before the other racks in the DC have been brought down to the same node count.
See #381.
Isn't this already deterministic? We calculate the new rack sizes by reducing the total nodeCount by 1. We then calculate the new target size for each rack and compare whether it matches the current size.
Thus, let's say we scale from 9 -> 6, with 3 racks:
First round, new target size 8:
- rack1 -> new size 2, currentSize 3
- rack2 -> new size 2, currentSize 3
- rack3 -> new size 2, currentSize 3
Loop over the racks, see that rack1 is not correct. Start scaling that rack down to 2. Wait until it completes.
Next round, new target size 7:
- rack1 -> new size 2, currentSize 2
- rack2 -> new size 2, currentSize 3
- rack3 -> new size 2, currentSize 3
Loop over the racks: rack1 is fine, so go to the next one. rack2 has the wrong size, so scale it down. Wait until it completes.
Next round, new target size 6:
- rack1 -> new size 2, currentSize 2
- rack2 -> new size 2, currentSize 2
- rack3 -> new size 2, currentSize 3
Scale down rack3, wait.
And so on; even if you scaled down to 3 nodes it should not scale incorrectly. If it does, we have a bug (can you replicate it?), but the behavior should already be there.
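For illustration, here is a minimal Go sketch of one reading of that loop. This is not the cass-operator code: splitRacks, pickRackToScaleDown, and the choice of which racks absorb the remainder of an uneven split are assumptions made so that the rounds line up with the example above.

```go
package main

import "fmt"

// splitRacks spreads targetTotal nodes across rackCount racks as evenly as
// possible. Which racks absorb the remainder of an uneven split is an
// assumption made for this sketch (here: the racks declared last).
func splitRacks(targetTotal, rackCount int) []int {
	base, extra := targetTotal/rackCount, targetTotal%rackCount
	sizes := make([]int, rackCount)
	for i := range sizes {
		sizes[i] = base
		if i >= rackCount-extra {
			sizes[i]++
		}
	}
	return sizes
}

// pickRackToScaleDown returns the index of the first rack, in declared order,
// whose current size exceeds its per-rack target, or -1 if every rack matches.
func pickRackToScaleDown(currentSizes []int, targetTotal int) int {
	targets := splitRacks(targetTotal, len(currentSizes))
	for i, cur := range currentSizes {
		if cur > targets[i] {
			return i
		}
	}
	return -1
}

func main() {
	// Scaling 9 -> 6 across three racks: each round reduces the target total
	// by one, scales down the first mismatching rack, then waits for the
	// decommission to finish (not modeled here) before the next round.
	current := []int{3, 3, 3} // rack1, rack2, rack3
	for target := 8; target >= 6; target-- {
		rack := pickRackToScaleDown(current, target)
		fmt.Printf("target %d: scale down rack%d\n", target, rack+1)
		current[rack]--
	}
	// Prints rack1, then rack2, then rack3 -- matching the rounds above.
}
```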
It does look from your example that it is deterministic already.
@bradfordcp can we close this? Asking since the issue was originally brought up by you.
How do we choose which rack to reduce?
Is the set of racks sorted before selection?
Is there a chance for rack2 and rack3 to get switched between runs in that set?
We always choose the next one, in the order they were defined in the CassandraDatacenter object.
So if you've set:
racks:
- r1
- r2
- r3
Then the order is: r1 -> r2 -> r3. And no, it shouldn't be possible for the order to change from one round to the next, though I don't really see why that would matter.
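As a small illustration (the types below are simplified stand-ins, not the real CassandraDatacenter API), the scan simply follows the slice order of the declared racks, which is why the selection is stable from one round to the next:

```go
// Simplified stand-ins for the CassandraDatacenter rack definitions.
type Rack struct {
	Name string
}

type DatacenterSpec struct {
	Racks []Rack
}

// rackScanOrder returns the rack names in the order they were declared;
// iterating a Go slice always preserves that order.
func rackScanOrder(spec DatacenterSpec) []string {
	names := make([]string, 0, len(spec.Racks))
	for _, rack := range spec.Racks {
		names = append(names, rack.Name)
	}
	return names // for the spec above: [r1 r2 r3]
}
```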
So I actually think we need to change a few things, similar to what I did in #403. The way we decommission does not take into account the actual StatefulSet sizes, which may have been altered to create unbalanced racks.
Example: assuming dc.Spec.Size = 6 and the following StatefulSets:
- rack1 2 replicas: node1a, node1b
- rack2 3 replicas: node2a, node2b, node2c
- rack3 2 replicas: node3a, node3b
In the example above, if we want to scale down from 6 to 3, then according to the desired exit criteria for this story we need to decommission the extra pod node2c first.
Also, if we really want to be strict about ordering, it would be better to decommission in the inverse order of bootstrap, that is: node2c, then node3b, then node2b, then node1b. Not that I think it matters a lot, but since I'm going to change the code, I can also enforce that.
@bradfordcp let me know if you are OK with all this.
In that scenario, when the decommission starts, the size to compare against is 7, not 6. dc.Spec.Size is not what we compare to; it's the current number of pods, calculated from the StatefulSets. So in your scenario, we do:
currentSize 7, targetSize 6 (since currentSize is larger than dc.Spec.Size, be it 6 or 3 or 0)
- rack1 => correct
- rack2 => decommission
- rack3 => correct
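A short sketch of that comparison, building on the pickRackToScaleDown helper from the earlier example (the statefulSetReplicas input and the nextTargetSize helper are illustrative assumptions, not the operator's real API):

```go
// currentDatacenterSize sums the replica counts of the per-rack StatefulSets;
// this, not dc.Spec.Size, is the size the scale-down comparison starts from.
func currentDatacenterSize(statefulSetReplicas []int32) int {
	total := 0
	for _, replicas := range statefulSetReplicas {
		total += int(replicas)
	}
	return total
}

// nextTargetSize steps the datacenter down by one node per round, never
// going below the declared dc.Spec.Size.
func nextTargetSize(currentSize, specSize int) int {
	if currentSize > specSize {
		return currentSize - 1
	}
	return currentSize
}

// For the 2/3/2 example: currentSize = 7, targetSize = 6, per-rack targets
// 2/2/2, so pickRackToScaleDown([]int{2, 3, 2}, 6) picks rack2.
```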
You are right, but I think my point is still valid; let's consider another example:
- rack1 3 replicas: node1a, node1b, node1c
- rack2 4 replicas: node2a, node2b, node2c, node2d
- rack3 3 replicas: node3a, node3b, node3c
If we want to scale down to 3, the first node to be decommissioned should be node2d. But with the current algorithm, node1c would go first.
Oh OK, you were right all along. In my last example, node2d would go first since currentSize = 10, targetSize = 9. I opened #407 in the meantime, but we might not need it after all.
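Running the same sketch from earlier on the 3/4/3 layout confirms this (reusing the hypothetical pickRackToScaleDown helper, so this fragment is not standalone):

```go
// 3/4/3 layout: currentSize = 10, so targetSize = 9 and the per-rack
// targets are 3/3/3; only rack2 exceeds its target.
fmt.Println(pickRackToScaleDown([]int{3, 4, 3}, 9)) // prints 1, i.e. rack2 -> node2d
```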
Unless #407 really fixes something that's not working, I'd rather not replace working functionality with a new implementation; that opens up the possibility of introducing a bug.
The only thing it does is reverse the rack order to be the opposite of declaration order.
@bradfordcp are you OK with decommission using the same rack order as bootstrapping?
The only minor downside I can see is that, during a decommission involving 2 or more nodes, the datacenter might temporarily expose an unusual topology; the resulting topology, though, will always be a correct one.
If everybody is OK we can close this ticket as well as #407.
Note: the documentation part of this is going to be done in the k8ssandra repo; the PR is up there:
https://github.com/k8ssandra/k8ssandra/pull/1505
I'm fine with either way.
Fair enough, closing.