K8SSAND-1740 ⁃ Make decommission operations deterministic
Nodes should be decommissioned in a deterministic, balanced way.
We should always decommission from the rack that has the most nodes up and ready relative to its target size.
If multiple racks are tied for the most nodes, we must sort by rack name and decommission in the rack that comes first in that order.
There should never be a scenario where we decommission multiple nodes from a single rack before the other racks in the DC have been brought down to the same node count.
See #381.
Isn't this already deterministic? We calculate the new rack sizes by reducing the total nodeCount by 1. We then calculate the new target size for each rack and compare whether it matches the current size.
Thus, let's say we scale from 9 -> 6, with 3 racks:
First round, new target size 8:
- rack1 -> new size 2, currentSize 3
- rack2 -> new size 2, currentSize 3
- rack3 -> new size 2, currentSize 3
Loop over the racks, see that rack1 is not correct. Start scaling that rack down to 2. Wait until it completes.
Next round, new target size 7:
- rack1 -> new size 2, currentSize 2
- rack2 -> new size 2, currentSize 3
- rack3 -> new size 2, currentSize 3
Loop over the racks: rack1 is fine, so go to the next one. rack2 has the wrong size, so scale it down. Wait until it completes.
Next round, new target size 6:
- rack1 -> new size 2, currentSize 2
- rack2 -> new size 2, currentSize 2
- rack3 -> new size 2, currentSize 3
Scale down rack3, wait.
And so on; even if you scaled down to 3 nodes it should not scale incorrectly. If it does, we have a bug (can you replicate it?), but the behavior should already be there.
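For illustration, here is a minimal Go sketch of one reading of that loop. This is not the cass-operator code: splitRacks, pickRackToScaleDown, and the choice of which racks absorb the remainder of an uneven split are assumptions made so that the rounds line up with the example above.

```go
package main

import "fmt"

// splitRacks spreads targetTotal nodes across rackCount racks as evenly as
// possible. Which racks absorb the remainder of an uneven split is an
// assumption made for this sketch (here: the racks declared last).
func splitRacks(targetTotal, rackCount int) []int {
	base, extra := targetTotal/rackCount, targetTotal%rackCount
	sizes := make([]int, rackCount)
	for i := range sizes {
		sizes[i] = base
		if i >= rackCount-extra {
			sizes[i]++
		}
	}
	return sizes
}

// pickRackToScaleDown returns the index of the first rack, in declared order,
// whose current size exceeds its per-rack target, or -1 if every rack matches.
func pickRackToScaleDown(currentSizes []int, targetTotal int) int {
	targets := splitRacks(targetTotal, len(currentSizes))
	for i, cur := range currentSizes {
		if cur > targets[i] {
			return i
		}
	}
	return -1
}

func main() {
	// Scaling 9 -> 6 across three racks: each round reduces the target total
	// by one, scales down the first mismatching rack, then waits for the
	// decommission to finish (not modeled here) before the next round.
	current := []int{3, 3, 3} // rack1, rack2, rack3
	for target := 8; target >= 6; target-- {
		rack := pickRackToScaleDown(current, target)
		fmt.Printf("target %d: scale down rack%d\n", target, rack+1)
		current[rack]--
	}
	// Prints rack1, then rack2, then rack3 -- matching the rounds above.
}
```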
It does look from your example that it is deterministic already.
@bradfordcp can we close this? Asking since the issue was originally brought up by you.
How do we choose which rack to reduce?
Is the set of racks sorted before selection?
Is there a chance for rack2 and rack3 to get switched between runs in that set?
We always choose the next one, in the order they were defined in the CassandraDatacenter object.
So if you've set:
racks:
- r1
- r2
- r3
Then the order is: r1 -> r2 -> r3. And no, it shouldn't be possible for the order to change from one round to the next, though I don't really see why that would matter.
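As a small illustration (the types below are simplified stand-ins, not the real CassandraDatacenter API), the scan simply follows the slice order of the declared racks, which is why the selection is stable from one round to the next:

```go
// Simplified stand-ins for the CassandraDatacenter rack definitions.
type Rack struct {
	Name string
}

type DatacenterSpec struct {
	Racks []Rack
}

// rackScanOrder returns the rack names in the order they were declared;
// iterating a Go slice always preserves that order.
func rackScanOrder(spec DatacenterSpec) []string {
	names := make([]string, 0, len(spec.Racks))
	for _, rack := range spec.Racks {
		names = append(names, rack.Name)
	}
	return names // for the spec above: [r1 r2 r3]
}
```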
So I actually think we need to change a few things, similar to what I did in #403. The way we decommission does not take into account the actual StatefulSet sizes, which may have been altered to create unbalanced racks.
Example: assuming dc.Spec.Size = 6 and the following StatefulSets:
- rack1 2 replicas: node1a, node1b
- rack2 3 replicas: node2a, node2b, node2c
- rack3 2 replicas: node3a, node3b
In the example above, if we want to scale down from 6 to 3, then according to the desired exit criteria for this story we need to decommission the extra pod node2c first.
Also, if we really want to be strict about ordering, it would be better to decommission in the inverse order of bootstrap, that is: node2c, then node3b, then node2b, then node1b. Not that I think it matters a lot, but since I'm going to change the code, I can also enforce that.
@bradfordcp let me know if you are OK with all this.
In that scenario, when the decommission starts, the size to compare against is 7, not 6. dc.Spec.Size is not what we compare to; it's the current number of pods, calculated from the StatefulSets. So in your scenario, we do:
currentSize 7, targetSize 6 (since currentSize is larger than dc.Spec.Size, be it 6 or 3 or 0)
- rack1 => correct
- rack2 => decommission
- rack3 => correct
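A short sketch of that comparison, building on the pickRackToScaleDown helper from the earlier example (the statefulSetReplicas input and the nextTargetSize helper are illustrative assumptions, not the operator's real API):

```go
// currentDatacenterSize sums the replica counts of the per-rack StatefulSets;
// this, not dc.Spec.Size, is the size the scale-down comparison starts from.
func currentDatacenterSize(statefulSetReplicas []int32) int {
	total := 0
	for _, replicas := range statefulSetReplicas {
		total += int(replicas)
	}
	return total
}

// nextTargetSize steps the datacenter down by one node per round, never
// going below the declared dc.Spec.Size.
func nextTargetSize(currentSize, specSize int) int {
	if currentSize > specSize {
		return currentSize - 1
	}
	return currentSize
}

// For the 2/3/2 example: currentSize = 7, targetSize = 6, per-rack targets
// 2/2/2, so pickRackToScaleDown([]int{2, 3, 2}, 6) picks rack2.
```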
You are right, but I think my point is still valid; let's consider another example:
- rack1 3 replicas: node1a, node1b, node1c
- rack2 4 replicas: node2a, node2b, node2c, node2d
- rack3 3 replicas: node3a, node3b, node3c
If we want to scale down to 3, the first node to be decommissioned should be node2d. But with the current algorithm, node1c would go first.
Oh OK, you were right all along. In my last example, node2d would go first since currentSize = 10, targetSize = 9. I opened #407 in the meantime, but we might not need it after all.
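Running the same sketch from earlier on the 3/4/3 layout confirms this (reusing the hypothetical pickRackToScaleDown helper, so this fragment is not standalone):

```go
// 3/4/3 layout: currentSize = 10, so targetSize = 9 and the per-rack
// targets are 3/3/3; only rack2 exceeds its target.
fmt.Println(pickRackToScaleDown([]int{3, 4, 3}, 9)) // prints 1, i.e. rack2 -> node2d
```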
Unless #407 really fixes something that's not working, I'd rather not replace working functionality with a new implementation; that opens up the possibility of introducing a bug.
The only thing it does is reverse the rack order to be the opposite of declaration order.
@bradfordcp are you OK with decommission using the same rack order as bootstrapping?
The only minor downside I can see is that, during a decommission involving 2 or more nodes, the datacenter might temporarily expose an unusual topology; the resulting topology, though, will always be a correct one.
If everybody is OK we can close this ticket as well as #407.
Note: the documentation part of this is going to be done in the k8ssandra repo; the PR is up there:
https://github.com/k8ssandra/k8ssandra/pull/1505
I'm fine with either way.
Fair enough, closing.