K8SSAND-1042 ⁃ Feature request: Add option to allow all pods to start in parallel
Currently, when resuming a stopped cluster, all the Cassandra pods start up sequentially because the pod IPs change and Cassandra can only join one node at a time.
With static IPs, however, there is no concern about the IPs changing, so all the pods can start up in parallel.
An option to start all pods in parallel will significantly reduce the time to resume a large stopped cluster.
I think we'll have to commit something here that lets us toggle on an implementation using static IPs, so that this feature can be tested.
I'm curious how we could detect if the cluster is using static IPs or not. Just a boolean in the spec? I assume there is a sidecar or something that handles setting up the appropriate addresses and routing.
Yep that's what I was thinking, something like parallelResume: true. And yeah, a sidecar is handling the IP and route configuration.
Alternatively, we could add a flag such as useVirtualNetwork: true, which would, in addition to starting pods in parallel, add the sidecars that enable the virtual network. That kind of addition to the operator would be somewhat more involved, though.
Let me ask the obvious: what are the risks of starting in parallel if static IPs are not used?
I assume this would fall under the spec.networking key.
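To make the spec.networking idea concrete, here is a minimal Go sketch of how such a toggle might be modeled and parsed. The type name `NetworkingSketch` and the field `parallelResume` are assumptions taken from the suggestion in this thread; they are not existing cass-operator API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NetworkingSketch is a hypothetical stand-in for the spec.networking
// section. It is NOT the real cass-operator type.
type NetworkingSketch struct {
	// ParallelResume is the hypothetical opt-in toggle proposed above.
	// It defaults to false, which keeps today's sequential startup.
	ParallelResume bool `json:"parallelResume,omitempty"`
}

// parseNetworking decodes the JSON form of the hypothetical section.
func parseNetworking(raw []byte) (NetworkingSketch, error) {
	var n NetworkingSketch
	err := json.Unmarshal(raw, &n)
	return n, err
}

func main() {
	n, err := parseNetworking([]byte(`{"parallelResume": true}`))
	if err != nil {
		panic(err)
	}
	fmt.Println("parallel resume enabled:", n.ParallelResume)
}
```

Keeping the field a plain boolean with `omitempty` means existing specs that omit it would keep the current behavior.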
Do we still need to start seed nodes first before parallel starting the rest of the nodes?
If we start the seed nodes first (one by one), it should allow us to start other nodes in parallel even if we're not using static IPs. These nodes will then be able to connect to the cluster through the seeds and broadcast their new IP address. The scenario that Cassandra doesn't deal well with is concurrent range movements, which will not be the case here.
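The ordering described above (seeds one by one, then everything else together) could be sketched roughly like this. The `Node` type and `startBatches` function are hypothetical names for illustration only, not cass-operator code; each returned batch is meant to be started together, and batches run in order.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a Cassandra pod.
type Node struct {
	Name string
	Seed bool
}

// startBatches returns the start order as a list of batches.
// Every seed gets its own single-node batch (sequential start),
// and all non-seed nodes share one final batch (parallel start).
func startBatches(nodes []Node) [][]string {
	var batches [][]string
	var rest []string
	for _, n := range nodes {
		if n.Seed {
			batches = append(batches, []string{n.Name})
		} else {
			rest = append(rest, n.Name)
		}
	}
	if len(rest) > 0 {
		batches = append(batches, rest)
	}
	return batches
}

func main() {
	nodes := []Node{
		{Name: "seed-0", Seed: true},
		{Name: "node-1"},
		{Name: "seed-2", Seed: true},
		{Name: "node-3"},
	}
	for i, b := range startBatches(nodes) {
		fmt.Printf("batch %d: %v\n", i, b)
	}
}
```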
@bradfordcp, can we move the ticket to the product backlog or does it require a design session?
@rchernobelskiy Is this feature still necessary?
From my personal perspective I still believe it would be a good feature to have.
I agree. There have been multiple incidents where nodes that were already part of the ring were blocked from starting by cass-operator because another node was bootstrapping (which can take a while).
We need to identify whether a node has previously bootstrapped and, if so, allow it to start concurrently with other nodes, provided at least one seed node is available. We should spell out this process in more detail and list the exact conditions that must be met to enable this behavior.
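As a starting point for listing those conditions, here is a rough Go sketch of the check. `NodeState`, its `Bootstrapped` field, and `canStartConcurrently` are hypothetical names; how the operator would actually record that a node has bootstrapped is still an open question in this thread.

```go
package main

import "fmt"

// NodeState is a hypothetical summary of what the operator knows
// about a single Cassandra node.
type NodeState struct {
	Name string
	// Bootstrapped is assumed to be true once the node has joined
	// the ring at least once (the source of this fact is undecided).
	Bootstrapped bool
}

// canStartConcurrently reports whether a node may start alongside
// other nodes: it must have bootstrapped before, and at least one
// seed node must currently be available.
func canStartConcurrently(n NodeState, availableSeeds int) bool {
	return n.Bootstrapped && availableSeeds >= 1
}

func main() {
	old := NodeState{Name: "node-1", Bootstrapped: true}
	fresh := NodeState{Name: "node-2", Bootstrapped: false}
	fmt.Println(old.Name, canStartConcurrently(old, 1))
	fmt.Println(fresh.Name, canStartConcurrently(fresh, 1))
}
```

A node that has never bootstrapped would still go through the existing one-at-a-time path, which avoids concurrent range movements.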