
K8SSAND-1042 ⁃ Feature request: Add option to allow all pods to start in parallel

Open rchernobelskiy opened this issue 4 years ago • 12 comments

Currently, when resuming a stopped cluster, all the Cassandra pods start up sequentially because the pods' IPs change and Cassandra can only join one node at a time.

With static IPs, however, there is no concern about the IPs changing, so all the pods can start up in parallel.

An option to start all pods in parallel will significantly reduce the time to resume a large stopped cluster.

┆Issue is synchronized with this Jira Task by Unito ┆friendlyId: K8SSAND-1042 ┆priority: Medium

rchernobelskiy avatar Nov 08 '21 21:11 rchernobelskiy

I think we'll have to commit something here that lets us toggle on an implementation using static IPs, so that this feature can be tested.

jimdickinson avatar Nov 08 '21 23:11 jimdickinson

I'm curious how we could detect if the cluster is using static IPs or not. Just a boolean in the spec? I assume there is a sidecar or something that handles setting up the appropriate addresses and routing.

bradfordcp avatar Apr 12 '22 04:04 bradfordcp

> I'm curious how we could detect if the cluster is using static IPs or not. Just a boolean in the spec? I assume there is a sidecar or something that handles setting up the appropriate addresses and routing.

Yep that's what I was thinking, something like parallelResume: true. And yeah, a sidecar is handling the IP and route configuration.

Alternatively, we could add a flag something like useVirtualNetwork: true, and this would, in addition to starting pods in parallel, add the sidecars that enable the virtual network. Though this kind of addition to the operator would be somewhat more involved.
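
To make the first option concrete, here is a minimal Go sketch of how a boolean like parallelResume might gate the number of pods started at once. The type `ResumeOptions` and the function `maxConcurrentStarts` are hypothetical names invented for illustration; they are not part of cass-operator's API, and where the field would live in the spec is still an open question.

```go
package main

import "fmt"

// ResumeOptions is a hypothetical spec fragment; cass-operator does not
// define this type. It only models the proposed parallelResume flag.
type ResumeOptions struct {
	// ParallelResume asks the operator to start all stopped pods at once.
	ParallelResume bool `json:"parallelResume,omitempty"`
}

// maxConcurrentStarts returns how many stopped pods may be started at the
// same time: all of them when the flag is set, otherwise one at a time.
func maxConcurrentStarts(opts ResumeOptions, stoppedPods int) int {
	if stoppedPods <= 0 {
		return 0
	}
	if opts.ParallelResume {
		return stoppedPods
	}
	return 1
}

func main() {
	fmt.Println(maxConcurrentStarts(ResumeOptions{ParallelResume: true}, 6))
	fmt.Println(maxConcurrentStarts(ResumeOptions{}, 6))
}
```

Defaulting the flag to false would keep the current sequential behavior for clusters that do not opt in.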

rchernobelskiy avatar Apr 12 '22 14:04 rchernobelskiy

Let me ask the obvious: what are the risks of starting in parallel if static IPs are not used?

jsanda avatar Apr 12 '22 16:04 jsanda

Please add your planning poker estimate with ZenHub @burmanm

jsanda avatar Apr 19 '22 22:04 jsanda

I assume this would fall under the spec.networking key.

bradfordcp avatar Apr 19 '22 22:04 bradfordcp

Do we still need to start seed nodes first before parallel starting the rest of the nodes?

bradfordcp avatar Apr 20 '22 13:04 bradfordcp

> Do we still need to start seed nodes first before parallel starting the rest of the nodes?

If we start the seed nodes first (one by one), it should allow us to start other nodes in parallel even if we're not using static IPs. These nodes will then be able to connect to the cluster through the seeds and broadcast their new IP address. The scenario that Cassandra doesn't deal well with is concurrent range movements, which will not be the case here.
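
The ordering described above could be sketched roughly as follows, in Go. This is only an illustration under assumptions: `Node` and `planResume` are hypothetical names, not cass-operator code, and the sketch only computes the start order (each phase is a batch whose members could be started concurrently) without touching any Kubernetes resources.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a Cassandra pod.
type Node struct {
	Name string
	Seed bool
}

// planResume returns the start order as a list of phases. Every seed gets
// its own phase (seeds start one by one); all remaining nodes share a
// single final phase, meaning they may be started in parallel.
func planResume(nodes []Node) [][]Node {
	var phases [][]Node
	var rest []Node
	for _, n := range nodes {
		if n.Seed {
			phases = append(phases, []Node{n})
		} else {
			rest = append(rest, n)
		}
	}
	if len(rest) > 0 {
		phases = append(phases, rest)
	}
	return phases
}

func main() {
	nodes := []Node{
		{Name: "rack1-0", Seed: true},
		{Name: "rack1-1", Seed: false},
		{Name: "rack2-0", Seed: true},
		{Name: "rack2-1", Seed: false},
	}
	for i, phase := range planResume(nodes) {
		fmt.Printf("phase %d:", i+1)
		for _, n := range phase {
			fmt.Printf(" %s", n.Name)
		}
		fmt.Println()
	}
}
```

A real implementation would also have to wait for each phase to become ready before starting the next one.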

adejanovski avatar Jun 13 '22 09:06 adejanovski

@bradfordcp, can we move the ticket to the product backlog or does it require a design session?

adejanovski avatar Jun 13 '22 09:06 adejanovski

@rchernobelskiy Is this still a necessary feature?

burmanm avatar Mar 05 '24 08:03 burmanm

From my personal perspective I still believe it would be a good feature to have.

rchernobelskiy avatar Mar 05 '24 13:03 rchernobelskiy

I agree, there have been multiple incidents that were due to nodes which are already part of the ring being blocked from starting by cass-operator because another node was bootstrapping (which can take a while).

What we need to identify is whether a node has previously bootstrapped and, if so, allow it to start concurrently with other nodes, provided we have at least one available seed node. We should spell this process out in more detail so that we can list precisely the conditions that need to be met to enable this behavior.
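
As a starting point for that list, the two conditions named so far can be written as a small Go predicate. This is a sketch, not operator code: `NodeState`, its `Bootstrapped` field, and `canStartConcurrently` are hypothetical, and how the operator would actually learn that a node has already bootstrapped is left open here.

```go
package main

import "fmt"

// NodeState is a hypothetical summary of what the operator knows about a pod.
type NodeState struct {
	Name string
	// Bootstrapped is assumed to be true when the node has already joined
	// the ring in the past; how this is detected is not specified.
	Bootstrapped bool
}

// canStartConcurrently reports whether a node may start alongside other
// nodes: it must have bootstrapped before, and at least one seed node
// must currently be available.
func canStartConcurrently(node NodeState, availableSeeds int) bool {
	return node.Bootstrapped && availableSeeds >= 1
}

func main() {
	existing := NodeState{Name: "rack1-1", Bootstrapped: true}
	fresh := NodeState{Name: "rack1-2", Bootstrapped: false}

	fmt.Println(canStartConcurrently(existing, 1)) // previously bootstrapped, seed available
	fmt.Println(canStartConcurrently(existing, 0)) // no seed available
	fmt.Println(canStartConcurrently(fresh, 1))    // still needs to bootstrap
}
```

Nodes that fail the check would fall back to the existing one-at-a-time start path.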

adejanovski avatar Jun 25 '24 14:06 adejanovski