
Cluster restarts in sequence when config changes, rather than in parallel

Open · peeveen opened this issue 8 months ago · 0 comments

I have a Helm chart containing a `CrdbCluster` resource.

This Helm chart creates certificate secrets (CA, node, root), then uses the names of those secrets in the `nodeTLSSecret` and `clientTLSSecret` values of the `CrdbCluster` resource config.
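
The relevant part of the resource looks something like this (a minimal sketch; the secret names are illustrative, and the field names are as I understand them from the operator's v1alpha1 CRD):

```yaml
# Minimal sketch of the CrdbCluster resource; the secret names here are
# placeholders for the secrets the chart creates.
apiVersion: crdb.cockroachlabs.com/v1alpha1
kind: CrdbCluster
metadata:
  name: cockroachdb
spec:
  nodes: 3
  nodeTLSSecret: my-node-cert
  clientTLSSecret: my-client-cert
```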

If the secrets already exist (found using Helm's `lookup` function), the existing certificate data is reused. This way, repeating the `helm install` or `helm upgrade` commands won't keep creating new certificates.
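
Roughly like this, for the CA secret (a sketch, assuming an illustrative secret name `my-cockroach-ca`; the chart does the same for the node and root client certs):

```yaml
{{- /* Reuse the existing secret's data if it is already present in the
       release namespace; otherwise generate a fresh CA with Sprig's genCA. */}}
{{- $existing := lookup "v1" "Secret" .Release.Namespace "my-cockroach-ca" }}
apiVersion: v1
kind: Secret
metadata:
  name: my-cockroach-ca
data:
  {{- if $existing }}
  ca.crt: {{ index $existing.data "ca.crt" }}
  ca.key: {{ index $existing.data "ca.key" }}
  {{- else }}
  {{- $ca := genCA "my-cockroach-ca" 3650 }}
  ca.crt: {{ $ca.Cert | b64enc }}
  ca.key: {{ $ca.Key | b64enc }}
  {{- end }}
```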

However, if I manually delete the CA certificate secret, the chart will generate all-new certificate secrets, albeit with the same names. This means that the `CrdbCluster` config does not change, so the DB pods remain running with the old certificate data in their `cockroach-certs` folders. This is not what I'm aiming for; I want the DB pods to be rebuilt when the certificates change.

So I added some `additionalAnnotations` to the `CrdbCluster` resource, containing the SHA1 hashes of the certificate data. This looked like it was going to work, but something (the operator?) attempts to restart the cluster pods one by one rather than all at once. The first cluster pod restarts (with the new certificate data inside) but never reports as "healthy", because it needs to contact the other cluster pods and fails to do so while they still contain the old certificate data. The operator doesn't seem to want to move on to rebuilding the next DB pod until this first one reports as healthy. The only way I can get them all to restart is to delete them manually.
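
The annotations are along these lines (a sketch; `certs.yaml` is a hypothetical template file containing the rendered certificate secrets):

```yaml
apiVersion: crdb.cockroachlabs.com/v1alpha1
kind: CrdbCluster
metadata:
  name: cockroachdb
spec:
  additionalAnnotations:
    # Hash of the rendered certificate secrets; this value changes whenever
    # the certificate data changes, which should trigger a pod rollout.
    checksum/certs: {{ include (print $.Template.BasePath "/certs.yaml") . | sha1sum }}
```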

Is it possible for the operator to restart them all at once? (Actually, I'm assuming it's the operator that's controlling this, and not some fundamental k8s component ... maybe you could confirm this.)

peeveen · Feb 05 '25 08:02