
[Regression] Decommission is broken

Open keith-mcclellan opened this issue 3 years ago • 18 comments

The StatefulSet is stopping the pod before cockroach node decommission finishes executing, so the node shows up as failed instead of as decommissioned.

The PVC also gets deleted, so if the scale-down caused a loss of quorum there is no way to recover from this state because the data is destroyed.

Steps to reproduce:

1. Create a 4-node cluster
2. Change the node count to 3 nodes

(attachments: Screen Shot 2021-05-28 at 12 26 31 PM, decommission-regression.log)

keith-mcclellan avatar May 28 '21 16:05 keith-mcclellan

the PVC gets deleted as well so there is no way to recover from this state.

keith-mcclellan avatar May 28 '21 16:05 keith-mcclellan

> the PVC gets deleted as well so there is no way to recover from this state.

@keith-mcclellan so we keep the PVCs, and we delete them only if the decommission command is successful? @chrisseto I moved the Prune call to after the downscale code. The reason that the decommission gave an error is still to be determined.

// Before doing any scaling, prune any PVCs that are not currently in use.
	// This only needs to be done when scaling up but the operation is a noop
	// if there are no PVCs not currently in use.
	// As of v20.2.0, CRDB nodes may not be recommissioned. To account for
	// this, PVCs must be removed (pruned) before scaling up to avoid reusing a
	// previously decommissioned node.
	// Prune MUST be called before scaling as older clusters may have dangling
	// PVCs.
	// All underlying PVs and the storageclasses they were created with should
	// make use of reclaim policy = delete. A reclaim policy of retain is fine
	// but will result in wasted money, recycle should be considered unsafe and
	// is officially deprecated by kubernetes.
	if err := s.PVCPruner.Prune(ctx); err != nil {
		return errors.Wrap(err, "initial PVC pruning")
	}
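
To make the "keep the PVCs unless the decommission succeeded" ordering explicit, a minimal sketch, assuming interfaces shaped like the Scaler and PVCPruner used above (the interface and function names here are illustrative, not the operator's real API):

package scale

import (
	"context"

	"github.com/cockroachdb/errors"
)

// Downscaler and Pruner stand in for the operator's Scaler and PVCPruner;
// only the method names used in the snippet above are reused here.
type Downscaler interface {
	EnsureScale(ctx context.Context) error
}

type Pruner interface {
	Prune(ctx context.Context) error
}

// ensureScaleThenPrune orders the two operations so that PVCs survive a
// failed decommission: Prune only runs once EnsureScale (which performs
// the decommission) has returned successfully.
func ensureScaleThenPrune(ctx context.Context, d Downscaler, p Pruner) error {
	if err := d.EnsureScale(ctx); err != nil {
		// Decommission failed: leave the PVCs intact so the data is recoverable.
		return errors.Wrap(err, "ensuring scale")
	}
	return errors.Wrap(p.Prune(ctx), "pruning PVCs after downscale")
}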

alinadonisa avatar May 31 '21 18:05 alinadonisa

Decommission should run as follows functionally (a Go sketch follows the list):

  1. Validate that the node count after decommission is still >= 3 (node decommissions CAN be run in parallel)
  2. Run cockroach node decommission
  3. If it fails, annotate the CR to stop further runs without user input, log an error, and reset the node count to the original amount *** optionally, we should roll back the decommission with a recommission command ***
  4. If successful, wait 60 seconds and run cockroach node status --decommission (or optionally cockroach node status --all) to validate that the node is decommissioned and the database is ready for it to exit the cluster
  5. If cockroach node status --decommission does not show the node as decommissioned, do the same as #3
  6. Stop the decommissioned pod gracefully (pre-stop hook etc., same as a rolling restart)
  7. Run the health checker to validate that we have 0 under-replicated ranges, using the same pattern as a rolling restart
  8. Scale down the StatefulSet and delete the PVC
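
Roughly, in Go, assuming a hypothetical Cluster interface (none of these method names are the operator's real API; this is just the control flow above made concrete):

package actor

import (
	"context"
	"time"

	"github.com/cockroachdb/errors"
)

// Cluster is a hypothetical interface wrapping the cluster operations the
// steps above need; every method name here is an assumption.
type Cluster interface {
	NodeCount() int
	Decommission(ctx context.Context, nodeID int) error
	AnnotateDecommissionFailed(ctx context.Context, nodeID int)
	ResetNodeCount(ctx context.Context)
	IsDecommissioned(ctx context.Context, nodeID int) (bool, error)
	StopPodGracefully(ctx context.Context, nodeID int) error
	WaitForZeroUnderReplicated(ctx context.Context) error
	ScaleDownAndDeletePVC(ctx context.Context, nodeID int) error
}

func decommissionNode(ctx context.Context, c Cluster, nodeID int) error {
	// 1. Never scale below a 3-node cluster.
	if c.NodeCount()-1 < 3 {
		return errors.New("refusing to scale below 3 nodes")
	}
	// 2. Run cockroach node decommission.
	if err := c.Decommission(ctx, nodeID); err != nil {
		// 3. Annotate the CR to block further runs and restore the count.
		c.AnnotateDecommissionFailed(ctx, nodeID)
		c.ResetNodeCount(ctx)
		return errors.Wrapf(err, "decommissioning node %d", nodeID)
	}
	// 4. Wait, then verify with cockroach node status --decommission.
	time.Sleep(60 * time.Second)
	if done, err := c.IsDecommissioned(ctx, nodeID); err != nil || !done {
		// 5. Same handling as step 3.
		c.AnnotateDecommissionFailed(ctx, nodeID)
		c.ResetNodeCount(ctx)
		return errors.Newf("node %d did not report as decommissioned", nodeID)
	}
	// 6. Stop the pod gracefully (pre-stop hook, as in a rolling restart).
	if err := c.StopPodGracefully(ctx, nodeID); err != nil {
		return err
	}
	// 7. Validate 0 under-replicated ranges, rolling-restart style.
	if err := c.WaitForZeroUnderReplicated(ctx); err != nil {
		return err
	}
	// 8. Scale down the StatefulSet and delete the node's PVC.
	return c.ScaleDownAndDeletePVC(ctx, nodeID)
}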

Decommission tests:

Positive case 1 - After cockroach node decommission is run, health-checker should show 0 under-replicated ranges AND cockroach node status --decommission should show the node as decommissioned.

Positive case 2 - Annotate the CR and validate that the decommission actor doesn't run on a CR update

Negative case 1 - While decommission is running, stop the pod being decommissioned. This should cause decommission to fail. We should verify that the rest of the decommission process doesn't proceed and that the annotation is set.

Negative case 2 - Change node count after annotation is set, operator should throw an error

@udnay

keith-mcclellan avatar Jun 01 '21 15:06 keith-mcclellan

ref: https://www.cockroachlabs.com/docs/v21.1/cockroach-node.html
ref 2: https://www.cockroachlabs.com/docs/v21.1/cockroach-node.html#flags
ref 3: https://www.cockroachlabs.com/docs/v21.1/remove-nodes.html

keith-mcclellan avatar Jun 01 '21 15:06 keith-mcclellan

> Positive case 1 - After cockroach node decommission is run, health-checker should show 0 under-replicated ranges AND cockroach node status --decommission should show the node as decommissioned.

Is the command run on the node itself?

> Positive case 2 - Annotate the CR and validate that the decommission actor doesn't run on a CR update

This should be a unit test; I think we just need to test the handle command for the actor.

> Negative case 2 - Change node count after annotation is set, operator should throw an error

This should also be a unit test.
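
For positive case 2, a minimal shape for that unit test, using stand-in types since the real CR and actor live in the operator (the annotation key and Handles predicate are hypothetical):

package actor_test

import "testing"

// fakeCluster and decommissionActor are stand-ins for the operator's CR
// and actor types; only the shape of the check matters here.
type fakeCluster struct{ Annotations map[string]string }

type decommissionActor struct{}

// Handles loosely mirrors an actor's "should I run?" predicate: it refuses
// to act once the (hypothetical) failure annotation has been set.
func (decommissionActor) Handles(c *fakeCluster) bool {
	return c.Annotations["crdb.io/decommission-failed"] != "true"
}

func TestDecommissionActorSkipsAnnotatedCR(t *testing.T) {
	cr := &fakeCluster{Annotations: map[string]string{
		"crdb.io/decommission-failed": "true",
	}}
	if (decommissionActor{}).Handles(cr) {
		t.Fatal("decommission actor should not run once the failure annotation is set")
	}
}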

udnay avatar Jun 01 '21 16:06 udnay

> Positive case 1 - After cockroach node decommission is run, health-checker should show 0 under-replicated ranges AND cockroach node status --decommission should show the node as decommissioned.

> Is the command run on the node itself?

You can, but it accepts a remote host, so you can run it from anywhere that you have the database binary.

keith-mcclellan avatar Jun 01 '21 16:06 keith-mcclellan

cockroach node status --decommission --certs-dir=certs --host=<address of any live node>

 id |        address         |  build  |            started_at            |            updated_at            | is_available | is_live | gossiped_replicas | is_decommissioning | is_draining  
+---+------------------------+---------+----------------------------------+----------------------------------+--------------+---------+-------------------+--------------------+-------------+
  1 | 165.227.60.76:26257    | 91a299d | 2018-10-01 16:53:10.946245+00:00 | 2018-10-02 14:04:39.280249+00:00 |         true |  true   |                26 |       false        |    false     
  2 | 192.241.239.201:26257  | 91a299d | 2018-10-01 16:53:24.22346+00:00  | 2018-10-02 14:04:39.415235+00:00 |         true |  true   |                26 |       false        |    false     
  3 | 67.207.91.36:26257     | 91a299d | 2018-10-01 17:34:21.041926+00:00 | 2018-10-02 14:04:39.233882+00:00 |         true |  true   |                25 |       false        |    false     
  4 | 138.197.12.74:26257    | 91a299d | 2018-10-01 17:09:11.734093+00:00 | 2018-10-02 14:04:37.558204+00:00 |         true |  true   |                25 |       false        |    false     
  5 | 174.138.50.192:26257   | 91a299d | 2018-10-01 17:14:01.480725+00:00 | 2018-10-02 14:04:39.293121+00:00 |         true |  true   |                 0 |        true        |    false   

This is an example of a decommissioned node that is ready to be stopped: is_decommissioning is true and gossiped_replicas = 0 means it's done. We can then gracefully stop the pod.
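
The same check can also be done over a SQL connection instead of parsing CLI output. A sketch; crdb_internal is an internal, unversioned schema, so the table and column names here are assumptions that may change between versions:

package healthcheck

import (
	"context"
	"database/sql"
)

// isFullyDecommissioned reports whether a node is marked decommissioning
// and gossips zero replicas, mirroring the is_decommissioning and
// gossiped_replicas columns in the output above.
func isFullyDecommissioned(ctx context.Context, db *sql.DB, nodeID int) (bool, error) {
	var decommissioning bool
	var ranges int
	err := db.QueryRowContext(ctx, `
		SELECT l.decommissioning, n.ranges
		FROM crdb_internal.gossip_liveness AS l
		JOIN crdb_internal.gossip_nodes AS n USING (node_id)
		WHERE node_id = $1`, nodeID).Scan(&decommissioning, &ranges)
	if err != nil {
		return false, err
	}
	return decommissioning && ranges == 0, nil
}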

keith-mcclellan avatar Jun 01 '21 17:06 keith-mcclellan

@udnay Please document what the correct workflow is for decommission. We are getting differing opinions.

chrislovecnm avatar Jun 01 '21 18:06 chrislovecnm

Weighing in on behalf of @udnay and at the request of @alinadonisa.

The logic in EnsureScale is the same logic that we use in Cockroach Cloud, which has been pretty well battle tested at this point. The one notable difference is that we remove Kubernetes nodes in the CC version, but that doesn't affect the core logic.

From what I can tell, that logic simply isn't running to completion or isn't running at all. Does anyone have the logs of a failed decommission available? The decommissioner is very verbose; it should be pretty easy to tell where something is going wrong based on the logs.

The PVC pruner will only remove the volumes of pods that are not currently running and have an ordinal less than the number of desired replicas. It sounds like something is changing the desired number of replicas outside of the call to EnsureScale.

It does everything that Keith has suggested, sans the under-replicated check, but that could be easily plugged into the WaitUntilHealthy function.
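
For reference, that plugged-in check could look something like this; the query is an assumption (per-store metrics exposed via crdb_internal.kv_store_status, an internal schema that may change), not the operator's actual health checker:

package healthcheck

import (
	"context"
	"database/sql"
)

// zeroUnderReplicated checks that no ranges are under-replicated before a
// scale-down proceeds, the same gate a rolling restart would use.
func zeroUnderReplicated(ctx context.Context, db *sql.DB) (bool, error) {
	var n int
	err := db.QueryRowContext(ctx, `
		SELECT coalesce(sum((metrics->>'ranges.underreplicated')::INT), 0)::INT
		FROM crdb_internal.kv_store_status`).Scan(&n)
	if err != nil {
		return false, err
	}
	return n == 0, nil
}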

chrisseto avatar Jun 01 '21 19:06 chrisseto

Looking at the attached logs, I see:

{"level":"error","ts":1622218735.5067515,"logger":"action","msg":"decomission failed","action":"decommission","CrdbCluster":"default/crdb-tls-example","error":"failed to start draining node 4: failed to stream execution results back: command terminated with exit code 1","errorVerbose":"failed to start draining node 4: failed to stream execution results back: command terminated with exit code 1\n(1) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).executeDrainCmd\n  | \tpkg/scale/drainer.go:208\n  | [...repeated from below...]\nWraps: (2) failed to start draining node 4\nWraps: (3) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.CockroachExecutor.Exec\n  | \tpkg/scale/executor.go:57\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).executeDrainCmd\n  | \tpkg/scale/drainer.go:207\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).Decommission\n  | \tpkg/scale/drainer.go:86\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*Scaler).EnsureScale\n  | \tpkg/scale/scale.go:87\n  | github.com/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n  | \tpkg/actor/decommission.go:169\n  | github.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n  | \tpkg/controller/cluster_controller.go:130\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.UntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\n  | runtime.goexit\n  | \tsrc/runtime/asm_amd64.s:1371\nWraps: (4) failed to stream execution results back\nWraps: (5) command terminated with exit code 1\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) 
exec.CodeExitError","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\ngithub.com/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n\tpkg/actor/decommission.go:171\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1622218735.5077403,"logger":"controller.CrdbCluster","msg":"Running action with index: 3 and  name: PartialUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218735.5077748,"logger":"action","msg":"checking update opportunities, using a partitioned update","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218735.5079463,"logger":"action","msg":"no version changes needed","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218735.5079763,"logger":"controller.CrdbCluster","msg":"Running action with index: 4 and  name: ResizePVC","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218735.5095117,"logger":"action","msg":"Skipping PVC resize as sizes match","action":"resize_pvc","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218735.5095623,"logger":"controller.CrdbCluster","msg":"Running action with index: 5 and  name: Deploy","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218735.5095706,"logger":"action","msg":"reconciling resources on deploy action","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
26257

I see a failure in the decommission code, maybe due to port forwarding or something because the operator isn't running in-cluster: "failed to start draining node 4: failed to stream execution results back: command terminated with exit code 1"

Then the deploy actor starts to run and I see {"level":"info","ts":1622218735.5339012,"logger":"action","msg":"created/updated statefulset, stopping request processing","action":"deploy","CrdbCluster":"default/crdb-tls-example"}, which seems to be lines 116-134.

What does the reconciler do? Can it be shutting down the extra pod after the decommission failed?

udnay avatar Jun 01 '21 20:06 udnay

After this, the logs show decommission failing because not all replicas are up. I believe @chrisseto is probably correct, or at least onto something.

udnay avatar Jun 01 '21 20:06 udnay

Seems like the error handling for failed decommissioning is busted? Though I'm not sure why the decommission command would fail...

chrisseto avatar Jun 01 '21 20:06 chrisseto

I'm not questioning that the CC drainer works properly; I'm questioning whether we implemented it properly. Something is stopping the pod before the decommission is complete... see:

{"level":"warn","ts":1622218688.9847136,"logger":"action","msg":"reconciling resources on deploy action","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
26257
{"level":"info","ts":1622218689.0102832,"logger":"action","msg":"deployed database","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.010324,"logger":"controller.CrdbCluster","msg":"Running action with index: 7 and  name: ClusterRestart","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0103347,"logger":"action","msg":"starting cluster restart action","action":"Crdb Cluster Restart","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.010339,"logger":"action","msg":"No restart cluster action","action":"Crdb Cluster Restart","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.015668,"logger":"controller.CrdbCluster","msg":"reconciliation completed","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0157423,"logger":"controller.CrdbCluster","msg":"reconciling CockroachDB cluster","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0158072,"logger":"controller.CrdbCluster","msg":"Running action with index: 0 and  name: Decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.015832,"logger":"action","msg":"check decommission oportunities","action":"decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0158622,"logger":"action","msg":"replicas decommisioning","action":"decommission","CrdbCluster":"default/crdb-tls-example","status.CurrentReplicas":4,"expected":4}
{"level":"info","ts":1622218689.0158727,"logger":"controller.CrdbCluster","msg":"Running action with index: 3 and  name: PartialUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0158768,"logger":"action","msg":"checking update opportunities, using a partitioned update","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0158906,"logger":"action","msg":"no version changes needed","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0159137,"logger":"controller.CrdbCluster","msg":"Running action with index: 4 and  name: ResizePVC","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0167642,"logger":"action","msg":"Skipping PVC resize as sizes match","action":"resize_pvc","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0167942,"logger":"controller.CrdbCluster","msg":"Running action with index: 5 and  name: Deploy","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0168018,"logger":"action","msg":"reconciling resources on deploy action","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
26257
{"level":"info","ts":1622218689.0354652,"logger":"action","msg":"deployed database","action":"deploy","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0355067,"logger":"controller.CrdbCluster","msg":"Running action with index: 7 and  name: ClusterRestart","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0355196,"logger":"action","msg":"starting cluster restart action","action":"Crdb Cluster Restart","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0355241,"logger":"action","msg":"No restart cluster action","action":"Crdb Cluster Restart","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218689.0407238,"logger":"controller.CrdbCluster","msg":"reconciliation completed","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218733.8642936,"logger":"controller.CrdbCluster","msg":"reconciling CockroachDB cluster","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218733.8644145,"logger":"controller.CrdbCluster","msg":"Running action with index: 0 and  name: Decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218733.8644257,"logger":"action","msg":"check decommission oportunities","action":"decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218733.8644555,"logger":"action","msg":"replicas decommisioning","action":"decommission","CrdbCluster":"default/crdb-tls-example","status.CurrentReplicas":4,"expected":3}
{"level":"warn","ts":1622218733.8682542,"logger":"action","msg":"operator is running inside of kubernetes, connecting to service for db connection","action":"decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218733.892141,"logger":"action","msg":"opened db connection","action":"decommission","CrdbCluster":"default/crdb-tls-example"}
{"level":"info","ts":1622218733.9547343,"logger":"action","msg":"established statefulset watch","action":"decommission","name":"crdb-tls-example","namespace":"default"}
{"level":"warn","ts":1622218733.9649646,"logger":"action","msg":"scaling down stateful set","action":"decommission","have":4,"want":3}
{"level":"info","ts":1622218734.250925,"logger":"action","msg":"draining node","action":"decommission","NodeID":4}
{"level":"error","ts":1622218735.5067515,"logger":"action","msg":"decomission failed","action":"decommission","CrdbCluster":"default/crdb-tls-example","error":"failed to start draining node 4: failed to stream execution results back: command terminated with exit code 1","errorVerbose":"failed to start draining node 4: failed to stream execution results back: command terminated with exit code 1\n(1) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).executeDrainCmd\n  | \tpkg/scale/drainer.go:208\n  | [...repeated from below...]\nWraps: (2) failed to start draining node 4\nWraps: (3) attached stack trace\n  -- stack trace:\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.CockroachExecutor.Exec\n  | \tpkg/scale/executor.go:57\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).executeDrainCmd\n  | \tpkg/scale/drainer.go:207\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*CockroachNodeDrainer).Decommission\n  | \tpkg/scale/drainer.go:86\n  | github.com/cockroachdb/cockroach-operator/pkg/scale.(*Scaler).EnsureScale\n  | \tpkg/scale/scale.go:87\n  | github.com/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n  | \tpkg/actor/decommission.go:169\n  | github.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n  | \tpkg/controller/cluster_controller.go:130\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\n  | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n  | \texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\n  | k8s.io/apimachinery/pkg/util/wait.BackoffUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntil\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\n  | k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\n  | k8s.io/apimachinery/pkg/util/wait.UntilWithContext\n  | \texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\n  | runtime.goexit\n  | \tsrc/runtime/asm_amd64.s:1371\nWraps: (4) failed to stream execution results back\nWraps: (5) command terminated with exit code 1\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) 
exec.CodeExitError","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\ngithub.com/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n\tpkg/actor/decommission.go:171\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1622218735.5077403,"logger":"controller.CrdbCluster","msg":"Running action with index: 3 and  name: PartialUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218735.5077748,"logger":"action","msg":"checking update opportunities, using a partitioned update","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}

I think the problem is the opposite of what @chrisseto describes - I think another actor is running, and that other actor is stopping the pod because it sees that nodes = 3 in the CR instead of nodes = 4 and thinks it should shut it down. But because decommission is running, the decommission fails and never gets restarted, which leaves the node showing up as failed.

If I'm reading this right, it's this that's stopping the pod that we're waiting to decommission:

{"level":"info","ts":1622218689.0158727,"logger":"controller.CrdbCluster","msg":"Running action with index: 3 and  name: PartialUpdate","CrdbCluster":"default/crdb-tls-example"}
{"level":"warn","ts":1622218689.0158768,"logger":"action","msg":"checking update opportunities, using a partitioned update","action":"partitionedUpdate","CrdbCluster":"default/crdb-tls-example"}

Decommission should be a blocking operation - i.e., the operator should not do any other work until the decommission is complete. And if the decommission fails, we shouldn't allow the PVC pruner to run.
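
Roughly, in the reconcile loop, that blocking behavior could look like this; the actor interface and names below are illustrative, not the operator's real control flow:

package controller

import (
	"context"
	"fmt"
)

// actor is a stand-in for the operator's actor interface.
type actor interface {
	Name() string
	Handles(cluster interface{}) bool
	Act(ctx context.Context, cluster interface{}) error
}

// reconcileActors makes decommission blocking: once the decommission actor
// runs, the reconcile pass ends so no other actor (PartialUpdate, Deploy,
// the PVC pruner) can race with it, and a failure halts everything.
func reconcileActors(ctx context.Context, cluster interface{}, actors []actor) error {
	for _, a := range actors {
		if !a.Handles(cluster) {
			continue
		}
		err := a.Act(ctx, cluster)
		if a.Name() == "decommission" {
			if err != nil {
				// A failed decommission stops the pass entirely; in
				// particular, the PVC pruner never runs here.
				return fmt.Errorf("decommission failed, blocking further actors: %w", err)
			}
			// Even on success, end the pass; remaining actors run on the
			// next reconcile once the cluster has settled.
			return nil
		}
		if err != nil {
			return err
		}
	}
	return nil
}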

keith-mcclellan avatar Jun 01 '21 21:06 keith-mcclellan

@udnay this needs manual testing, but removing the release blocker.

chrislovecnm avatar Jun 17 '21 15:06 chrislovecnm

Any updates on this? I'm still facing the same issue in v22.1.2.

sabuhigr avatar Oct 05 '22 22:10 sabuhigr

@prafull01 @himanshu-cockroach can one of you take a look?

udnay avatar Aug 09 '23 13:08 udnay

I will take a look

prafull01 avatar Aug 09 '23 16:08 prafull01

I have tried this, and on the latest version I am not able to reproduce the issue.

In the UI it shows 3 Live nodes and 1 Decommissioned node. (screenshot attached: Screenshot 2023-08-14 at 2 49 14 PM)

I can see the additional PVC on the cluster as well. (screenshot attached: Screenshot 2023-08-14 at 2 51 52 PM)

I have tested this on cockroach operator version v2.11.0.

prafull01 avatar Aug 14 '23 09:08 prafull01