
Open keith-mcclellan opened this issue 4 years ago • 1 comment

Yes, the store will be different for each node. We just need to make sure all of the per-store series have a value of 0.

Proposed state for all rolling restarts (regardless of cluster size):

  1. We update one pod at a time.
  2. cockroach node drain is sent first via a pre-stop hook, initiating a graceful drain; after 1m, SIGKILL is sent.
  3. We wait on the k8s readiness check (which means /health?ready=1) of ALL pods in the cluster.
  4. We check _status/vars on all CockroachDB pods, looking for series like ranges_underreplicated{store="1"} 0, and if any are non-zero we wait until all are 0, rechecking every 10 seconds (see the sketch after this list).
  5. Once the above is true, we wait x seconds and check a second time. 10 seconds is the period within which updated Prometheus stats are guaranteed, so that should be the minimum. Currently we wait 1 minute; without an additional health check, the default in CC is 3 minutes. We want the user to be able to set the safety window, and we will document the tradeoffs of a longer window.
  6. Once these checks complete, we update the next pod.
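
For illustration, a minimal Go sketch of the per-pod check behind step 4, assuming a plain HTTP scrape of /_status/vars; the function name, URL shape, and line-based parsing are placeholders rather than the operator's actual code:

```go
package example

import (
	"bufio"
	"context"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// allRangesFullyReplicated sketches the step-4 check for a single pod:
// scrape /_status/vars and report whether every
// ranges_underreplicated{store="..."} series is zero. baseURL is a
// placeholder such as "http://cockroachdb-0.cockroachdb:8080"; a secure
// cluster would need an http.Client configured with the node certs.
func allRangesFullyReplicated(ctx context.Context, client *http.Client, baseURL string) (bool, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/_status/vars", nil)
	if err != nil {
		return false, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected status %s from _status/vars", resp.Status)
	}

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Only look at the per-store gauge lines, e.g.
		//   ranges_underreplicated{store="3"} 0
		// and skip the # HELP / # TYPE comments and all other metrics.
		if !strings.HasPrefix(line, "ranges_underreplicated{") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		value, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return false, fmt.Errorf("parsing %q: %v", line, err)
		}
		if value != 0 {
			return false, nil
		}
	}
	if err := scanner.Err(); err != nil {
		return false, err
	}
	return true, nil
}
```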

@udnay thoughts? @tbg would love your input as well

@keith-mcclellan after exec'ing into any pod of the cluster and fetching the ranges_underreplicated metric, I noticed that the store label is actually the id of the pod. Just to make sure: is this correct?

curl -Lk localhost:8080/_status/vars --silent | grep -i  ranges_underreplicated
# HELP ranges_underreplicated Number of ranges with fewer live replicas than the replication target
# TYPE ranges_underreplicated gauge
ranges_underreplicated{store="3"} 0

Originally posted by @alinadonisa in https://github.com/cockroachdb/cockroach-operator/issues/481#issuecomment-842100403

keith-mcclellan · May 17 '21 14:05

From what I'm hearing in CC, the 1-minute timeout will not reliably drain the node, but no timeout will; there seem to be cases where we don't drain successfully. While that is the case, a 1-minute timeout seems reasonable.
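
For reference, a hedged sketch of how step 2 and the 1-minute window could map onto the pod spec; the binary path, certs dir, image tag, and the LifecycleHandler type (k8s.io/api v1.23+) are assumptions, not taken from the operator's templates:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
)

// drainPodSpec sketches where the pieces of step 2 live in a pod spec: the
// pre-stop hook runs cockroach node drain, and the kubelet escalates to
// SIGKILL once terminationGracePeriodSeconds (the ~1m discussed here) runs
// out. Paths, flags, and the image tag are illustrative only.
func drainPodSpec() corev1.PodSpec {
	gracePeriod := int64(60) // roughly the 1m window before SIGKILL
	return corev1.PodSpec{
		TerminationGracePeriodSeconds: &gracePeriod,
		Containers: []corev1.Container{{
			Name:  "db",
			Image: "cockroachdb/cockroach:v21.1.0", // illustrative tag
			Lifecycle: &corev1.Lifecycle{
				PreStop: &corev1.LifecycleHandler{
					Exec: &corev1.ExecAction{
						Command: []string{
							"/cockroach/cockroach", "node", "drain",
							"--certs-dir=/cockroach/cockroach-certs",
						},
					},
				},
			},
		}},
	}
}
```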

I just verified that hitting the _status/vars endpoint actually refreshes the metrics, so you can change the 10s in step 4 to any other number (just not too low; I wouldn't drop below 5s, to be conservative, since computing the metrics is slightly expensive).
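
If the poll interval becomes user-configurable, a tiny sketch of enforcing that floor (the 5s constant comes from this comment, not from CRDB or the operator):

```go
package example

import "time"

// minMetricsPollInterval is the conservative floor suggested here:
// _status/vars recomputes the metrics on every hit, so don't hammer it.
const minMetricsPollInterval = 5 * time.Second

// clampPollInterval raises a user-requested interval to that floor. The
// name and placement are illustrative, not from the operator.
func clampPollInterval(requested time.Duration) time.Duration {
	if requested < minMetricsPollInterval {
		return minMetricsPollInterval
	}
	return requested
}
```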

Have you thought about the error handling in step 4? Absent buglets in CRDB, you should never see anything nonzero there unless other bad things are happening in the cluster (nodes crashing, etc.). Would you retry here forever or eventually abort the upgrade? Though aborting also means restarting nodes, so it's one of those situations in which every direction you can go hurts, so maybe retrying is the only sane option.
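
One hedged way to leave both options open: retry forever by default, and only give up when an operator-configured deadline is set. checkFn stands in for a per-pod check like the earlier sketch; none of these names come from the actual codebase:

```go
package example

import (
	"context"
	"time"
)

// waitForFullReplication polls checkFn until it reports that all
// ranges_underreplicated gauges are zero. With maxWait == 0 it retries
// forever (the "every direction hurts" option); with maxWait > 0 it gives up
// and returns an error so a human can decide whether to abort the rollout.
// interval is assumed to be positive (see the clamp sketch above).
func waitForFullReplication(ctx context.Context,
	checkFn func(context.Context) (bool, error),
	interval, maxWait time.Duration) error {

	if maxWait > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, maxWait)
		defer cancel()
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		// Transient scrape errors and nonzero gauges are treated the same
		// way here: keep waiting.
		if ok, err := checkFn(ctx); err == nil && ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // maxWait elapsed or the operator is shutting down
		case <-ticker.C:
		}
	}
}
```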

I would fold step 5 into step 4 by saying that you periodically check the underreplicated ranges until it's been all zeroes for the last X minutes.
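
A sketch of that folding, under the same assumptions as the earlier snippets: keep polling and only move on once every check in the trailing window has come back all zeroes.

```go
package example

import (
	"context"
	"time"
)

// waitForStableReplication keeps polling checkFn and only returns once every
// check over the trailing stableFor window has come back all zeroes; any
// error or nonzero gauge resets the window. checkFn and the parameter names
// are placeholders, as in the sketches above.
func waitForStableReplication(ctx context.Context,
	checkFn func(context.Context) (bool, error),
	interval, stableFor time.Duration) error {

	var zeroSince time.Time // zero value: not currently in an all-zero run
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ok, err := checkFn(ctx)
		switch {
		case err != nil || !ok:
			zeroSince = time.Time{} // reset the window
		case zeroSince.IsZero():
			zeroSince = time.Now() // first all-zero check starts the window
		case time.Since(zeroSince) >= stableFor:
			return nil // all zeroes for the whole window; roll the next pod
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```

The window is measured from the first all-zero check, so stableFor behaves like the "last X minutes" above.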

Looks good overall.

tbg · May 17 '21 14:05