radondb-mysql-kubernetes

[bug] update too many pods at the same time.

Open runkecheng opened this issue 3 years ago • 2 comments

Describe the problem

When the configuration is updated, two nodes of the 3-node cluster are deleted and restarted at the same time, which makes the cluster temporarily unavailable. The correct behavior is to update only one node at a time.

To Reproduce

The default PodDisruptionBudget is 50%, so at least 2 of the 3 nodes must remain available. In practice, however, two nodes are updated at once when the configuration changes, which violates the PDB.
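
For reference, a minimal sketch of such a PDB built with the k8s.io/api/policy/v1 types (the object name and selector labels here are hypothetical, not necessarily what the operator creates): a minAvailable of 50% rounds up to 2 pods for a 3-replica cluster, so only one pod may be voluntarily disrupted at a time.

    import (
        policyv1 "k8s.io/api/policy/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/intstr"
    )

    // defaultPDB sketches a 50% budget: a 3-replica cluster must keep
    // ceil(3 * 0.5) = 2 pods available, i.e. at most one disruption at a time.
    func defaultPDB() *policyv1.PodDisruptionBudget {
        minAvailable := intstr.FromString("50%")
        return &policyv1.PodDisruptionBudget{
            ObjectMeta: metav1.ObjectMeta{Name: "sample-mysql-pdb"},
            Spec: policyv1.PodDisruptionBudgetSpec{
                MinAvailable: &minAvailable,
                Selector: &metav1.LabelSelector{
                    MatchLabels: map[string]string{"app": "sample-mysql"},
                },
            },
        }
    }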

Root cause:

The StatefulSetUpdateStrategy is OnDelete, so the operator deletes pods itself with logic like the following:

    // If the pod already carries the new revision, skip it; otherwise
    // delete it so the StatefulSet controller recreates it from the
    // updated template.
    if pod.ObjectMeta.Labels["controller-revision-hash"] == s.sfs.Status.UpdateRevision {
        log.Info("pod is already updated", "pod name", pod.Name)
    } else {
        ...
        if pod.DeletionTimestamp != nil {
            log.Info("pod is being deleted", "pod", pod.Name, "key", s.Unwrap())
        } else {
            if err := s.cli.Delete(ctx, pod); err != nil {
                return err
            }
        }
    }
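
For context, a minimal sketch of what the OnDelete strategy looks like on the StatefulSet spec, using the k8s.io/api/apps/v1 types (the helper function is hypothetical): with OnDelete, the built-in controller never replaces pods on its own, so pacing the rollout is entirely up to the loop above.

    import appsv1 "k8s.io/api/apps/v1"

    // applyOnDelete marks the StatefulSet so the built-in controller never
    // restarts pods by itself; the operator decides when each pod is deleted.
    func applyOnDelete(sfs *appsv1.StatefulSet) {
        sfs.Spec.UpdateStrategy = appsv1.StatefulSetUpdateStrategy{
            Type: appsv1.OnDeleteStatefulSetStrategyType,
        }
    }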

After a pod is deleted, the retry exits early because the healthy label of the pod being deleted is still "yes". The correct logic is to wait until the deleted pod is ready again before updating the next pod, for example by also checking the revision hash:

    // Keep waiting while a pod still reports healthy=yes but has not yet
    // been recreated on the new revision (its labels are stale).
    if pod.ObjectMeta.Labels["healthy"] == "yes" &&
        pod.ObjectMeta.Labels["controller-revision-hash"] != s.sfs.Status.UpdateRevision {
        return false, fmt.Errorf("pod %s is ready, wait next schedule", pod.Name)
    }
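
To make the waiting semantics concrete, here is a minimal sketch of a poll helper (an assumption for illustration; the operator's real retry helper may differ) in which a (false, err) result from the condition keeps the loop waiting instead of failing immediately:

    import (
        "fmt"
        "time"
    )

    // waitPodReady reruns cond every interval until it reports done or the
    // limit elapses. Returning (false, err) records the error and keeps
    // polling, so the check above simply delays the next pod's update.
    func waitPodReady(interval, limit time.Duration, cond func() (bool, error)) error {
        start := time.Now()
        var lastErr error
        for {
            done, err := cond()
            if done {
                return nil
            }
            if err != nil {
                lastErr = err
            }
            if time.Since(start) > limit {
                return fmt.Errorf("retry limit reached, last error: %v", lastErr)
            }
            time.Sleep(interval)
        }
    }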

Expected behavior

Only one pod is updated at a time: the next pod is deleted only after the previously deleted pod has been recreated and is ready again, so the cluster always satisfies its PodDisruptionBudget.

Environment:

  • RadonDB MySQL version:

runkecheng · Nov 25 '21 03:11


The pod obtained in Retry() may not be the latest, so the check can run against stale state.

The DeletionTimestamp additionally needs to be checked.

If the pod is being deleted, healthy should be treated as "no" and the other checks skipped.
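
A minimal sketch of that refined check, assuming a controller-runtime client like the s.cli field in the snippets above (the function name and signature are hypothetical): the pod is re-fetched so the check sees the latest state, and a pod whose DeletionTimestamp is set is treated as not healthy before any label checks run.

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    func podUpdatedAndReady(ctx context.Context, cli client.Client, pod *corev1.Pod, updateRevision string) (bool, error) {
        // Re-fetch the pod: the object held by the retry loop may be stale.
        latest := &corev1.Pod{}
        key := client.ObjectKey{Namespace: pod.Namespace, Name: pod.Name}
        if err := cli.Get(ctx, key, latest); err != nil {
            return false, err
        }
        // A pod that is being deleted still carries healthy=yes; treat it as
        // not healthy and skip the remaining checks.
        if latest.DeletionTimestamp != nil {
            return false, fmt.Errorf("pod %s is being deleted, wait next schedule", latest.Name)
        }
        // Done only when the recreated pod is on the new revision and ready.
        return latest.Labels["healthy"] == "yes" &&
            latest.Labels["controller-revision-hash"] == updateRevision, nil
    }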


runkecheng · May 27 '22 07:05