
Make VerneMQ node stop only instead of making it leave

ioolkos opened this issue 4 years ago · 5 comments

When a Pod running VerneMQ gets a SIGTERM from Kubernetes, it'll stop the VerneMQ node but also make it leave the cluster: https://github.com/vernemq/docker-vernemq/blob/21497c31d02b990a8d0522719b17dc7be08fe78d/bin/vernemq.sh#L160

This is probably not what we want in case Pods get rescheduled. In that case, we'd rather have those nodes simply stopped and restarted (reusing the same local persistent state).
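For illustration, a stop-only variant of the handler could look roughly like this. This is a minimal sketch, not the actual script: `$pid`, `$IP_ADDRESS`, and the exact shape of the handler are assumptions based on the linked line.

```bash
# Sketch of a stop-only SIGTERM handler (illustrative only; $pid and
# $IP_ADDRESS are assumed to be set elsewhere in the script).
sigterm_handler() {
    if [ "$pid" -ne 0 ]; then
        # Current behavior: leave the cluster, killing sessions (-k) so
        # clients reconnect elsewhere.
        # vmq-admin cluster leave node="VerneMQ@${IP_ADDRESS}" -k > /dev/null

        # Stop-only alternative: shut the node down but keep its cluster
        # membership and local persistent state, so a rescheduled Pod
        # can simply restart and rejoin.
        vmq-admin node stop > /dev/null
        wait "$pid"
    fi
    exit 143  # 128 + 15 (SIGTERM)
}
```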

Changing this will introduce a problem for the "scaling down" case: the stopped cluster node(s) won't be forgotten by the remaining cluster (but we could make them forget with vmq-admin calls).
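For example (a sketch; the node name is a placeholder), run from any remaining live node:

```bash
# Make the cluster forget an already-stopped node; run on any live node.
vmq-admin cluster leave node=VerneMQ@<stopped-pod-address>

# Verify the remaining membership.
vmq-admin cluster show
```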

Any thoughts?

— ioolkos, Oct 20 '20

@ioolkos We're running some tests on Kubernetes with a 3-node VerneMQ cluster that led to the issue described in vernemq/vernemq#1659

Following your suggestion, we built a custom VerneMQ Docker image, commenting out the line that instructs the node to leave the cluster before shutdown. Running the same tests, we were no longer able to reproduce the issue.
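For anyone wanting to reproduce this, the patch is essentially along these lines (a sketch; adjust the image tag and registry to your environment):

```bash
# Build a custom image with the "cluster leave" line commented out
# (sketch; the sed pattern assumes the line starts with the vmq-admin call).
git clone https://github.com/vernemq/docker-vernemq.git
cd docker-vernemq
sed -i 's/^\([[:space:]]*\)vmq-admin cluster leave/\1# vmq-admin cluster leave/' bin/vernemq.sh
docker build -t my-registry/vernemq:stop-only .
```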

We think this modification can help better manage VerneMQ deployments on k8s. One sure drawback is the need to manually make a node leave the cluster when a scale-down is intended.

Anyway, we still noted that after a scale down/up (3->2->3), the subscription counts reported for each node on the HTTP status page differ.
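One way to compare what each node reports (a sketch; the Pod names are placeholders, and `vmq-admin session show` may need filtering with many connected clients):

```bash
# Compare cluster membership and session/subscription state per node.
for pod in vernemq-0 vernemq-1 vernemq-2; do
  echo "== $pod =="
  kubectl exec "$pod" -- vmq-admin cluster show
  kubectl exec "$pod" -- vmq-admin session show
done
```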

— pandvan, Nov 10 '20

@pandvan thanks for your testing!

Do you have more information on the subscriptions? Do you have something like a minimal reproducible case for that?

— ioolkos, Nov 12 '20

At the moment we're focusing more on stressing the broker cluster with many programmatically simulated clients, rather than on scaling the cluster size.

Sorry, but we have no more information about the subscriptions because we haven't seen that issue anymore. Or rather, on the few occasions it happened, the subscription counts for each broker became equal again after a few seconds. Could that be possible? In any case, we never faced the message delivery problem after modifying the SIGTERM handler script.

— pandvan, Nov 12 '20

We faced the same issue whenever a Pod was deleted for any reason. After commenting out the line that @ioolkos points to above, we verified that this is no longer the case. Are you aware of any cases in which this "cluster leave" behaviour should be kept, or should a PR follow that drops this line?

Off-topic: @pandvan, are you using an open-source tool for the simulated clients? I'm asking because we're trying to do the same.

— angeloskaltsikis, Nov 27 '20

@angeloskaltsikis A PR will follow. Note that you would still want a cluster leave in the case where you tell K8s to scale down the cluster.
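For completeness, the scale-down flow with a stop-only image could then look like this (a sketch; resource and node names are placeholders):

```bash
# 1. Gracefully remove the departing node from the cluster; -k also kills
#    its client sessions so they reconnect to the remaining nodes.
kubectl exec vernemq-0 -- vmq-admin cluster leave node=VerneMQ@<departing-node> -k

# 2. Shrink the StatefulSet so Kubernetes removes the corresponding Pod.
kubectl scale statefulset vernemq --replicas=2
```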

— ioolkos, Nov 27 '20