docker-vernemq
VerneMQ reports as healthy before cluster sync is complete
We are running VerneMQ as a 2-node cluster in Kubernetes. On restart of a VerneMQ instance we purge that node's state, because the node also leaves the cluster on shutdown. (This behavior is hardcoded in the entrypoint script of the image.)
After the restart the instance needs to sync some state with the other node. We would expect the restarted node to report as healthy only after it has synchronized the necessary state from the other node. In reality, it may start reporting as healthy before synchronization is finished.
This causes problems during Kubernetes rolling updates: Kubernetes shuts down the second node before that node has finished syncing its state to the freshly updated node.
A symptom of this behavior is that the admin api-keys are sometimes gone after a rolling update of the VerneMQ StatefulSet.
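For context, the health check in question amounts to a readiness probe along these lines (a sketch, not our exact manifest; the probe command is the same `vernemq ping` used in the repro below):

```yaml
# Sketch of a StatefulSet container readiness probe based on `vernemq ping`.
# This is the kind of probe that reports "ready" too early, because
# `ping` says nothing about whether cluster state has been synced.
readinessProbe:
  exec:
    command: ["vernemq", "ping"]
  initialDelaySeconds: 5
  periodSeconds: 10
```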
I can reproduce the problem with the following docker-compose file and shell script. I am executing the `vernemq ping` command manually, which is also what the Docker health checks use. I guess the problem is easier to reproduce if there is more state (like retained messages) to sync:
```yaml
version: "2.4"
services:
  vmq0:
    image: vernemq/vernemq:1.10.4.1-alpine
    environment:
      DOCKER_VERNEMQ_ACCEPT_EULA: "yes"
  vmq1:
    image: vernemq/vernemq:1.10.4.1-alpine
    depends_on:
      - vmq0
    environment:
      DOCKER_VERNEMQ_ACCEPT_EULA: "yes"
      DOCKER_VERNEMQ_DISCOVERY_NODE: vmq0
      DOCKER_VERNEMQ_COMPOSE: 1
  vmq2:
    image: vernemq/vernemq:1.10.4.1-alpine
    scale: 0
    environment:
      DOCKER_VERNEMQ_ACCEPT_EULA: "yes"
      DOCKER_VERNEMQ_DISCOVERY_NODE: vmq1
      DOCKER_VERNEMQ_COMPOSE: 1
```
```bash
#!/bin/bash -e
docker-compose up -d
echo "Waiting for vmq0"
until docker-compose exec -T vmq0 vernemq ping; do
  sleep 1
done
echo "Waiting for vmq1"
until docker-compose exec -T vmq1 vernemq ping; do
  sleep 1
done
set -x
docker-compose exec -T vmq0 vmq-admin api-key create
docker-compose exec -T vmq0 vmq-admin api-key show
docker-compose exec -T vmq1 vmq-admin api-key show
docker-compose up -d --scale vmq0=0
docker-compose up -d --scale vmq2=1 --scale vmq0=0
set +x
echo "Waiting for vmq2"
# We need to wait a little before we can use the vernemq ping command?!
sleep 2
until docker-compose exec -T vmq2 vernemq ping; do
  sleep 1
done
docker-compose up -d --scale vmq0=0 --scale vmq1=0 --scale vmq2=1
echo "The following command will not print the api-key anymore:"
set -x
docker-compose exec -T vmq2 vmq-admin api-key show
```
@PSanetra thanks, stuff to think about!
I just opened https://github.com/vernemq/docker-vernemq/issues/254 a couple of minutes ago, pondering whether we should not make stopped nodes leave the cluster in the bin script.
The `ping` command pretty much only checks whether the VerneMQ node is reachable; it allows no conclusions about the synced state.
But yeah, this is obviously an issue. It has to be resolved at the "giving" node, by delaying the termination of that node. Off the top of my head, I'm unclear how much logic needs to be added to Verne itself for that. Maybe not much.
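As a Kubernetes-side stopgap, delaying termination could be sketched with a preStop hook. This is illustrative only (the 30s/60s values are assumptions, and it does not actually check sync progress; it just postpones SIGTERM):

```yaml
# Illustrative pod spec fragment: give the terminating VerneMQ pod some
# time to finish replicating before the container receives SIGTERM.
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: vernemq
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "30"]
```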
On the Kubernetes side I've seen the wildest stuff, like spawning intermittent state-draining pods etc. (in general, I mean, not for Verne).
EDIT: There's also an interesting point here: how do we determine fully replicated state? Verne nodes are constantly replicating with an anti-entropy protocol. This is different from, say, RabbitMQ, which, in a similar reboot case, chooses a disk-based node and copies its state. Checking how Rabbit resolves similar issues might still be a good idea here.
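A somewhat stricter check could combine `vernemq ping` with parsing `vmq-admin cluster show` and only report ready once the expected number of peers is running. A hedged sketch (the `EXPECTED_NODES` variable and the `true`-column parsing are assumptions about the CLI's table output, and this still doesn't prove the anti-entropy replication has converged):

```shell
#!/bin/sh
# Sketch of a stricter readiness check. It does NOT answer the
# "fully replicated" question above; it only verifies that this
# node is up and sees the expected number of running cluster peers.

EXPECTED_NODES="${EXPECTED_NODES:-2}"  # assumption: set via the environment

# Count rows of `vmq-admin cluster show` whose Running column is "true".
# The exact table format is an assumption about the CLI output.
running_nodes() {
  grep -c '| *true *|'
}

ready() {
  vernemq ping >/dev/null 2>&1 || return 1
  running="$(vmq-admin cluster show | running_nodes)"
  [ "$running" -ge "$EXPECTED_NODES" ]
}
```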