
VerneMQ reports as healthy before cluster sync is complete

PSanetra opened this issue 3 years ago · 1 comment

We are running VerneMQ in a 2-node cluster in Kubernetes. When a VerneMQ instance restarts, we purge that node's state, because the node also leaves the cluster on shutdown. (This behavior is hardcoded in the entrypoint script of the image.)

After the restart, the instance needs to sync state from the other node. We would expect the restarted node to report as healthy only after it has synchronized the necessary state from its peer. In reality, it can start reporting as healthy before synchronization is complete.

This causes problems during Kubernetes rolling updates: the second node is shut down before it has finished syncing its state to the just-updated node.

One symptom of this behavior is that the admin api-keys are sometimes gone after a rolling update of the VerneMQ StatefulSet.

I can reproduce the problem with the following docker-compose file and shell script. I run the vernemq ping command manually here; it is the same command the Docker health checks use. The problem is probably easier to reproduce when there is more state (like retained messages) to sync:

version: "2.4"
services:
  vmq0:
    image: vernemq/vernemq:1.10.4.1-alpine
    environment:
      DOCKER_VERNEMQ_ACCEPT_EULA: "yes"
  vmq1:
    image: vernemq/vernemq:1.10.4.1-alpine
    depends_on:
      - vmq0
    environment:
      DOCKER_VERNEMQ_ACCEPT_EULA: "yes"
      DOCKER_VERNEMQ_DISCOVERY_NODE: vmq0
      DOCKER_VERNEMQ_COMPOSE: 1
  vmq2:
    image: vernemq/vernemq:1.10.4.1-alpine
    scale: 0
    environment:
      DOCKER_VERNEMQ_ACCEPT_EULA: "yes"
      DOCKER_VERNEMQ_DISCOVERY_NODE: vmq1
      DOCKER_VERNEMQ_COMPOSE: 1

#!/bin/bash -e

docker-compose up -d

echo "Waiting for vmq0"
until docker-compose exec -T vmq0 vernemq ping; do
  sleep 1
done

echo "Waiting for vmq1"
until docker-compose exec -T vmq1 vernemq ping; do
  sleep 1
done

set -x

docker-compose exec -T vmq0 vmq-admin api-key create

docker-compose exec -T vmq0 vmq-admin api-key show

docker-compose exec -T vmq1 vmq-admin api-key show

docker-compose up -d --scale vmq0=0

docker-compose up -d --scale vmq2=1 --scale vmq0=0

set +x

echo "Waiting for vmq2"
# We need to wait a little before we can use the vernemq ping command?!
sleep 2
until docker-compose exec -T vmq2 vernemq ping; do
  sleep 1
done

docker-compose up -d --scale vmq0=0 --scale vmq1=0 --scale vmq2=1

echo "The following command will not print the api-key anymore:"

set -x

docker-compose exec -T vmq2 vmq-admin api-key show

PSanetra avatar Oct 20 '20 08:10 PSanetra

@PSanetra thanks, stuff to think about!

I just opened https://github.com/vernemq/docker-vernemq/issues/254 a couple of minutes ago, pondering whether we should stop making stopped nodes leave the cluster in the bin script.

The ping command pretty much only checks whether the VerneMQ node is reachable; it allows no conclusions about the synced state.
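
For illustration, a stricter health check could combine ping with a look at the node's cluster view. This is only a sketch: the EXPECTED_NODES variable and the parsing of the vmq-admin cluster show output are assumptions, not something the image ships with.

#!/bin/sh
# Hypothetical readiness check, stricter than `vernemq ping` alone.
# EXPECTED_NODES and the parsing of `vmq-admin cluster show` output are
# assumptions for illustration, not part of the image.
EXPECTED_NODES="${EXPECTED_NODES:-2}"

# 1) Reachability: the same check the current health check performs.
vernemq ping > /dev/null || exit 1

# 2) Cluster view: require that this node sees the expected number of
#    running peers (assumes the output prints one "true" per running node).
RUNNING=$(vmq-admin cluster show | grep -c true)
[ "$RUNNING" -ge "$EXPECTED_NODES" ] || exit 1

Even then, a full cluster view only proves membership and reachability, not that metadata replication has finished.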

But yeah, this is obviously an issue. It has to be resolved at the "giving" node, by delaying the termination of that node. Off the top of my head, I'm unclear how much logic would need to be added to Verne itself for that. Maybe not much.

On the Kubernetes side I've seen the wildest stuff, like spawning short-lived state-draining pods and so on (in general, not for Verne).
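
A much cruder Kubernetes-side sketch would be a preStop hook on the node that is about to terminate, delaying shutdown until the peers look reachable again plus a fixed grace period. The wait budget, the grace period, and the reliance on vmq-admin cluster show output are all assumptions; this buys time rather than proving that replication has completed.

#!/bin/sh
# Hypothetical preStop hook body for the "giving" node: wait until every
# peer reports as running, then keep the node alive for a fixed grace
# period so the restarted peer can pull state. The 60s wait budget and
# the 30s grace period are arbitrary assumptions.
BUDGET=60
while [ "$BUDGET" -gt 0 ]; do
  # Assumes `vmq-admin cluster show` prints "false" for unreachable peers.
  if ! vmq-admin cluster show | grep -q false; then
    break
  fi
  sleep 1
  BUDGET=$((BUDGET - 1))
done
sleep 30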

EDIT: There's also an interesting question here: how do we determine fully replicated state? Verne nodes replicate continuously with an anti-entropy protocol. This is different from, say, RabbitMQ, which in a similar reboot case chooses a disk-based node and copies its state. Checking how Rabbit resolves similar issues might still be a good idea here.

ioolkos avatar Oct 20 '20 08:10 ioolkos