`received full heartbeat request addressed to node with different revision` after rolling restart with ephemeral storage
What happened?
Running kubectl rollout restart sts redpanda -n redpanda after deploying with ephemeral storage results in an unhealthy cluster: the destroyed broker remains registered in the cluster, and the replacement broker is assigned a new node ID, increasing the broker count by one.
What did you expect to happen?
I expected the rolling restart to complete and leave the cluster healthy.
How can we reproduce it (as minimally and precisely as possible)? Please include the values file.
Create kind cluster:
kind create cluster --name jlp-cluster --config ~/projects/redpanda/kind-config.yaml
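The kind config referenced above isn't included in this report; a hypothetical minimal multi-node layout that could be written to that path beforehand (one control-plane node plus three workers, so each Redpanda broker gets its own node) would be:

```bash
# Hypothetical kind-config.yaml (the original file isn't included in the report).
cat << EOF > ~/projects/redpanda/kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF
```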
Create Redpanda config with ephemeral storage:
cat << EOF > values.yaml
tls:
  enabled: false
storage:
  persistentVolume:
    enabled: false
EOF
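As a sanity check that this really makes the data directory ephemeral, rendering the chart locally should show the datadir backed by an emptyDir volume (the grep context and exact volume naming are assumptions and may vary by chart version):

```bash
# Render the chart offline and look for the emptyDir-backed datadir.
helm template redpanda redpanda --repo https://charts.redpanda.com \
  --version 5.8.12 --set image.tag=v24.1.8 -f values.yaml | grep -B 3 emptyDir
```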
Deploy Redpanda v24.1.8 with chart version 5.8.12 via Helm:
helm install redpanda redpanda --repo https://charts.redpanda.com -n redpanda --wait --create-namespace --set version=5.8.12 --set image.tag=v24.1.8 -f values.yaml
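One way to confirm the cluster is healthy before restarting (mirroring the rpk commands used further down):

```bash
# Check health from inside one of the brokers once all pods are Ready.
kubectl exec -it redpanda-0 -n redpanda -c redpanda -- rpk cluster health
```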
Once the cluster is healthy, do a rolling restart:
kubectl rollout restart sts redpanda -n redpanda
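To see where the rollout stalls (as noted below, it sometimes doesn't continue past redpanda-2), the StatefulSet can be watched from another terminal:

```bash
# Blocks until the rollout completes; if it hangs, the StatefulSet is stuck
# waiting on a pod that never becomes Ready.
kubectl rollout status statefulset/redpanda -n redpanda
```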
Continuously run the following command until redpanda-2 is available:
kubectl logs pod/redpanda-2 -n redpanda -f
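Since the pod doesn't exist until the StatefulSet recreates it, the log command fails at first; a small retry loop (the 2-second interval is arbitrary) saves re-running it by hand:

```bash
# Keep retrying until redpanda-2 exists, then follow its logs.
until kubectl logs pod/redpanda-2 -n redpanda -f; do
  sleep 2
done
```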
Eventually you will see the following warning printed constantly:
WARN 2024-07-19 18:46:48,917 [shard 0:raft] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:3922 - received full heartbeat request addressed to node with different revision: {id: 2, revision: 0}, current node: {id: 3, revision: 0}, source: {id: 1, revision: 0}
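The node IDs in that warning can be cross-checked against what the cluster has registered, for example with rpk's admin wrapper (run against any live broker):

```bash
# List brokers as the admin API sees them, to compare against the IDs
# ({id: 2} vs. {id: 3}) in the heartbeat warning.
kubectl exec -it redpanda-0 -n redpanda -c redpanda -- rpk redpanda admin brokers list
```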
I've run through this multiple times. Sometimes the rolling restart doesn't continue past redpanda-2; other times it continues as expected. Most of the time the cluster ends up in the following state, where redpanda-2 is assigned a new node ID and redpanda-1 never returns to the cluster:
> kubectl exec -it redpanda-2 -n redpanda -c redpanda -- rpk cluster info
CLUSTER
=======
redpanda.a0eeca84-6e4c-44cc-b32c-3238b20a8679
BROKERS
=======
ID  HOST                                             PORT
0*  redpanda-0.redpanda.redpanda.svc.cluster.local.  9093
3   redpanda-2.redpanda.redpanda.svc.cluster.local.  9093
TOPICS
======
NAME      PARTITIONS  REPLICAS
_schemas  1           3
> kubectl exec -it redpanda-2 -n redpanda -c redpanda -- rpk cluster health
CLUSTER HEALTH OVERVIEW
=======================
Healthy: false
Unhealthy reasons: [leaderless_partitions nodes_down under_replicated_partitions]
Controller ID: 0
All nodes: [0 1 2 3]
Nodes down: [1]
Leaderless partitions (1): [kafka/_schemas/0]
Under-replicated partitions (1): [redpanda/controller/0]
Anything else we need to know?
We have this doc for this configuration, but it doesn't mention any issue with restarting. It seems that running in this state is never a good idea, since any time a broker leaves the cluster, the cluster becomes unhealthy; the brokers should be decommissioned when using ephemeral storage. We also have this doc explaining how to perform a rolling restart, with no mention of any issues when using ephemeral storage.
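For completeness, a hedged sketch of the manual cleanup implied above: decommissioning the stale node ID reported as down (1 in the health output), run from any live broker. Note that with only two live brokers left, replication-factor-3 partitions can't be fully re-replicated until a third broker rejoins.

```bash
# Remove the dead node ID so its partitions can be reassigned.
kubectl exec -it redpanda-2 -n redpanda -c redpanda -- \
  rpk redpanda admin brokers decommission 1
# Optionally watch progress (available in recent rpk versions).
kubectl exec -it redpanda-2 -n redpanda -c redpanda -- \
  rpk redpanda admin brokers decommission-status 1
```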
It would be great if we could block any changes when users run kubectl rollout restart sts redpanda -n redpanda while they also have storage.persistentVolume.enabled: false.
Which are the affected charts?
Redpanda
Chart Version(s)
This happens with all chart versions I've tested, from 5.7.24 through 5.8.12.
Cloud provider
none
JIRA Link: K8S-299
Some excellent discussion going on in our internal slack.
The tl;dr is that we (I) don't believe there are use cases for ephemeral storage outside of simple testing / verification of chart / Redpanda behaviors. If anyone has other use cases, please chime in!
Until then, we'll update the docs and add some red tape to both NOTES.txt and the values.yaml file indicating that the errors seen here are expected behavior.
- [ ] @chrisseto file a docs issue