
Inconsistent revision and data after --force-new-cluster

Open brandond opened this issue 3 years ago • 6 comments

What happened?

After starting etcd with --force-new-cluster, removing the database files from the secondary nodes, and rejoining them to the cluster, the cluster is now in a split-brain state. Reads from the 1st node (the one that was started with --force-new-cluster) return different data for some keys than reads from the nodes that were deleted and subsequently rejoined to the cluster.
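
The divergence is easy to see by issuing the same read against each member individually, something like the following (sketch; TLS flags omitted, key and endpoints are from my environment):

# read the same key from each member's client endpoint
etcdctl --endpoints=https://172.31.30.121:2379 get /registry/services/endpoints/default/kubernetes
etcdctl --endpoints=https://172.31.17.205:2379 get /registry/services/endpoints/default/kubernetes
# the two responses differ in both mod_revision and value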

The end result feels identical to https://github.com/etcd-io/etcd/issues/13766, but it can be reproduced with a fairly trivial amount of traffic in conjunction with --force-new-cluster.

Examining the databases from the nodes with etcd-dump-db and etcd-dump-logs shows the same event sequence in the WAL, but the db itself contains different values in the keystore. I'm not pasting the WAL dump here, but will attach the data-dir from both cluster members.
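
The WAL comparison was done with etcd-dump-logs along these lines (sketch; the tool just takes the data-dir as its argument):

etcd-dump-logs ./etcd-1 > etcd-1.wal.txt
etcd-dump-logs ./etcd-2 > etcd-2.wal.txt
diff etcd-1.wal.txt etcd-2.wal.txt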

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 key | grep /registry/services/endpoints/default/kubernetes
key="\x00\x00\x00\x00\x00\x00\nR_\x00\x00\x00\x00\x00\x00\x00\x00", value="\n//registry/services/endpoints/default/kubernetes\x10\xd2\x01\x18\xd2\x14 \x04*\xe2\x02k8s\x00\n\x0f\n\x02v1\x12\tEndpoints\x12\xc6\x02\n\x9c\x02\n\nkubernetes\x12\x00\x1a\adefault\"\x00*$fa88d66d-7bdc-4302-bc82-6c850ff4b85e2\x008\x00B\b\b\x98\xde˓\x06\x10\x00Z/\n'endpointslice.kubernetes.io/skip-mirror\x12\x04truez\x00\x8a\x01\x98\x01\n\x0ekube-apiserver\x12\x06Update\x1a\x02v1\"\b\b\x98\xde˓\x06\x10\x002\bFieldsV1:d\nb{\"f:metadata\":{\"f:labels\":{\".\":{},\"f:endpointslice.kubernetes.io/skip-mirror\":{}}},\"f:subsets\":{}}B\x00\x12%\n\x12\n\x0e18.219.153.245\x1a\x00\x1a\x0f\n\x05https\x10\xab2\x1a\x03TCP\x1a\x00\"\x00"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 key | grep /registry/services/endpoints/default/kubernetes
key="\x00\x00\x00\x00\x00\x00\f\xb9_\x00\x00\x00\x00\x00\x00\x00\x00", value="\n//registry/services/endpoints/default/kubernetes\x10\xd2\x01\x18\xb9\x19 \a*\xe2\x02k8s\x00\n\x0f\n\x02v1\x12\tEndpoints\x12\xc6\x02\n\x9c\x02\n\nkubernetes\x12\x00\x1a\adefault\"\x00*$fa88d66d-7bdc-4302-bc82-6c850ff4b85e2\x008\x00B\b\b\x98\xde˓\x06\x10\x00Z/\n'endpointslice.kubernetes.io/skip-mirror\x12\x04truez\x00\x8a\x01\x98\x01\n\x0ekube-apiserver\x12\x06Update\x1a\x02v1\"\b\b\x98\xde˓\x06\x10\x002\bFieldsV1:d\nb{\"f:metadata\":{\"f:labels\":{\".\":{},\"f:endpointslice.kubernetes.io/skip-mirror\":{}}},\"f:subsets\":{}}B\x00\x12%\n\x12\n\x0e18.219.188.103\x1a\x00\x1a\x0f\n\x05https\x10\xab2\x1a\x03TCP\x1a\x00\"\x00"

Also, the datastore for both nodes shows different values for the members and members_removed keys. I'm not sure if this is normal or not:

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members
key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members_removed
key="77aa3673d9e0e2", value="removed"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members
key="77aa3673d9e0e2", value="{\"id\":33682673077182690,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members_removed

brandond@dev01:~/etcd-split-brain$

What did you expect to happen?

Consistent data returned by both cluster members.

How can we reproduce it (as minimally and precisely as possible)?

Start etcd with --force-new-cluster while a running Kubernetes apiserver is pointed at the etcd server. I have not been able to reproduce this ad-hoc with direct writes to a single key.
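
Roughly, the reset sequence looks like this (sketch; RKE2 drives these steps for me, the member name, peer URL, and data-dir are from my environment, and the configuration shown below is passed on each start):

# on the node being kept: restart etcd against its existing data-dir with --force-new-cluster
etcd <flags/config from the configuration section below> --force-new-cluster

# on each other node: remove the old database files
rm -rf /var/lib/rancher/rke2/server/db/etcd

# on the kept node: re-add the wiped node (the member dumps above show it being added as a learner)
etcdctl member add ip-172-31-17-205-72ea150e --learner --peer-urls=https://172.31.17.205:2380

# on the wiped node: start etcd again with initial-cluster-state=existing so it syncs from the leader
etcd <flags/config from the configuration section below>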

Anything else we need to know?

No response

Etcd version (please run commands below)

ubuntu@ip-172-31-17-205:~$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: linux/amd64

ubuntu@ip-172-31-17-205:~$ etcdctl version
etcdctl version: 3.5.0
API version: 3.5

Etcd configuration (command line flags or environment variables)

advertise-client-urls: https://172.31.17.205:2379
client-transport-security:
  cert-file: /var/lib/rancher/rke2/server/tls/etcd/server-client.crt
  client-cert-auth: true
  key-file: /var/lib/rancher/rke2/server/tls/etcd/server-client.key
  trusted-ca-file: /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
data-dir: /var/lib/rancher/rke2/server/db/etcd
election-timeout: 5000
heartbeat-interval: 500
initial-cluster: ip-172-31-30-121-0cbb6287=https://172.31.30.121:2380,ip-172-31-17-205-72ea150e=https://172.31.17.205:2380
initial-cluster-state: existing
listen-client-urls: https://172.31.17.205:2379,https://127.0.0.1:2379
listen-metrics-urls: http://127.0.0.1:2381
listen-peer-urls: https://172.31.17.205:2380
log-outputs:
- stderr
logger: zap
name: ip-172-31-17-205-72ea150e
peer-transport-security:
  cert-file: /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.crt
  client-cert-auth: true
  key-file: /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.key
  trusted-ca-file: /var/lib/rancher/rke2/server/tls/etcd/peer-ca.crt

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

data-dir from both cluster members: etcd-split-brain.zip

brandond avatar May 04 '22 23:05 brandond

I've done some additional hacking at this and eliminated a couple of possibilities:

  • Ensuring that there are no etcd clients (the Kubernetes apiserver, for example) or other peer nodes running when the node is restarted with --force-new-cluster does not make any difference.
  • Enabling startup and periodic corruption checks (flags sketched below) does not show anything interesting.
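
The corruption checks were enabled with the experimental flags, roughly (sketch; the check interval is arbitrary):

etcd <flags/config from the configuration above> --experimental-initial-corrupt-check=true --experimental-corrupt-check-time=10m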

However, one interesting finding: if I take a snapshot from any node in an affected cluster, restore that snapshot, start with --force-new-cluster, and then rejoin the other nodes, the cluster returns consistent results again.
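
That recovery sequence looks roughly like this (sketch; snapshot.db is an arbitrary file name, and the deprecated etcdctl snapshot restore behaves the same as etcdutl here):

# take a snapshot from any member of the affected cluster
etcdctl --endpoints=https://172.31.30.121:2379 snapshot save snapshot.db

# restore it into a fresh data-dir on the node being kept
etcdutl snapshot restore snapshot.db --data-dir /var/lib/rancher/rke2/server/db/etcd

# start etcd from the restored data-dir with --force-new-cluster, then re-add and
# rejoin the other members as in the original steps above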

brandond avatar May 05 '22 21:05 brandond

I could not reproduce this issue with 3.5.4. Please provide the detailed steps and the commands you executed in each step.

ahrtr avatar May 06 '22 01:05 ahrtr

I have not been able to reproduce it without a Kubernetes cluster pointed at the etcd cluster. Not sure if it is a load issue, or something to do with the way Kubernetes uses transactions for its create/update operations. Is there a suggested load simulation tool that I could test with?
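
For reference, the Kubernetes create/update path is (very roughly) a guarded transaction: compare the key's mod revision against an expected value, put on success, read back the current object on failure. With etcdctl it would look something like this (sketch; the key and revision are made up):

etcdctl txn <<'EOF'
mod("/registry/pods/default/example") = "123"

put /registry/pods/default/example updated-object

get /registry/pods/default/example

EOF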

brandond avatar May 06 '22 02:05 brandond

I have not been able to reproduce it without a Kubernetes cluster pointed at the etcd cluster. Not sure if it is a load issue, or something to do with the way Kubernetes uses transactions for its create/update operations. Is there a suggested load simulation tool that I could test with?

You can try the benchmark tool. See an example command below,

./benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=500000000 --key-size=128 --val-size=10240  --total=100000 --rate=10000

ahrtr avatar May 08 '22 01:05 ahrtr

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members
key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members_removed
key="77aa3673d9e0e2", value="removed"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members
key="77aa3673d9e0e2", value="{\"id\":33682673077182690,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members_removed

brandond@dev01:~/etcd-split-brain$

It's interesting that there are two learners on etcd-2 and only one learner on etcd-1. Could you provide detailed steps (with the exact commands) on how to reproduce this issue?

ahrtr avatar Sep 08 '22 08:09 ahrtr

I really wish I knew how to reproduce it using just etcd and a bare etcd3 client or the benchmark tool. At the moment I can only reproduce it when I have a Kubernetes cluster pointed at etcd and use the --force-new-cluster option to reset the cluster membership back to a single node.

brandond avatar Sep 08 '22 10:09 brandond

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 31 '22 23:12 stale[bot]