Inconsistent revision and data after --force-new-cluster
What happened?
After starting etcd with --force-new-cluster, removing the database files from the secondary nodes, and rejoining them to the cluster, the cluster is now in a split-brain state. Reads from the first node (the one that was started with --force-new-cluster) return different data for some keys than reads from the nodes that were wiped and subsequently rejoined to the cluster.
The end result feels identical to https://github.com/etcd-io/etcd/issues/13766, but can be reproduced with a fairly trivial amount of traffic in conjunction with using --force-new-cluster.
Examining the nodes' data with etcd-dump-db and etcd-dump-logs shows the same event sequence in the WAL, but the db itself contains different values in the keystore. I'm not pasting the WAL dump here, but I will attach the data-dir from both cluster members.
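For reference, the WAL comparison was done roughly like this (a sketch; etcd-1 and etcd-2 are the attached data-dirs):
etcd-dump-logs ./etcd-1 > etcd-1-wal.txt
etcd-dump-logs ./etcd-2 > etcd-2-wal.txt
diff etcd-1-wal.txt etcd-2-wal.txt   # entries match on both members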
brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 key | grep /registry/services/endpoints/default/kubernetes
key="\x00\x00\x00\x00\x00\x00\nR_\x00\x00\x00\x00\x00\x00\x00\x00", value="\n//registry/services/endpoints/default/kubernetes\x10\xd2\x01\x18\xd2\x14 \x04*\xe2\x02k8s\x00\n\x0f\n\x02v1\x12\tEndpoints\x12\xc6\x02\n\x9c\x02\n\nkubernetes\x12\x00\x1a\adefault\"\x00*$fa88d66d-7bdc-4302-bc82-6c850ff4b85e2\x008\x00B\b\b\x98\xde˓\x06\x10\x00Z/\n'endpointslice.kubernetes.io/skip-mirror\x12\x04truez\x00\x8a\x01\x98\x01\n\x0ekube-apiserver\x12\x06Update\x1a\x02v1\"\b\b\x98\xde˓\x06\x10\x002\bFieldsV1:d\nb{\"f:metadata\":{\"f:labels\":{\".\":{},\"f:endpointslice.kubernetes.io/skip-mirror\":{}}},\"f:subsets\":{}}B\x00\x12%\n\x12\n\x0e18.219.153.245\x1a\x00\x1a\x0f\n\x05https\x10\xab2\x1a\x03TCP\x1a\x00\"\x00"
brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 key | grep /registry/services/endpoints/default/kubernetes
key="\x00\x00\x00\x00\x00\x00\f\xb9_\x00\x00\x00\x00\x00\x00\x00\x00", value="\n//registry/services/endpoints/default/kubernetes\x10\xd2\x01\x18\xb9\x19 \a*\xe2\x02k8s\x00\n\x0f\n\x02v1\x12\tEndpoints\x12\xc6\x02\n\x9c\x02\n\nkubernetes\x12\x00\x1a\adefault\"\x00*$fa88d66d-7bdc-4302-bc82-6c850ff4b85e2\x008\x00B\b\b\x98\xde˓\x06\x10\x00Z/\n'endpointslice.kubernetes.io/skip-mirror\x12\x04truez\x00\x8a\x01\x98\x01\n\x0ekube-apiserver\x12\x06Update\x1a\x02v1\"\b\b\x98\xde˓\x06\x10\x002\bFieldsV1:d\nb{\"f:metadata\":{\"f:labels\":{\".\":{},\"f:endpointslice.kubernetes.io/skip-mirror\":{}}},\"f:subsets\":{}}B\x00\x12%\n\x12\n\x0e18.219.188.103\x1a\x00\x1a\x0f\n\x05https\x10\xab2\x1a\x03TCP\x1a\x00\"\x00"
Also, the datastore for both nodes shows different values for the members and members_removed keys. I'm not sure whether this is normal or not:
brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members
key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"
brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members_removed
key="77aa3673d9e0e2", value="removed"
brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members
key="77aa3673d9e0e2", value="{\"id\":33682673077182690,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"
brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members_removed
brandond@dev01:~/etcd-split-brain$
What did you expect to happen?
Consistent data returned by both cluster members.
How can we reproduce it (as minimally and precisely as possible)?
Start etcd with --force-new-cluster while a running Kubernetes apiserver is pointed at the etcd server, then wipe and rejoin the other members. I have not been able to reproduce this ad hoc with direct writes to a single key.
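Roughly, the sequence is the following (a sketch of the steps described above; paths, names, and URLs match my setup and are only illustrative):
# on the surviving node, restart etcd once with --force-new-cluster while the apiserver keeps writing
etcd --config-file /path/to/etcd-config.yaml --force-new-cluster
# on each other node, wipe the data dir, re-add it as a learner, and start it again
rm -rf /var/lib/rancher/rke2/server/db/etcd
etcdctl member add ip-172-31-30-121-53c44a92 --peer-urls=https://172.31.30.121:2380 --learner
etcd --config-file /path/to/etcd-config.yaml   # with initial-cluster-state: existing; promote later with etcdctl member promote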
Anything else we need to know?
No response
Etcd version (please run commands below)
ubuntu@ip-172-31-17-205:~$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: linux/amd64
ubuntu@ip-172-31-17-205:~$ etcdctl version
etcdctl version: 3.5.0
API version: 3.5
Etcd configuration (command line flags or environment variables)
advertise-client-urls: https://172.31.17.205:2379
client-transport-security:
  cert-file: /var/lib/rancher/rke2/server/tls/etcd/server-client.crt
  client-cert-auth: true
  key-file: /var/lib/rancher/rke2/server/tls/etcd/server-client.key
  trusted-ca-file: /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
data-dir: /var/lib/rancher/rke2/server/db/etcd
election-timeout: 5000
heartbeat-interval: 500
initial-cluster: ip-172-31-30-121-0cbb6287=https://172.31.30.121:2380,ip-172-31-17-205-72ea150e=https://172.31.17.205:2380
initial-cluster-state: existing
listen-client-urls: https://172.31.17.205:2379,https://127.0.0.1:2379
listen-metrics-urls: http://127.0.0.1:2381
listen-peer-urls: https://172.31.17.205:2380
log-outputs:
- stderr
logger: zap
name: ip-172-31-17-205-72ea150e
peer-transport-security:
  cert-file: /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.crt
  client-cert-auth: true
  key-file: /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.key
  trusted-ca-file: /var/lib/rancher/rke2/server/tls/etcd/peer-ca.crt
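This YAML is passed to etcd as a config file, roughly (path illustrative):
etcd --config-file /etc/rancher/etcd-config.yaml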
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
Relevant log output
data-dir from both cluster members: etcd-split-brain.zip
I've done some additional hacking at this and eliminated a couple of possibilities:
- Ensuring that there are no etcd clients (the Kubernetes apiserver, for example) or other peer nodes running when the node is restarted with --force-new-cluster does not make any difference.
- Enabling startup and periodic corruption checks does not show anything interesting (flags sketched below).
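For reference, the corruption checks above were enabled with the experimental flags, roughly as follows (interval illustrative):
etcd --config-file /path/to/etcd-config.yaml --experimental-initial-corrupt-check=true --experimental-corrupt-check-time=5m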
However, one interesting finding: if I take a snapshot from any node in an affected cluster, restore that snapshot, start with --force-new-cluster, and then rejoin the other nodes, the cluster returns consistent results (see the sketch below).
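A sketch of that recovery sequence (endpoints and paths are illustrative):
etcdctl --endpoints=https://172.31.17.205:2379 snapshot save /tmp/etcd-snap.db
etcdutl snapshot restore /tmp/etcd-snap.db --data-dir /var/lib/rancher/rke2/server/db/etcd
# start etcd on the restored data-dir once with --force-new-cluster,
# then re-add and rejoin the other members with clean data dirs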
I could not reproduce this issue with 3.5.4. Please provide detailed steps and the command you executed at each step.
I have not been able to reproduce it without a Kubernetes cluster pointed at the etcd cluster. I'm not sure whether it is a load issue, or something to do with the way Kubernetes uses transactions for its create/update operations. Is there a suggested load-simulation tool that I could test with?
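For context, the apiserver's create/update path is essentially a guarded transaction; in etcdctl syntax it looks roughly like this (key, revision, and value are illustrative):
etcdctl txn <<'EOF'
mod("/registry/services/endpoints/default/kubernetes") = "2345"

put /registry/services/endpoints/default/kubernetes "serialized-object"

EOF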
You can try the benchmark tool. See an example command below:
./benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=500000000 --key-size=128 --val-size=10240 --total=100000 --rate=10000
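After a run, you can check whether the members have diverged by reading the same key from each endpoint with a serializable (member-local) read and comparing the results, e.g. (endpoints illustrative):
etcdctl --endpoints=https://172.31.17.205:2379 get /registry/services/endpoints/default/kubernetes -w json --consistency=s
etcdctl --endpoints=https://172.31.30.121:2379 get /registry/services/endpoints/default/kubernetes -w json --consistency=s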
brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members
key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"
brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members_removed
key="77aa3673d9e0e2", value="removed"
brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members
key="77aa3673d9e0e2", value="{\"id\":33682673077182690,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"
brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members_removed
brandond@dev01:~/etcd-split-brain$
It's interesting that there are two learners on etcd-2 but only one on etcd-1. Could you provide detailed steps (with commands) on how to reproduce this issue?
I really wish I knew how to reproduce it using just etcd and a bare etcd3 client or the benchmark tool. At the moment I can only reproduce it when I have a Kubernetes cluster pointed at etcd, and use the --force-new-cluster option to reset the cluster membership back to a single node.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.