Etcd snapshot save/restore documentation needs enhancement
I have a three-node etcd cluster (used with Patroni), and one node decided to break.
Read through https://etcd.io/docs/v3.5/op-guide/recovery/ and wasn't able to get the node to work.
Errors:
2023-04-21 14:41:13.820783 I | raft: 7088d2ed19f2af13 became follower at term 97778
2023-04-21 14:41:13.820946 C | raft: tocommit(28704400) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(28704400) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
Did a bunch of googling, found lots related to Kubernetes, found a few for standalone clusters and/or older versions.
Finally figured out my problems (etcd-related, anyway):
- On a good node:
$ sudo env ETCDCTL_API=3 etcdctl snapshot save /tmp/snapshot.db
$ scp /tmp/snapshot.db badnode:/tmp
- On the bad node:
$ sudo mv /var/lib/etcd/default /var/lib/etcd/default.$(date +%s)
$ sudo env ETCDCTL_API=3 etcdctl snapshot restore --data-dir /var/lib/etcd/default /tmp/snapshot.db
$ sudo chown -R etcd:etcd /var/lib/etcd/default
$ sudo systemctl restart etcd
$ sudo etcdctl member list
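The restore step above fails if the target data directory already exists, which is why the old directory is moved aside first. A minimal sketch of a pre-flight check that catches this before running the restore is shown below. The function name `preflight_restore` and its return codes are hypothetical and not part of etcd or etcdctl; it only inspects the filesystem.

```shell
#!/bin/sh
# Hypothetical helper (not part of etcd): refuse to start a restore that
# is already known to fail.
# Usage: preflight_restore DATA_DIR SNAPSHOT_FILE
preflight_restore() {
    data_dir=$1
    snapshot=$2

    # The restore needs a data directory that does not exist yet.
    if [ -e "$data_dir" ]; then
        echo "data dir already exists: $data_dir (move it aside first)" >&2
        return 1
    fi

    # The snapshot must be a readable, non-empty file.
    if [ ! -r "$snapshot" ] || [ ! -s "$snapshot" ]; then
        echo "snapshot missing, unreadable, or empty: $snapshot" >&2
        return 2
    fi

    return 0
}
```

With the paths from the steps above, it could be called as `preflight_restore /var/lib/etcd/default /tmp/snapshot.db` right before the `etcdctl snapshot restore` command.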
Notes:
- Taking the snapshot was easy; loved it.
- Restoring the snapshot was hard:
  - The data directory has to not exist first (though the error message made that clear)
  - The `ETCDCTL_API=3` environment variable shouldn't be necessary, it looks antiquated
  - The `--data-dir` value was gleaned from looking at a good node
    - Had to clean up after trial-and-erroring that a few times! :-)
  - The resulting files have to be owned by etcd:etcd
    - The error message seen with `journalctl` was: `etcd[241163]: error listing data dir: /var/lib/etcd/default`
    - If the error message could be changed to `etcd[241163]: cannot access '/var/lib/etcd/default': Permission denied` it'd be appreciated. :-)
- The `/etc/default/etcd` file contained a number of environment variables that etcd uses
  - Glad I didn't have to restore that; it's different on each node.
  - If the contents of that file could also be included in the snapshot somehow, that'd be nice.
    - Not sure what to do when restoring it.
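Since the "error listing data dir" message does not mention permissions, it may help to check ownership explicitly after a restore. The sketch below lists every path under a directory whose owner and group differ from the expected pair. The function name `list_wrong_owner` is hypothetical, and the sketch assumes GNU `stat` (the `-c` option), as shipped with Ubuntu.

```shell
#!/bin/sh
# Hypothetical helper (not part of etcd): print every path under DIR whose
# "user:group" differs from the expected value. Assumes GNU stat (-c).
# Usage: list_wrong_owner DIR USER:GROUP
list_wrong_owner() {
    dir=$1
    expected=$2
    # Emit "user:group path" for each entry, then keep only the mismatches.
    find "$dir" -exec stat -c '%U:%G %n' {} + |
        awk -v want="$expected" '$1 != want'
}
```

Run as root, `list_wrong_owner /var/lib/etcd/default etcd:etcd` printing nothing would suggest the `chown -R` step took effect; any line it prints names a path that still has a different owner.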
Environment: Ubuntu 22.04, etcd 3.2.26
Thanks for a great program!
Thanks for raising this detailed report @PenelopeFudd. Reviewing your notes, I agree there are some areas where we could expand the recovery documentation, namely making sure people are aware of the directory creation and ownership requirements.
I also like the suggestion of a more meaningful error message for snapshot restore permission issues; that would need to be completed in the main etcd repo. However, I do note you're using etcd 3.2.26, which is quite old, so we would need to verify whether that error has already been improved in later releases.
Thanks for reporting @PenelopeFudd and +1 @jmhbnz. v3.2 is not supported, so it would be great if you could try v3.5 (the version of the doc you are using) or the main branch.