Etcd snapshot save/restore documentation needs enhancement

Open PenelopeFudd opened this issue 2 years ago • 3 comments

I have a three-node etcd cluster (used with Patroni), and one node decided to break.

I read through https://etcd.io/docs/v3.5/op-guide/recovery/ but still wasn't able to get the node to work.

Errors:

2023-04-21 14:41:13.820783 I | raft: 7088d2ed19f2af13 became follower at term 97778
2023-04-21 14:41:13.820946 C | raft: tocommit(28704400) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(28704400) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

Did a bunch of googling, found lots related to Kubernetes, found a few for standalone clusters and/or older versions.

Finally figured out my problems (etcd-related, anyway):

  • On a good node:
$ sudo env ETCDCTL_API=3 etcdctl snapshot save /tmp/snapshot.db
$ scp /tmp/snapshot.db badnode:/tmp
  • On the bad node:
$ sudo mv /var/lib/etcd/default /var/lib/etcd/default.$(date +%s)
$ sudo env ETCDCTL_API=3 etcdctl snapshot restore --data-dir /var/lib/etcd/default /tmp/snapshot.db 
$ sudo chown -R etcd:etcd /var/lib/etcd/default
$ sudo systemctl restart etcd
$ sudo etcdctl member list
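
The bad-node steps above could be wrapped in a small script with guard checks. This is only a sketch under the assumptions of this report: the paths, the etcd:etcd owner and the systemd unit name come from the commands above, while the function names `move_data_dir_aside` and `restore_node` are made up for illustration. It is meant to be run as root.

```shell
# Sketch only: wraps the bad-node commands from this issue in two functions.
# Function names are hypothetical; paths and owner are the ones from the report.

# Rename an existing data directory so that `snapshot restore` can create it
# fresh. Does nothing if the directory is already absent.
move_data_dir_aside() {
  local data_dir="$1"
  if [ -e "$data_dir" ]; then
    mv "$data_dir" "${data_dir}.$(date +%s)"
  fi
}

# Restore a snapshot into data_dir, fix ownership, and restart etcd.
restore_node() {
  local snapshot="$1"
  local data_dir="$2"
  # Fail early, before touching the data directory, if the snapshot is missing.
  if [ ! -r "$snapshot" ]; then
    echo "snapshot not readable: $snapshot" >&2
    return 1
  fi
  move_data_dir_aside "$data_dir" || return 1
  ETCDCTL_API=3 etcdctl snapshot restore --data-dir "$data_dir" "$snapshot" || return 1
  # The restored files are created by the invoking user; etcd must own them.
  chown -R etcd:etcd "$data_dir" || return 1
  systemctl restart etcd
}
```

With the values from this report the call would be `restore_node /tmp/snapshot.db /var/lib/etcd/default`, followed by `etcdctl member list` to confirm the node is back.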

Notes:

  • Taking the snapshot was easy; loved it.
  • Restoring the snapshot was hard:
    • The data directory must not exist before the restore (though the error message made that clear)
    • The ETCDCTL_API=3 environment variable shouldn't be necessary; it looks antiquated
    • The --data-dir value was gleaned from looking at a good node
      • Had to clean up after trial-and-erroring that a few times! :-)
    • The resulting files have to be owned by etcd:etcd
      • The error message seen with journalctl was: etcd[241163]: error listing data dir: /var/lib/etcd/default
      • If the error message could be changed to etcd[241163]: cannot access '/var/lib/etcd/default': Permission denied it'd be appreciated. :-)
    • The /etc/default/etcd file contained a number of environment variables that etcd uses
      • Glad I didn't have to restore that; it's different on each node.
      • If the contents of that file could also be included in the snapshot somehow, that'd be nice.
      • Not sure what to do when restoring it.
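
The ownership problem behind the "error listing data dir" message can be checked before starting etcd. A minimal sketch, assuming GNU `stat` (as on Ubuntu) and the etcd:etcd owner from the notes above; the function name `check_data_dir_owner` is hypothetical, and it only inspects the top-level directory, not the files under it.

```shell
# Sketch only: report a wrong owner on the data directory up front instead of
# waiting for etcd to fail. Assumes GNU coreutils `stat -c`.
check_data_dir_owner() {
  local data_dir="$1"
  local expected="${2:-etcd:etcd}"   # owner taken from the report
  local actual
  if [ ! -d "$data_dir" ]; then
    echo "data dir missing: $data_dir" >&2
    return 1
  fi
  # %U:%G prints the owning user and group names of the directory itself.
  actual=$(stat -c '%U:%G' "$data_dir") || return 1
  if [ "$actual" != "$expected" ]; then
    echo "$data_dir is owned by $actual, expected $expected" >&2
    return 1
  fi
}
```

For example, `check_data_dir_owner /var/lib/etcd/default` returns non-zero and prints the actual owner if the `chown` step was skipped.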

Environment: Ubuntu 22.04, etcd 3.2.26

Thanks for a great program!

PenelopeFudd avatar Apr 21 '23 22:04 PenelopeFudd

Thanks for raising this detailed report @PenelopeFudd. Having reviewed your notes, I agree there are some areas where we could expand the recovery documentation, namely making sure people are aware of the directory creation and ownership requirements.

I also like the suggestion of a more meaningful error message for snapshot restore permission issues; that would need to be done in the main etcd repo. However, I note you're using etcd 3.2.26, which is quite old, so we would need to verify whether that error has already been improved in later releases.

jmhbnz avatar Apr 22 '23 06:04 jmhbnz

Thanks for reporting @PenelopeFudd and +1 @jmhbnz. v3.2 is not supported, so it would be great if you could try v3.5 (the version the doc you are using covers) or the main branch.

spzala avatar Apr 22 '23 17:04 spzala