Failed to restore etcd from a snapshot due to resolving peer URL failure
I followed this procedure:
- copy the snapshot file somewhere as etcd-snapshot.db
- scale the StatefulSet to 0
- start a static pod with etcdctl and mount the PVCs used by the etcd members
- execute the restore command:
etcdctl snapshot restore --skip-hash-check etcd-snapshot.db --initial-cluster=apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380,apisix-etcd-1=http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2380,apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 --initial-cluster-token=etcd-cluster-k8s --initial-advertise-peer-urls=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 --name apisix-etcd-2 --data-dir=/opt/nfsdata/apisix-data-apisix-etcd-2-pvc-f8ef09a4-e8f2-404d-8d14-63b905e324be/data
But the command fails: the pods have been shut down, so the pod domains no longer exist, and I get these errors:
{"level":"warn","ts":1663053636.7199378,"caller":"netutil/netutil.go:121","msg":"failed to resolve URL Host","url":"http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380","host":"apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380","retry-interval":1,"error":"lookup apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local on 192.168.0.2:53: no such host"}
If I restore without extra options:
etcdctl snapshot restore --skip-hash-check etcd-snapshot.db --data-dir=/opt/nfsdata/apisix-data-apisix-etcd-2-pvc-f8ef09a4-e8f2-404d-8d14-63b905e324be/data
Everything is OK except that the node starts as a single-node cluster; etcdctl member list only shows itself :(
So, how should I restore etcd deployed in Kubernetes? Thank you in advance.
etcd Version: 3.4.16 Git SHA: d19fbe541 Go Version: go1.12.17 Go OS/Arch: linux/amd64
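In Kubernetes terms, the flow described above boils down to something like the sketch below. This is only a sketch; the StatefulSet name "apisix-etcd" and namespace "apisix" are assumptions inferred from the pod and service names in the report.
# Assumptions: StatefulSet "apisix-etcd" in namespace "apisix".
kubectl -n apisix scale statefulset apisix-etcd --replicas=0
# ...run the etcdctl snapshot restore command shown above from a helper pod
#    that mounts each member's PVC, once per member with its own --name and
#    --initial-advertise-peer-urls...
kubectl -n apisix scale statefulset apisix-etcd --replicas=3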
Thanks @xzycn for raising this ticket. It looks like an issue to me.
The error is coming from VerifyBootstrap. Specifically, it's coming from netutil.URLStringsEqual. When etcdctl/etcdutl tries to verify whether --initial-advertise-peer-urls matches its URL included in --initial-cluster, it may need to resolve the TCP address.
The proposed fix is to add a flag, something like "--ignore-bootstrap-verify", for etcdutl to bypass the check when restoring a snapshot.
It should be an easy fix. Please anyone feel free to deliver a PR for this, and we can have more discussion under the PR.
The error is coming from VerifyBootstrap. Specifically, it's coming from netutil.URLStringsEqual. When etcdctl/etcdutl tries to verify whether --initial-advertise-peer-urls matches its URL included in --initial-cluster, it may need to resolve the TCP address.
The proposed fix is to add a flag, something like "--ignore-bootstrap-verify", for etcdutl to bypass the check when restoring a snapshot.
Why do we need to bypass during restore? Is there any specific reason why the URL resolution will fail in this case?
It should be an easy fix. Please anyone feel free to deliver a PR for this, and we can have more discussion under the PR.
I would like to work on this. To replicate this, will the etcd commands in [1] alone be enough, or are there other circumstances that cause the URL resolution to fail?
[1] https://etcd.io/docs/v3.5/op-guide/recovery/
Why do we need to bypass during restore? Is there any specific reason why the URL resolution will fail in this case?
Because the etcd pod isn't running when restoring from the snapshot, a URL such as apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local can't be resolved. Please refer to the reporter's description above.
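One way to see this is that the per-pod DNS records of the headless service disappear once the pods are gone. A minimal check (the StatefulSet name and namespace are assumptions based on the pod names above):
# Assumption: StatefulSet "apisix-etcd" in namespace "apisix".
kubectl -n apisix scale statefulset apisix-etcd --replicas=0

# The headless-service pod record no longer resolves, which is exactly
# what etcdctl/etcdutl runs into during the restore:
kubectl -n apisix run dns-check --rm -it --restart=Never --image=busybox -- \
  nslookup apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local
# expected: NXDOMAIN / "can't resolve"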
I would like to work on this. To replicate this, will the etcd commands in [1] alone be enough, or are there other circumstances that cause the URL resolution to fail?
Please feel free to deliver a PR. Please follow https://github.com/etcd-io/etcd/issues/14456#issuecomment-1248645358 to reproduce and fix this issue. I think the command alone should be enough to reproduce and fix it, but eventually we need to verify the real scenario raised by the reporter (@xzycn).
@ahrtr This command comes from the helm chart https://github.com/apache/apisix-helm-chart/tree/master/charts/apisix/charts; etcd is a subchart of the chart called apisix.
My comment below is with respect to 3.5.*.
I did hit the issue. Here is a workaround for the time being that doesn't touch the chart.
Assume you have a snapshot and the etcd cluster is down.
Steps:
1. Bring up the etcd cluster (etcd1, etcd2, etcd3). Let's say the data dirs are /tmp/etcd1/data, /tmp/etcd2/data, /tmp/etcd3/data respectively. (If corrupted, back up the data directories and start afresh.)
2. Run the restore command with a new data directory (--data-dir /tmp/etcd{1,2,3}/data.backup in the restore command), say /tmp/etcd1/data.backup, /tmp/etcd2/data.backup, /tmp/etcd3/data.backup.
3. Bring down the etcd cluster (if it is Kubernetes, scale the replicas from 3 to 0).
4. mv /tmp/etcd1/data /tmp/etcd1/data.prev; mv /tmp/etcd2/data /tmp/etcd2/data.prev; mv /tmp/etcd3/data /tmp/etcd3/data.prev
5. mv /tmp/etcd1/data.backup /tmp/etcd1/data; mv /tmp/etcd2/data.backup /tmp/etcd2/data; mv /tmp/etcd3/data.backup /tmp/etcd3/data
6. Bring up the etcd cluster (replicas from 0 to 3).
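As a shell sketch of the same sequence (paths as in the steps above; the peer URLs are placeholders, the flag set follows the recovery docs referenced below, and on Kubernetes steps 1, 3 and 6 would be kubectl scale on the StatefulSet):
# Steps 1-2: with the members running (so their peer hostnames resolve),
# restore the snapshot for each member into a side directory, e.g. etcd1:
etcdutl snapshot restore etcd-snapshot.db \
  --name etcd1 \
  --initial-cluster etcd1=http://ETCD1_PEER:2380,etcd2=http://ETCD2_PEER:2380,etcd3=http://ETCD3_PEER:2380 \
  --initial-advertise-peer-urls http://ETCD1_PEER:2380 \
  --data-dir /tmp/etcd1/data.backup
# (repeat for etcd2 and etcd3 with their own --name/--initial-advertise-peer-urls)

# Step 3: bring the cluster down (on Kubernetes: scale replicas 3 -> 0).

# Steps 4-5: swap the restored directories into place.
for i in 1 2 3; do
  mv /tmp/etcd$i/data        /tmp/etcd$i/data.prev
  mv /tmp/etcd$i/data.backup /tmp/etcd$i/data
done

# Step 6: bring the cluster back up (replicas 0 -> 3).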
@hasethuraman In your steps, does the restore command use only the "--data-dir" option? If so, won't each member start as a single node? If not, using an option (e.g. --initial-cluster) with a domain will cause the problem described in the title.
Correct. The restore command arguments I tried are the same as in https://etcd.io/docs/v3.3/op-guide/recovery/#restoring-a-cluster
I am having trouble replicating the problem. I created 2 etcd members (static configuration) in a cluster, with a command line similar to @xzycn's. When I restore, it seems to work fine without producing the problematic log message. Note that I used etcd version 3.5.5 and etcdutl (rather than etcdctl, which is deprecated). I have given the command line below. It could be because I am using IP addresses rather than hostnames. Also, I noticed that the message is a warning and not fatal; does it prevent etcd from completing?
etcd Version: 3.5.5
Create cluster (2 such instances)
/tmp/etcd-download-test/etcd --name etcd1 --initial-advertise-peer-urls http://10.160.0.9:2380 \
  --listen-peer-urls http://10.160.0.9:2380 \
  --listen-client-urls http://10.160.0.9:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://10.160.0.9:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster etcd1=http://10.160.0.9:2380,etcd2=http://10.160.0.10:2380 \
  --initial-cluster-state new \
  --data-dir /home/prasadc/etcd_data
Create snapshot
etcdctl snapshot save hello.db
Restore from snapshot
cat ./restore.sh
/tmp/etcd-download-test/etcdutl snapshot restore --skip-hash-check hello.db --initial-cluster etcd1=http://10.160.0.9:2380,etcd2=http://10.160.0.10:2380 --initial-cluster-token etcd-cluster-1 --initial-advertise-peer-urls http://10.160.0.9:2380 --name etcd1 --data-dir /home/prasadc/etcd_data_restore
./restore.sh
2022-09-21T10:31:04Z info snapshot/v3_snapshot.go:248 restoring snapshot {"path": "hello.db", "wal-dir": "/home/prasadc/etcd_data_restore/member/wal", "data-dir": "/home/prasadc/etcd_data_restore", "snap-dir": "/home/prasadc/etcd_data_restore/member/snap", "stack": "go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/snapshot/v3_snapshot.go:254\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/etcdutl/snapshot_command.go:147\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.snapshotRestoreCommandFunc\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/etcdutl/snapshot_command.go:117\ngithub.com/spf13/cobra.(*Command).execute\n\t/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/github.com/spf13/[email protected]/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/github.com/spf13/[email protected]/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\t/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/github.com/spf13/[email protected]/command.go:897\nmain.Start\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/ctl.go:50\nmain.main\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/main.go:23\nruntime.main\n\t/usr/local/google/home/siarkowicz/.gvm/gos/go1.16.15/src/runtime/proc.go:225"}
2022-09-21T10:31:04Z info membership/store.go:141 Trimming membership information from the backend...
2022-09-21T10:31:04Z info membership/cluster.go:421 added member {"cluster-id": "5856bec5b20bce76", "local-member-id": "0", "added-peer-id": "7012f0c6b3126ac4", "added-peer-peer-urls": ["http://10.160.0.10:2380"]}
2022-09-21T10:31:04Z info membership/cluster.go:421 added member {"cluster-id": "5856bec5b20bce76", "local-member-id": "0", "added-peer-id": "f3655e2d7dd93afe", "added-peer-peer-urls": ["http://10.160.0.9:2380"]}
2022-09-21T10:31:05Z info snapshot/v3_snapshot.go:269 restored snapshot {"path": "hello.db", "wal-dir": "/home/prasadc/etcd_data_restore/member/wal", "data-dir": "/home/prasadc/etcd_data_restore", "snap-dir": "/home/prasadc/etcd_data_restore/member/snap"}
You need to reproduce this issue using an unresolvable URL, such as http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380
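For example, reusing the flags from the original report on a machine where those cluster-internal names cannot resolve should reproduce the warning on 3.4/3.5 (a sketch; the data directory is a placeholder):
etcdutl snapshot restore --skip-hash-check etcd-snapshot.db \
  --name apisix-etcd-2 \
  --initial-cluster apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380,apisix-etcd-1=http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2380,apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 \
  --initial-cluster-token etcd-cluster-k8s \
  --initial-advertise-peer-urls http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 \
  --data-dir /tmp/etcd_data_restore_test
# expected on 3.4/3.5: the "failed to resolve URL Host" warning shown in the report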
@ahrtr I want to work on this issue. Could you please assign it to me?
Thanks @sanjeev98kumar
@pchan are you still working on this issue?
Yes, I will implement the following part. I expect to have a PR or an update soon.
The proposed fix is to add a flag something like "--ignore-bootstrap-verify" for etcdutl to bypass the issue for the case of restoring snapshot.
Thanks @pchan for the update.
@sanjeev98kumar Please find something else to work on. FYI. find-something-to-work-on
@ahrtr I have created a PR (#14546) that attempts to fix this by adding a flag. Can you please review and add reviewers? I wasn't able to follow everything in the contributing guide. It passes make test-unit. I am looking for feedback and will update the PR with the rest of the steps specified in the contributing guide.
I just realized that the main branch doesn't actually have this issue; it can only be reproduced on 3.5 and 3.4. The issue has already been resolved in https://github.com/etcd-io/etcd/commit/b272b98b79a392ac269d5c1577e15b655844ce1a on the main branch. Please backport the commit to both 3.5 and 3.4. Thanks.
The original PR is https://github.com/etcd-io/etcd/pull/13224
I have created a cherry-pick of #13224 in #14573 for release-3.5. What are the next steps? Should I run tests on 3.5?
Why do we need to bypass during restore? Is there any specific reason why the URL resolution will fail in this case?
Because the etcd pod isn't running when restoring from the snapshot, a URL such as apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local can't be resolved. Please refer to the reporter's description above.
The fix that is backported front-loads the URL comparison between the advertise peer URLs (--initial-advertise-peer-urls) and the initial cluster (--initial-cluster) so that resolution is not attempted when the strings already match. So if a user gives different URLs that resolve to the same IP address, the issue will still manifest, and the only way to prevent that is to use the flag. I checked the reporter's description and the backport should be enough.
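To make that corner case concrete, a hedged sketch (member name, hostname and IP are made up): with the backport, the first invocation passes the bootstrap check without any DNS lookup because the two flags spell the peer URL identically, while the second still needs to resolve the name because the strings differ.
# Strings match exactly -> no resolution needed after the backport.
etcdutl snapshot restore etcd-snapshot.db \
  --name member-2 \
  --initial-cluster member-2=http://member-2.etcd-headless.ns.svc.cluster.local:2380 \
  --initial-advertise-peer-urls http://member-2.etcd-headless.ns.svc.cluster.local:2380 \
  --data-dir /tmp/restore-a

# Strings differ (hostname vs IP) -> the comparison falls back to resolving,
# so an unresolvable hostname still triggers the warning.
etcdutl snapshot restore etcd-snapshot.db \
  --name member-2 \
  --initial-cluster member-2=http://member-2.etcd-headless.ns.svc.cluster.local:2380 \
  --initial-advertise-peer-urls http://10.0.0.12:2380 \
  --data-dir /tmp/restore-b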
I have created a cherry-pick of #13224 in #14573 for release-3.5. What are the next steps? Should I run tests on 3.5?
Could you double check whether 3.4 has this issue and backport it to release-3.4 as well if needed? Thx.
Resolved in https://github.com/etcd-io/etcd/pull/14577 and https://github.com/etcd-io/etcd/pull/14573
The fix will be included in 3.5.6 and 3.4.22.
@pchan please add a changelog item for both 3.4 and 3.5. FYI. https://github.com/etcd-io/etcd/pull/14573#issuecomment-1274562134