
Failed to restore etcd from a snapshot due to resolving peer URL failure

Open xzycn opened this issue 3 years ago • 9 comments

I performed the following steps:

  1. Copy a snapshot file to some location, named etcd-snapshot.db.
  2. Scale the StatefulSet to 0.
  3. Start a static pod with etcdctl and mount the PVCs used by the etcd members.
  4. Execute the restore command:
etcdctl snapshot restore --skip-hash-check etcd-snapshot.db --initial-cluster=apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380,apisix-etcd-1=http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2380,apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 --initial-cluster-token=etcd-cluster-k8s --initial-advertise-peer-urls=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 --name apisix-etcd-2  --data-dir=/opt/nfsdata/apisix-data-apisix-etcd-2-pvc-f8ef09a4-e8f2-404d-8d14-63b905e324be/data

But the command has a problem: the pods have been shut down, so the pod DNS names no longer exist, and I get these errors:

{"level":"warn","ts":1663053636.7199378,"caller":"netutil/netutil.go:121","msg":"failed to resolve URL Host","url":"http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380","host":"apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380","retry-interval":1,"error":"lookup apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local on 192.168.0.2:53: no such host"}

If I restore without extra options:

etcdctl snapshot restore --skip-hash-check etcd-snapshot.db --data-dir=/opt/nfsdata/apisix-data-apisix-etcd-2-pvc-f8ef09a4-e8f2-404d-8d14-63b905e324be/data

Everything is OK, except that the node starts as a single-node cluster; etcdctl member list only shows itself :(

So, how should I restore etcd deployed in Kubernetes? Thank you in advance.

etcd Version: 3.4.16 Git SHA: d19fbe541 Go Version: go1.12.17 Go OS/Arch: linux/amd64

xzycn avatar Sep 13 '22 10:09 xzycn

Thanks @xzycn for raising this ticket. It looks like an issue to me.

The error is coming from VerifyBootstrap; specifically, it's coming from netutil.URLStringsEqual. When etcdctl/etcdutl tries to verify whether --initial-advertise-peer-urls matches its URL included in --initial-cluster, it may need to resolve the TCP addresses.
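To make the failure mode concrete, here is a minimal Go sketch of that check under my own simplified names (not the actual VerifyBootstrap/netutil code): comparing two peer URLs by the addresses they point at requires resolving their host names first, so the whole check errors out when those names do not resolve.

package main

import (
    "context"
    "fmt"
    "net"
    "net/url"
    "time"
)

// resolveHostPort resolves the host part of a peer URL to "ip:port" so that
// two URLs can be compared by the addresses they point at.
func resolveHostPort(ctx context.Context, rawURL string) (string, error) {
    u, err := url.Parse(rawURL)
    if err != nil {
        return "", err
    }
    addrs, err := net.DefaultResolver.LookupHost(ctx, u.Hostname())
    if err != nil {
        // This is the failure mode from the report: the headless-service DNS
        // record does not exist because the pod is not running.
        return "", err
    }
    return net.JoinHostPort(addrs[0], u.Port()), nil
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()

    advertised := "http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380"
    inCluster := "http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380"

    a, errA := resolveHostPort(ctx, advertised)
    b, errB := resolveHostPort(ctx, inCluster)
    if errA != nil || errB != nil {
        fmt.Println("failed to resolve URL host:", errA, errB)
        return
    }
    fmt.Println("peer URLs equivalent:", a == b)
}

With the pods down, both lookups fail even though the two URL strings are identical, which matches the reporter's error.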

The proposed fix is to add a flag, something like "--ignore-bootstrap-verify", for etcdutl to bypass the check when restoring from a snapshot.

It should be an easy fix. Anyone should feel free to deliver a PR for this, and we can have more discussion under the PR.

ahrtr avatar Sep 15 '22 21:09 ahrtr

The error is coming from VerifyBootstrap; specifically, it's coming from netutil.URLStringsEqual. When etcdctl/etcdutl tries to verify whether --initial-advertise-peer-urls matches its URL included in --initial-cluster, it may need to resolve the TCP addresses.

The proposed fix is to add a flag, something like "--ignore-bootstrap-verify", for etcdutl to bypass the check when restoring from a snapshot.

Why do we need to bypass the check during restore? Is there any specific reason why the URL resolution will fail in this case?

It should be an easy fix. Anyone should feel free to deliver a PR for this, and we can have more discussion under the PR.

I would like to work on this. To replicate it, will the etcd commands [1] alone be enough, or are there other circumstances that cause the URL resolution to fail?

[1] https://etcd.io/docs/v3.5/op-guide/recovery/

pchan avatar Sep 16 '22 12:09 pchan

Why do we need to bypass the check during restore? Is there any specific reason why the URL resolution will fail in this case?

Because the etcd pod isn't running when restoring from the snapshot, a URL such as apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local can't be resolved. Please refer to the reporter's description above.
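As a small illustration (a sketch, not etcd code), a plain DNS lookup of such a headless-service pod name fails the same way while the StatefulSet is scaled to 0:

package main

import (
    "fmt"
    "net"
)

func main() {
    // Hostname taken from the report above; with the StatefulSet scaled to 0,
    // the headless service publishes no record for this pod.
    host := "apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local"
    if _, err := net.LookupHost(host); err != nil {
        fmt.Println("lookup failed:", err) // e.g. "... no such host"
    }
}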

I would like to work on this. To replicate it, will the etcd commands [1] alone be enough, or are there other circumstances that cause the URL resolution to fail?

Please feel free to deliver a PR. Please follow https://github.com/etcd-io/etcd/issues/14456#issuecomment-1248645358 to reproduce and fix this issue. I think the commands alone should be enough to reproduce and fix it, but eventually we need to verify the real scenario raised by the reporter (@xzycn).

ahrtr avatar Sep 16 '22 22:09 ahrtr

@ahrtr This command comes from the Helm chart https://github.com/apache/apisix-helm-chart/tree/master/charts/apisix/charts; etcd is a subchart of the chart called apisix.

xzycn avatar Sep 17 '22 07:09 xzycn

My comment below is with respect to 3.5.*.

I did hit the issue. Here is a workaround for the time being that doesn't require touching the chart.

Assume you have a snapshot and the etcd cluster is down.

Steps:

  1. Bring up the etcd cluster (etcd1, etcd2, etcd3). Let's say the data directories are /tmp/etcd1/data, /tmp/etcd2/data, /tmp/etcd3/data respectively. (If they are corrupted, back up the data directories and start afresh.)
  2. Run the restore command against a new data directory (--data-dir /tmp/etcd{1,2,3}/data.backup in the restore command), say /tmp/etcd1/data.backup, /tmp/etcd2/data.backup, /tmp/etcd3/data.backup.
  3. Bring down the etcd cluster (for example on Kubernetes, scale the replicas from 3 to 0).
  4. mv /tmp/etcd1/data /tmp/etcd1/data.prev; mv /tmp/etcd2/data /tmp/etcd2/data.prev; mv /tmp/etcd3/data /tmp/etcd3/data.prev
  5. mv /tmp/etcd1/data.backup /tmp/etcd1/data; mv /tmp/etcd2/data.backup /tmp/etcd2/data; mv /tmp/etcd3/data.backup /tmp/etcd3/data
  6. Bring up the etcd cluster (replicas from 0 to 3).

hasethuraman avatar Sep 19 '22 12:09 hasethuraman

@hasethuraman In your steps, is the restore command run with only the --data-dir option? If so, won't each member start as a single node? If not, using an option (e.g. --initial-cluster) with a domain will cause the problem described in the title.

xzycn avatar Sep 19 '22 15:09 xzycn

@hasethuraman In your steps, is the restore command run with only the --data-dir option? If so, won't each member start as a single node? If not, using an option (e.g. --initial-cluster) with a domain will cause the problem described in the title.

Correct. The restore command arguments I tried are the same as in https://etcd.io/docs/v3.3/op-guide/recovery/#restoring-a-cluster

hasethuraman avatar Sep 20 '22 05:09 hasethuraman

I am having trouble replicating the problem. I created 2 etcd members (static configuration) in a cluster, with a command line similar to @xzycn's. When I restore, it seems to work fine without producing the problematic log message. Note that I used etcd version 3.5.5 and etcdutl (rather than etcdctl, which is deprecated for this operation). I have given the command line below. It could be because I am using IP addresses rather than hostnames. Also, I noticed that the message is a warning and not fatal; does it prevent etcd from completing?

etcd Version: 3.5.5

Create cluster (2 such instances)

/tmp/etcd-download-test/etcd --name etcd1 --initial-advertise-peer-urls http://10.160.0.9:2380 \
  --listen-peer-urls http://10.160.0.9:2380 \
  --listen-client-urls http://10.160.0.9:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://10.160.0.9:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster etcd1=http://10.160.0.9:2380,etcd2=http://10.160.0.10:2380 \
  --initial-cluster-state new \
  --data-dir /home/prasadc/etcd_data

Create snapshot

etcdctl snapshot save hello.db

restore from snapshot

cat ./restore.sh

/tmp/etcd-download-test/etcdutl snapshot restore --skip-hash-check hello.db --initial-cluster etcd1=http://10.160.0.9:2380,etcd2=http://10.160.0.10:2380 --initial-cluster-token etcd-cluster-1 --initial-advertise-peer-urls http://10.160.0.9:2380 --name etcd1  --data-dir /home/prasadc/etcd_data_restore

./restore.sh
2022-09-21T10:31:04Z    info    snapshot/v3_snapshot.go:248     restoring snapshot      {"path": "hello.db", "wal-dir": "/home/prasadc/etcd_data_restore/member/wal", "data-dir": "/home/prasadc/etcd_data_restore", "snap-dir": "/home/prasadc/etcd_data_restore/member/snap", "stack": "go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/snapshot/v3_snapshot.go:254\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/etcdutl/snapshot_command.go:147\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.snapshotRestoreCommandFunc\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/etcdutl/snapshot_command.go:117\ngithub.com/spf13/cobra.(*Command).execute\n\t/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/github.com/spf13/[email protected]/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/github.com/spf13/[email protected]/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\t/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/github.com/spf13/[email protected]/command.go:897\nmain.Start\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/ctl.go:50\nmain.main\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/main.go:23\nruntime.main\n\t/usr/local/google/home/siarkowicz/.gvm/gos/go1.16.15/src/runtime/proc.go:225"}
2022-09-21T10:31:04Z    info    membership/store.go:141 Trimming membership information from the backend...
2022-09-21T10:31:04Z    info    membership/cluster.go:421       added member    {"cluster-id": "5856bec5b20bce76", "local-member-id": "0", "added-peer-id": "7012f0c6b3126ac4", "added-peer-peer-urls": ["http://10.160.0.10:2380"]}
2022-09-21T10:31:04Z    info    membership/cluster.go:421       added member    {"cluster-id": "5856bec5b20bce76", "local-member-id": "0", "added-peer-id": "f3655e2d7dd93afe", "added-peer-peer-urls": ["http://10.160.0.9:2380"]}
2022-09-21T10:31:05Z    info    snapshot/v3_snapshot.go:269     restored snapshot       {"path": "hello.db", "wal-dir": "/home/prasadc/etcd_data_restore/member/wal", "data-dir": "/home/prasadc/etcd_data_restore", "snap-dir": "/home/prasadc/etcd_data_restore/member/snap"}

pchan avatar Sep 21 '22 10:09 pchan

You need to reproduce this issue using an unresolvable URL such as http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380

ahrtr avatar Sep 22 '22 08:09 ahrtr

@ahrtr I want to work on this issue. Could you please assign it to me?

sanjeev98kumar avatar Sep 25 '22 18:09 sanjeev98kumar

Thanks @sanjeev98kumar

@pchan are you still working on this issue?

ahrtr avatar Sep 25 '22 20:09 ahrtr

@pchan are you still working on this issue?

Yes, I will implement the following part. I expect to have a PR or an update soon.

The proposed fix is to add a flag, something like "--ignore-bootstrap-verify", for etcdutl to bypass the check when restoring from a snapshot.

pchan avatar Sep 26 '22 01:09 pchan

Thanks @pchan for the update.

@sanjeev98kumar Please find something else to work on. FYI. find-something-to-work-on

ahrtr avatar Sep 26 '22 02:09 ahrtr

@ahrtr I have created a PR (#14546) that attempts to fix this by adding a flag. Can you please review it and add reviewers? I wasn't able to follow everything in the contributing guide. It passes make test-unit. I am looking for feedback and will update the PR with the rest of the steps specified in the contributing guide.

pchan avatar Oct 03 '22 12:10 pchan

I just realized that the main branch doesn't actually have this issue; it can only be reproduced on 3.5 and 3.4. The issue has already been resolved in https://github.com/etcd-io/etcd/commit/b272b98b79a392ac269d5c1577e15b655844ce1a on the main branch. Please backport the commit to both 3.5 and 3.4. Thanks.

ahrtr avatar Oct 03 '22 23:10 ahrtr

The original PR is https://github.com/etcd-io/etcd/pull/13224

ahrtr avatar Oct 03 '22 23:10 ahrtr

I have created a cherry-pick of #13224 in #14573 for release 3.5. What are the next steps? Should I run the tests on 3.5?

pchan avatar Oct 11 '22 11:10 pchan

Why do we need to bypass the check during restore? Is there any specific reason why the URL resolution will fail in this case?

Because the etcd pod isn't running when restoring from the snapshot, a URL such as apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local can't be resolved. Please refer to the reporter's description above.

The fix that is backported front-loads the URL comparison between the advertised peer URLs (--initial-advertise-peer-urls) and the initial cluster (--initial-cluster) so that DNS resolution is not attempted when the URL strings already match. So if a user gives different URL strings that resolve to the same IP address, the issue will still manifest, and the only way to prevent that is to use the flag. I checked the reporter's description, and the backport should be enough.
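For illustration, a minimal sketch of that front-loaded comparison (my own simplified function, not the backported netutil code): the two URL lists are compared as plain strings first, and DNS resolution would only be needed when the strings differ.

package main

import (
    "fmt"
    "reflect"
    "sort"
)

// urlStringsEqual sketches the backported behaviour: if the two lists are
// identical as strings, report equality without any DNS resolution.
func urlStringsEqual(a, b []string) bool {
    if len(a) != len(b) {
        return false
    }
    sa := append([]string(nil), a...)
    sb := append([]string(nil), b...)
    sort.Strings(sa)
    sort.Strings(sb)
    if reflect.DeepEqual(sa, sb) {
        return true // fast path: no lookup of unresolvable pod DNS names
    }
    // Fallback omitted: the real code would resolve the hosts and compare
    // addresses, which is where different-but-equivalent URLs would still
    // fail while the cluster is offline; we conservatively report false here.
    return false
}

func main() {
    peer := []string{"http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380"}
    fmt.Println(urlStringsEqual(peer, peer)) // true, even with no DNS available
}

So a restore where --initial-advertise-peer-urls is copied verbatim from the member's entry in --initial-cluster no longer needs any DNS lookups.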

pchan avatar Oct 11 '22 11:10 pchan

I have created a cherry-pick of #13224 in #14573 for release 3.5. What are the next steps? Should I run the tests on 3.5?

Could you double check whether 3.4 has this issue and backport it to release-3.4 as well if needed? Thanks.

ahrtr avatar Oct 11 '22 11:10 ahrtr

Resolved in https://github.com/etcd-io/etcd/pull/14577 and https://github.com/etcd-io/etcd/pull/14573

The fix will be included in 3.5.6 and 3.4.22.

@pchan please add a changelog item for both 3.4 and 3.5. FYI. https://github.com/etcd-io/etcd/pull/14573#issuecomment-1274562134

ahrtr avatar Oct 12 '22 10:10 ahrtr