
Duplicate names in `ETCD_INITIAL_CLUSTER` not handled correctly

Open mortehu opened this issue 3 years ago • 11 comments

What happened?

If you don't pass a --name argument to your etcd processes, they will all have the name default and the cluster will operate normally. However, when you add a member, the generated ETCD_INITIAL_CLUSTER variable will have multiple entries with the name "default". When this environment variable is used, etcd will parse these into a mapping under a single key ("default") with multiple URLs, and create a single member. See https://github.com/etcd-io/etcd/blob/63a1cc3fe40bace6898289dec35a9aad05163889/server/etcdserver/api/membership/cluster.go#L83-L86
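To illustrate the collapse, here is a minimal, self-contained sketch (not the actual etcd code; parseInitialCluster is a hypothetical stand-in for the real parser) showing that a name=URL list with duplicate names produces a map with fewer entries than members:

package main

import (
	"fmt"
	"strings"
)

// parseInitialCluster mimics how a "name=URL,name=URL" string is turned into
// a map keyed by member name. Because the key is the name, duplicate names
// collapse into a single entry that accumulates all of the URLs.
func parseInitialCluster(s string) map[string][]string {
	cluster := map[string][]string{}
	for _, pair := range strings.Split(s, ",") {
		name, url, ok := strings.Cut(pair, "=")
		if !ok {
			continue // ignore malformed entries in this sketch
		}
		cluster[name] = append(cluster[name], url)
	}
	return cluster
}

func main() {
	// The value generated by "member add" when both existing members are named "default".
	s := "default=http://127.0.0.1:40000,c=http://127.0.0.1:40002,default=http://127.0.0.1:40001"
	// Prints 2 entries ("default" with two URLs, and "c"), not 3 members.
	fmt.Println(parseInitialCluster(s))
}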

This leads to the confusing error message "member count is unequal". The documentation at https://etcd.io/docs/v3.5/op-guide/runtime-configuration/ mentions this error, but it describes a different scenario.

What did you expect to happen?

Either

a. member add should fail, saying it cannot generate a valid ETCD_INITIAL_CLUSTER due to duplicate names (a sketch of such a check follows at the end of this section), or
b. etcd should accept duplicate names in ETCD_INITIAL_CLUSTER and treat them as separate members. This can be accomplished by updating func NewClusterFromURLsMap as follows:

	c := NewCluster(lg, opts...)
	for name, urls := range urlsmap {
		for idx := range urls {
			m := NewMember(name, urls[idx:idx+1], token, nil)
			[...]

I don't know if there's a real need to be able to specify multiple URLs for a single member.
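For option (a), the check could be as simple as the following sketch; hasDuplicateNames and the idea of calling it before member add prints the generated ETCD_INITIAL_CLUSTER are assumptions for illustration, not existing etcdctl code:

// hasDuplicateNames reports whether a generated initial-cluster string of the
// form "name=URL,name=URL" lists the same member name more than once, which
// would later be folded into a single member when the string is parsed.
func hasDuplicateNames(initialCluster string) bool {
	seen := map[string]bool{}
	for _, pair := range strings.Split(initialCluster, ",") {
		name, _, ok := strings.Cut(pair, "=")
		if !ok {
			continue
		}
		if seen[name] {
			return true
		}
		seen[name] = true
	}
	return false
}

member add could then refuse to print the environment block (or at least warn) when this returns true.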

How can we reproduce it (as minimally and precisely as possible)?

You need three terminals, x, y, and z:

x$ mkdir -p test_case/{a,b,c}/{data/member,wal}
x$ ETCD_INITIAL_CLUSTER="a=http://127.0.0.1:40000,b=http://127.0.0.1:40001" ETCD_INITIAL_CLUSTER_STATE=new etcd --name a --{initial-advertise,listen}-peer-urls=http://127.0.0.1:40000 --{advertise,listen}-client-urls=http://127.0.0.1:50000 --data-dir test_case/a/data --wal-dir test_case/a/wal
y$ ETCD_INITIAL_CLUSTER="a=http://127.0.0.1:40000,b=http://127.0.0.1:40001" ETCD_INITIAL_CLUSTER_STATE=new etcd --name b --{initial-advertise,listen}-peer-urls=http://127.0.0.1:40001 --{advertise,listen}-client-urls=http://127.0.0.1:50001 --data-dir test_case/b/data --wal-dir test_case/b/wal
[now kill both servers with Ctrl-C]
x$ etcd --listen-peer-urls=http://127.0.0.1:40000 --{advertise,listen}-client-urls=http://127.0.0.1:50000 --data-dir test_case/a/data --wal-dir test_case/a/wal
y$ etcd --listen-peer-urls=http://127.0.0.1:40001 --{advertise,listen}-client-urls=http://127.0.0.1:50001 --data-dir test_case/b/data --wal-dir test_case/b/wal
z$ ETCDCTL_ENDPOINT=http://localhost:50000 etcdctl member add c http://127.0.0.1:40002
Added member named c with ID 7b4d6e3edb76bc59 to cluster

ETCD_NAME="c"
ETCD_INITIAL_CLUSTER="default=http://127.0.0.1:40000,c=http://127.0.0.1:40002,default=http://127.0.0.1:40001"
ETCD_INITIAL_CLUSTER_STATE="existing"
z$ export ETCD_NAME="c"
z$ export ETCD_INITIAL_CLUSTER="default=http://127.0.0.1:40000,c=http://127.0.0.1:40002,default=http://127.0.0.1:40001"
z$ export ETCD_INITIAL_CLUSTER_STATE="existing"
z$ etcd --listen-peer-urls=http://127.0.0.1:40002 --{advertise,listen}-client-urls=http://127.0.0.1:50002 --data-dir test_case/c/data --wal-dir test_case/c/wal
[...]
member count is unequal

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
# paste output here

$ etcdctl version
# paste output here

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

etcd 3.5.2

Relevant log output

No response

mortehu avatar Mar 03 '22 02:03 mortehu

A member can have multiple client or peer URLs. So in this case, you must specify the flag --name. But I agree that we should add a warning if the flag --name isn't present. Feel free to submit a PR for this. Thanks.

ahrtr avatar Mar 03 '22 06:03 ahrtr

@ahrtr Would it be okay if I work on this?

Divya063 avatar Apr 11 '22 11:04 Divya063

@Divya063 Definitely yes. Thank you!

ahrtr avatar Apr 11 '22 11:04 ahrtr

Hey @mortehu I was trying to reproduce the issue with the commands you gave. First of all, I think the command for the z terminal is wrong -> z$ ETCDCTL_ENDPOINT=http://localhost:50000 etcdctl member add c http://127.0.0.1:40002. It gave me the error: Error: too many arguments, did you mean --peer-urls=http://127.0.0.1:40002

After that I ran this command: ETCDCTL_ENDPOINT=http://localhost:50000 etcdctl member add c --peer-urls=http://127.0.0.1:40002 and the output was as follows.

{"level":"warn","ts":"2022-04-12T00:20:06.341-0700","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCDCTL_ENDPOINT=http://localhost:50000"}
Member f6f1fd0cdb6d6ac0 added to cluster cdf818194e3a8c32

ETCD_NAME="c"
ETCD_INITIAL_CLUSTER="default=http://localhost:2380,c=http://127.0.0.1:40002"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://127.0.0.1:40002"
ETCD_INITIAL_CLUSTER_STATE="existing"

After adding the member, I exported the required variables and executed the etcd listen command: etcd --listen-peer-urls=http://127.0.0.1:40002 --{advertise,listen}-client-urls=http://127.0.0.1:50002 --data-dir test_case/c/data --wal-dir test_case/c/wal, but I didn't get the member count is unequal error.

Instead, the error was:

{"level":"info","ts":"2022-04-12T00:21:32.887-0700","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER","variable-value":"default=http://127.0.0.1:40000,c=http://127.0.0.1:40002,default=http://127.0.0.1:40001"}
{"level":"info","ts":"2022-04-12T00:21:32.887-0700","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_STATE","variable-value":"existing"}
{"level":"info","ts":"2022-04-12T00:21:32.887-0700","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_NAME","variable-value":"c"}
{"level":"info","ts":"2022-04-12T00:21:32.887-0700","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["etcd","--listen-peer-urls=http://127.0.0.1:40002","--advertise-client-urls=http://127.0.0.1:50002","--listen-client-urls=http://127.0.0.1:50002","--data-dir","test_case/c/data","--wal-dir","test_case/c/wal"]}
{"level":"info","ts":"2022-04-12T00:21:32.887-0700","caller":"etcdmain/etcd.go:116","msg":"server has already been initialized","data-dir":"test_case/c/data","dir-type":"member"}
{"level":"info","ts":"2022-04-12T00:21:32.887-0700","caller":"embed/etcd.go:121","msg":"configuring peer listeners","listen-peer-urls":["http://127.0.0.1:40002"]}
{"level":"info","ts":"2022-04-12T00:21:32.888-0700","caller":"embed/etcd.go:129","msg":"configuring client listeners","listen-client-urls":["http://127.0.0.1:50002"]}
{"level":"info","ts":"2022-04-12T00:21:32.888-0700","caller":"embed/etcd.go:307","msg":"starting an etcd server","etcd-version":"3.6.0-alpha.0","git-sha":"7d3ca1f51","go-version":"go1.18","go-os":"linux","go-arch":"amd64","max-cpu-set":12,"max-cpu-available":12,"member-initialized":false,"name":"c","data-dir":"test_case/c/data","wal-dir":"test_case/c/wal","wal-dir-dedicated":"test_case/c/wal","member-dir":"test_case/c/data/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","wait-cluster-ready-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://127.0.0.1:40002"],"advertise-client-urls":["http://127.0.0.1:50002"],"listen-client-urls":["http://127.0.0.1:50002"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"c=http://127.0.0.1:40002,default=http://127.0.0.1:40000,default=http://127.0.0.1:40001","initial-cluster-state":"existing","initial-cluster-token":"etcd-cluster","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","discovery-token":"","discovery-endpoints":"","discovery-dial-timeout":"2s","discovery-request-timeout":"5s","discovery-keepalive-time":"2s","discovery-keepalive-timeout":"6s","discovery-insecure-transport":true,"discovery-insecure-skip-tls-verify":false,"discovery-cert":"","discovery-key":"","discovery-cacert":"","discovery-user":"","downgrade-check-interval":"5s","max-learners":1}
{"level":"warn","ts":"2022-04-12T00:21:32.888-0700","caller":"fileutil/fileutil.go:53","msg":"check file permission","error":"directory \"test_case/c/data\" exist, but the permission is \"drwxrwxr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"level":"warn","ts":"2022-04-12T00:21:32.888-0700","caller":"fileutil/fileutil.go:53","msg":"check file permission","error":"directory \"test_case/c/data/member\" exist, but the permission is \"drwxrwxr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"level":"info","ts":"2022-04-12T00:21:32.888-0700","caller":"storage/backend.go:81","msg":"opened backend db","path":"test_case/c/data/member/snap/db","took":"82.44µs"}
{"level":"warn","ts":"2022-04-12T00:21:32.888-0700","caller":"schema/schema.go:43","msg":"Failed to detect storage schema version. Please wait till wal snapshot before upgrading cluster."}
{"level":"info","ts":"2022-04-12T00:21:33.006-0700","caller":"embed/etcd.go:383","msg":"closing etcd server","name":"c","data-dir":"test_case/c/data","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://127.0.0.1:50002"]}
{"level":"info","ts":"2022-04-12T00:21:33.006-0700","caller":"embed/etcd.go:385","msg":"closed etcd server","name":"c","data-dir":"test_case/c/data","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://127.0.0.1:50002"]}
{"level":"fatal","ts":"2022-04-12T00:21:33.006-0700","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"error validating peerURLs {ClusterID:aab0e09a079f9f55 Members:[&{ID:33cf8d3d56df1746 RaftAttributes:{PeerURLs:[http://127.0.0.1:40000] IsLearner:false} Attributes:{Name:default ClientURLs:[http://127.0.0.1:50000]}} &{ID:8d0cef3f13600fd7 RaftAttributes:{PeerURLs:[http://127.0.0.1:40001] IsLearner:false} Attributes:{Name:default ClientURLs:[http://127.0.0.1:50001]}}] RemovedMemberIDs:[]}: PeerURLs: no match found for existing member (33cf8d3d56df1746, [http://127.0.0.1:40000]), last resolver error (len([\"http://127.0.0.1:40000\"]) != len([\"http://127.0.0.1:40000\" \"http://127.0.0.1:40001\"]))","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/home/nisarg1499/opensource/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/home/nisarg1499/opensource/etcd/server/etcdmain/main.go:40\nmain.main\n\t/home/nisarg1499/opensource/etcd/server/main.go:32\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

Can you please tell me where I went wrong in reproducing the error? I followed the same given commands for terminal x and y.

nisarg1499 avatar Apr 12 '22 07:04 nisarg1499

hey, I'm looking for a beginner-friendly issue if this one is available

keremgocen avatar Jun 14 '22 22:06 keremgocen

Thanks @keremgocen, let's first double-check with @Divya063 to avoid doing duplicate work.

@Divya063 are you still working on this?

ahrtr avatar Jun 14 '22 23:06 ahrtr

@keremgocen Do let me know if you are able to replicate the issue. I am also looking to work on some beginner-friendly issues. @ahrtr

nisarg1499 avatar Jun 14 '22 23:06 nisarg1499

Can you please tell me where I went wrong in reproducing the error? I followed the same given commands for terminal x and y.

Two comments:

  1. The environment variable should be ETCDCTL_ENDPOINTS instead of ETCDCTL_ENDPOINT;
  2. You need to start a cluster with multiple members, e.g. 3.

ahrtr avatar Jun 14 '22 23:06 ahrtr

Can you please tell me where I went wrong in reproducing the error? I followed the same given commands for terminal x and y.

Two comments:

  1. The environment variable should be ETCDCTL_ENDPOINTS instead of ETCDCTL_ENDPOINT;
  2. You need to start a cluster with multiple members, e.g. 3.

Thanks a lot for your reply. I'll check it.

nisarg1499 avatar Jun 15 '22 00:06 nisarg1499

Looks like no progress on this issue.

I would like to work on it.

nic-chen avatar Sep 15 '22 01:09 nic-chen

I read the relevant code and found that Config.Name only has an actual role when the member is started for the first time -- it is used to determine whether the member is local or remote: https://github.com/etcd-io/etcd/blob/main/server/etcdserver/cluster_util.go#L129

At other times, it is just an identifier without any constraints. The same member can even be started with a different name each time.

So I am more inclined to accept duplicate names in ETCD_INITIAL_CLUSTER and treat them as separate members.

What's your opinion? Thanks! @serathius @ahrtr

nic-chen avatar Sep 17 '22 03:09 nic-chen

@nic-chen are you working on this? @ahrtr I was able to reproduce the issue. If @nic-chen is not working on this can I take it up? Also which of the two approaches would you suggest for solving the issue?

UtR491 avatar Oct 17 '22 17:10 UtR491

Just as I mentioned previously https://github.com/etcd-io/etcd/issues/13757#issuecomment-1057718054, each member can have multiple peer URLs. In the following example, http://1.1.1.1:2380 and http://2.2.2.2:2380 are regarded as two peer URLs of the member mach0. I don't think we should change this existing behavior.

mach0=http://1.1.1.1:2380,mach0=http://2.2.2.2:2380,mach1=http://3.3.3.3:2380,mach2=http://4.4.4.4:2380

I think we just need to print a warning message if users do not provide a value for --name.
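A minimal sketch of what such a warning could look like at server start-up; treating embed.DefaultName as the sentinel and the exact call site are assumptions here, not a description of the eventual patch:

// Sketch: warn when the operator did not override the default member name,
// since entries sharing the name "default" in --initial-cluster are merged
// into the URL list of one member instead of becoming separate members.
if cfg.Name == embed.DefaultName {
	lg.Warn(
		"--name is not set; members that share the default name in --initial-cluster are merged into one member",
		zap.String("name", cfg.Name),
	)
}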

ahrtr avatar Oct 17 '22 22:10 ahrtr

Just as I mentioned previously #13757 (comment), each member can have multiple peer URLs. In the following example, http://1.1.1.1:2380 and http://2.2.2.2:2380 are regarded as two peer URLs of the member mach0. I don't think we should change this existing behavior.

mach0=http://1.1.1.1:2380,mach0=http://2.2.2.2:2380,mach1=http://3.3.3.3:2380,mach2=http://4.4.4.4:2380

I think we just need to print a warning message if users do not provide a value for --name.

Thanks for the explanation! I missed that comment...

nic-chen avatar Oct 18 '22 09:10 nic-chen

@nic-chen are you working on this? @ahrtr I was able to reproduce the issue. If @nic-chen is not working on this can I take it up? Also which of the two approaches would you suggest for solving the issue?

hi @UtR491

Sure, I reproduced and fixed it locally; I just haven't finished testing, and I wanted to wait for a reply because I'm not that familiar with etcd.

A PR will be submitted this week.

If you can fix it and add test cases quickly, a PR is welcome; I wouldn't mind.

nic-chen avatar Oct 18 '22 09:10 nic-chen