etcd
tests: Migrate member add tests to common framework
Context: #13637
blocked by #14304
Codecov Report
Merging #14281 (d05c894) into main (012fc51) will decrease coverage by 0.27%. The diff coverage is n/a.
@@ Coverage Diff @@
## main #14281 +/- ##
==========================================
- Coverage 75.49% 75.22% -0.28%
==========================================
Files 457 457
Lines 37084 37084
==========================================
- Hits 27996 27895 -101
- Misses 7341 7427 +86
- Partials 1747 1762 +15
| Flag | Coverage Δ | |
|---|---|---|
| all | 75.22% <ø> (-0.28%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| server/auth/simple_token.go | 80.00% <0.00%> (-8.47%) | :arrow_down: |
| client/pkg/v3/tlsutil/tlsutil.go | 83.33% <0.00%> (-8.34%) | :arrow_down: |
| client/pkg/v3/fileutil/purge.go | 66.03% <0.00%> (-7.55%) | :arrow_down: |
| client/v3/concurrency/mutex.go | 61.64% <0.00%> (-5.48%) | :arrow_down: |
| raft/rafttest/node.go | 95.00% <0.00%> (-5.00%) | :arrow_down: |
| server/storage/mvcc/watchable_store.go | 84.42% <0.00%> (-4.72%) | :arrow_down: |
| server/etcdserver/api/v3rpc/interceptor.go | 73.43% <0.00%> (-4.17%) | :arrow_down: |
| server/proxy/grpcproxy/watch.go | 92.48% <0.00%> (-4.05%) | :arrow_down: |
| server/etcdserver/txn/util.go | 75.47% <0.00%> (-3.78%) | :arrow_down: |
| client/pkg/v3/testutil/recorder.go | 76.27% <0.00%> (-3.39%) | :arrow_down: |
| ... and 21 more | | |
Found error:
member_test.go:94: could not add member, err: match not found. Set EXPECT_DEBUG for more info Err: read /dev/ptmx: input/output error, last lines:
Let me know if you need help with debugging.
This clue shows that a member of a cluster starts serving clients before it is connected to all of its peers.
/home/runner/work/etcd/etcd/bin/etcd-22812: {"level":"warn","ts":"2022-07-27T08:46:33.653Z","caller":"etcdserver/server.go:1351","msg":"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"a8f3458387986d","requested-member-add":"{ID:557cfa84711678b8 RaftAttributes:{PeerURLs:[http://240.0.0.0:65535/] IsLearner:false} Attributes:{Name: ClientURLs:[]}}","error":"etcdserver: unhealthy cluster"}
[14847](https://github.com/etcd-io/etcd/runs/7536399620?check_suite_focus=true#step:5:14848)
To add or remove a member from a cluster, a client needs to wait a little bit. I've reproduced it on my computer.

To correct the test cases, I'm going to keep retrying MemberAdd() for a few seconds.
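A minimal sketch of that retry, assuming a plain clientv3 client; the 10-second timeout and 500 ms interval are arbitrary illustrative values, not the exact code in this PR:

```go
package member_test

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// retryMemberAdd keeps calling MemberAdd until it succeeds or the timeout
// expires, giving the cluster time to finish connecting to all its peers.
func retryMemberAdd(ctx context.Context, c *clientv3.Client, peerURLs []string) (*clientv3.MemberAddResponse, error) {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	for {
		resp, err := c.MemberAdd(ctx, peerURLs)
		if err == nil {
			return resp, nil
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("member add did not succeed before timeout: %w", err)
		case <-time.After(500 * time.Millisecond):
			// The cluster may still be connecting to its peers; try again.
		}
	}
}
```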
Hmm, maybe we should use clus.WaitLeader()?
One potential problem is adding members to a 1-node cluster. In that situation we grow from a 1-member to a 2-member cluster, which means the quorum is now 2 members; however, in the tests we don't really add a working member (just a blackhole URL). This means the 1-node cluster will become unavailable.
On the other hand, a 3-node cluster grows to 4 members, and the 3 working members still satisfy the new quorum of 3, so the cluster remains available. Please try removing the test cases that use a cluster of size 1.
> Hmm, maybe we should use clus.WaitLeader()?
Wow, this is an amazing function; it looks like just what I need. I'll give it a try.
> One potential problem is adding members to a 1-node cluster. In that situation we grow from a 1-member to a 2-member cluster, which means the quorum is now 2 members; however, in the tests we don't really add a working member (just a blackhole URL). This means the 1-node cluster will become unavailable.
>
> On the other hand, a 3-node cluster grows to 4 members, and the 3 working members still satisfy the new quorum of 3, so the cluster remains available. Please try removing the test cases that use a cluster of size 1.
sg.
> Hmm, maybe we should use clus.WaitLeader()?
>
> Wow, this is an amazing function; it looks like just what I need. I'll give it a try.
I think cluster creation should already call it, so I would look into the second suggestion: skipping test cases where the cluster size equals 1.
As I went through the code base, I found that clus.WaitLeader() is called in many integration test cases, but not in any e2e test case or in the cluster creation process.
I couldn't find an equivalent of clus.WaitLeader() in either the old e2e test code or the common framework e2e code. 😟
> Hmm, maybe we should use clus.WaitLeader()?
>
> Wow, this is an amazing function; it looks like just what I need. I'll give it a try.
>
> I think cluster creation should already call it, so I would look into the second suggestion: skipping test cases where the cluster size equals 1.
Skipping the test cases where the cluster size equals 1 is not likely to make the tests pass. The failing test cases at the moment are the ones with cluster size > 1.
When the cluster size is 1, the single node accepts the MemberAdd request immediately and responds OK, ignoring the fact that the newly added blackhole peer URL will make the cluster unavailable.
I would recommend rebasing, as there was a change that greatly reduces test flakiness: https://github.com/etcd-io/etcd/pull/14283
Maybe we need to implement a clus.WaitLeader() function in the e2e cluster and expose it in the Cluster interface. What do you think?
type Cluster interface {
	Members() []Member
	// HERE
	WaitLeader() int
	Client() Client
	Close() error
}

type Member interface {
	Client() Client
	Start() error
	Stop()
}

type Client interface {
	Put(key, value string, opts config.PutOptions) error
	Get(key string, opts config.GetOptions) (*clientv3.GetResponse, error)
	Delete(key string, opts config.DeleteOptions) (*clientv3.DeleteResponse, error)
	Compact(rev int64, opts config.CompactOption) (*clientv3.CompactResponse, error)
	Status() ([]*clientv3.StatusResponse, error)
	HashKV(rev int64) ([]*clientv3.HashKVResponse, error)
	Health() error
	Defragment(opts config.DefragOption) error
	AlarmList() (*clientv3.AlarmResponse, error)
	AlarmDisarm(alarmMember *clientv3.AlarmMember) (*clientv3.AlarmResponse, error)
	// ...
}
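A possible shape for such an e2e WaitLeader() is sketched below, assuming each member exposes a client endpoint that can be queried with clientv3; the real common-framework types and helpers may differ:

```go
package member_test

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// waitLeader polls each member's Status endpoint until one of them reports
// itself as the current leader, and returns that member's index.
// "endpoints" is assumed to hold one client URL per member, in member order.
func waitLeader(ctx context.Context, c *clientv3.Client, endpoints []string) (int, error) {
	for {
		for i, ep := range endpoints {
			resp, err := c.Status(ctx, ep)
			if err != nil {
				continue // this member may not be serving yet
			}
			// A member is the leader when the leader ID in its status
			// matches its own member ID.
			if resp.Leader != 0 && resp.Leader == resp.Header.MemberId {
				return i, nil
			}
		}
		select {
		case <-ctx.Done():
			return -1, ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
}
```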
Do you need to wait for https://github.com/etcd-io/etcd/pull/14304 to be merged and then update this PR?
> Do you need to wait for #14304 to be merged and then update this PR?
Yes. This PR is blocked by #14304. Should we label this PR or close it for now?
> Do you need to wait for #14304 to be merged and then update this PR?
>
> Yes. This PR is blocked by #14304. Should we label this PR or close it for now?
Once https://github.com/etcd-io/etcd/pull/14304 is merged, you can rebase this PR.
The E2E / test (linux-386-e2e) (pull_request) check failed due to an unhealthy cluster error
/home/runner/work/etcd/etcd/bin/etcd-22730: {"level":"warn","ts":"2022-08-14T11:45:32.476Z","caller":"etcdserver/server.go:1351","msg":"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"d70245c72180528a","requested-member-add":"{ID:ccf9af529e52be7e RaftAttributes:{PeerURLs:[http://240.0.0.0:65535/] IsLearner:false} Attributes:{Name: ClientURLs:[]}}","error":"etcdserver: unhealthy cluster"}
[14842](https://github.com/etcd-io/etcd/runs/7826140582?check_suite_focus=true#step:5:14843)
The MemberAdd function worked fine in the integration tests but failed in the e2e tests. That's because the integration cluster has `--strict-reconfig-check` disabled by default, while the e2e cluster has `--strict-reconfig-check` enabled by default.
Therefore, in the e2e cluster, MemberAdd has to wait until the cluster has been fully connected for 5 seconds (isConnectedFullySince); otherwise an unhealthy cluster error occurs.
https://github.com/etcd-io/etcd/blob/d05c894e854e1b8545c41ff26ac85c8b76917a7c/server/etcdserver/server.go#L1350-L1358
PR #14360 was opened to discuss this.
Please resolve the conflict.
Overall note: instead of having 3 different tests, we could merge them into one table test. For me, the scenario in all those tests is almost the same:
- Create cluster
- Call MemberAdd
- Check whether it was successful or not
We could just add `expectError`, `strictReconfigCheck` and `waitForQuorum` as fields on the test case struct.
> Overall note: instead of having 3 different tests, we could merge them into one table test. For me, the scenario in all those tests is almost the same:
>
> - Create cluster
> - Call MemberAdd
> - Check whether it was successful or not
>
> We could just add `expectError`, `strictReconfigCheck` and `waitForQuorum` as fields on the test case struct.
I was worried about making a single test function too large and complex. I'll try and see if I can handle this.
> Overall note: instead of having 3 different tests, we could merge them into one table test. For me, the scenario in all those tests is almost the same:
>
> - Create cluster
> - Call MemberAdd
> - Check whether it was successful or not
>
> We could just add `expectError`, `strictReconfigCheck` and `waitForQuorum` as fields on the test case struct.
I followed your suggestion and it works great! I've force-pushed. Please take a look.
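For reference, a merged table-driven test along those lines might look roughly like the sketch below. The `expectError`, `strictReconfigCheck` and `waitForQuorum` fields come from the suggestion above; `newCluster` and `addBlackholeMember` are hypothetical helpers standing in for the common framework's cluster setup and the MemberAdd call with an unreachable peer URL, and the example rows are illustrative rather than the exact cases in this PR:

```go
package member_test

import "testing"

// Sketch of the table-driven shape; newCluster and addBlackholeMember are
// hypothetical helpers, not part of the actual common framework API.
func TestMemberAdd(t *testing.T) {
	tcs := []struct {
		name                string
		clusterSize         int
		strictReconfigCheck bool
		waitForQuorum       bool
		expectError         bool
	}{
		{name: "3 nodes, strict check, wait for quorum", clusterSize: 3, strictReconfigCheck: true, waitForQuorum: true},
		{name: "3 nodes, strict check, no wait", clusterSize: 3, strictReconfigCheck: true, expectError: true},
		{name: "3 nodes, no strict check", clusterSize: 3},
	}
	for _, tc := range tcs {
		t.Run(tc.name, func(t *testing.T) {
			clus := newCluster(t, tc.clusterSize, tc.strictReconfigCheck) // hypothetical setup helper
			defer clus.Close()
			if tc.waitForQuorum {
				clus.WaitLeader() // wait until the cluster has an established leader
			}
			_, err := addBlackholeMember(clus) // hypothetical: MemberAdd with an unreachable peer URL
			if tc.expectError && err == nil {
				t.Fatal("expected member add to fail, but it succeeded")
			}
			if !tc.expectError && err != nil {
				t.Fatalf("unexpected member add error: %v", err)
			}
		})
	}
}
```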