
tests: Migrate member add tests to common framework

Open clement2026 opened this issue 3 years ago • 15 comments

Context: #13637

blocked by #14304

clement2026 avatar Jul 27 '22 08:07 clement2026

Codecov Report

Merging #14281 (d05c894) into main (012fc51) will decrease coverage by 0.27%. The diff coverage is n/a.

@@            Coverage Diff             @@
##             main   #14281      +/-   ##
==========================================
- Coverage   75.49%   75.22%   -0.28%     
==========================================
  Files         457      457              
  Lines       37084    37084              
==========================================
- Hits        27996    27895     -101     
- Misses       7341     7427      +86     
- Partials     1747     1762      +15     
| Flag | Coverage Δ |
| --- | --- |
| all | 75.22% <ø> (-0.28%) :arrow_down: |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| server/auth/simple_token.go | 80.00% <0.00%> (-8.47%) :arrow_down: |
| client/pkg/v3/tlsutil/tlsutil.go | 83.33% <0.00%> (-8.34%) :arrow_down: |
| client/pkg/v3/fileutil/purge.go | 66.03% <0.00%> (-7.55%) :arrow_down: |
| client/v3/concurrency/mutex.go | 61.64% <0.00%> (-5.48%) :arrow_down: |
| raft/rafttest/node.go | 95.00% <0.00%> (-5.00%) :arrow_down: |
| server/storage/mvcc/watchable_store.go | 84.42% <0.00%> (-4.72%) :arrow_down: |
| server/etcdserver/api/v3rpc/interceptor.go | 73.43% <0.00%> (-4.17%) :arrow_down: |
| server/proxy/grpcproxy/watch.go | 92.48% <0.00%> (-4.05%) :arrow_down: |
| server/etcdserver/txn/util.go | 75.47% <0.00%> (-3.78%) :arrow_down: |
| client/pkg/v3/testutil/recorder.go | 76.27% <0.00%> (-3.39%) :arrow_down: |

... and 21 more


codecov-commenter avatar Jul 27 '22 09:07 codecov-commenter

Found this error:

member_test.go:94: could not add member, err: match not found. Set EXPECT_DEBUG for more info Err: read /dev/ptmx: input/output error, last lines:

Let me know if you need help with debugging.

serathius avatar Jul 27 '22 09:07 serathius

This clue shows that a cluster member starts serving clients before it is connected to all of its peers.

/home/runner/work/etcd/etcd/bin/etcd-22812: {"level":"warn","ts":"2022-07-27T08:46:33.653Z","caller":"etcdserver/server.go:1351","msg":"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"a8f3458387986d","requested-member-add":"{ID:557cfa84711678b8 RaftAttributes:{PeerURLs:[http://240.0.0.0:65535/] IsLearner:false} Attributes:{Name: ClientURLs:[]}}","error":"etcdserver: unhealthy cluster"}
[14847](https://github.com/etcd-io/etcd/runs/7536399620?check_suite_focus=true#step:5:14848)

To add or remove a member from a cluster, a client needs to wait a little bit. I've reproduced it on my computer.

(screenshot of the local reproduction: Xnip2022-07-27_19-40-35)

To fix the test cases, I'm going to keep retrying MemberAdd() for a few seconds, as sketched below.
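
Roughly what I have in mind (just a sketch against clientv3, not the final test code; the package name, retry window, and back-off are placeholders):

```go
// Sketch: retry MemberAdd until it succeeds or a deadline passes.
package e2e

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func addMemberWithRetry(cli *clientv3.Client, peerURLs []string, timeout time.Duration) (*clientv3.MemberAddResponse, error) {
	deadline := time.Now().Add(timeout)
	for {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		resp, err := cli.MemberAdd(ctx, peerURLs)
		cancel()
		if err == nil {
			return resp, nil
		}
		if time.Now().After(deadline) {
			return nil, fmt.Errorf("member add did not succeed within %v: %w", timeout, err)
		}
		time.Sleep(100 * time.Millisecond) // brief back-off before retrying
	}
}
```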

clement2026 avatar Jul 27 '22 11:07 clement2026

Hmm, maybe we should use clus.WaitLeader()?

serathius avatar Jul 27 '22 11:07 serathius

One potential problem is adding members to a 1-node cluster. In that situation we grow from a 1-member to a 2-member cluster, which means the quorum is now 2 members; however, in the tests we don't really add a working member (just a blackhole URL). This means the 1-node cluster will become unavailable.

On the other hand, when a 3-node cluster grows to 4 members, the 3 live members still satisfy the quorum, so the cluster stays available. Please try removing the test cases that use a cluster of size 1.
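
To make the quorum arithmetic concrete, a quick illustration (my own, not etcd code):

```go
// Illustration only: quorum(n) = n/2 + 1. Adding a blackhole member raises
// the member count without raising the number of live members.
package main

import "fmt"

func quorum(n int) int { return n/2 + 1 }

func main() {
	// 1-node cluster grows to 2 members: quorum becomes 2, only 1 member is live -> unavailable.
	fmt.Printf("2 members: need %d live, have 1\n", quorum(2))
	// 3-node cluster grows to 4 members: quorum becomes 3, 3 members are live -> still available.
	fmt.Printf("4 members: need %d live, have 3\n", quorum(4))
}
```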

serathius avatar Jul 27 '22 12:07 serathius

Hmm, maybe we should use clus.WaitLeader()?

Wow, this is an amazing function; it looks like exactly what I need. I'll give it a try.

clement2026 avatar Jul 27 '22 14:07 clement2026

One potential problem is adding members to a 1-node cluster. In that situation we grow from a 1-member to a 2-member cluster, which means the quorum is now 2 members; however, in the tests we don't really add a working member (just a blackhole URL). This means the 1-node cluster will become unavailable.

On the other hand, when a 3-node cluster grows to 4 members, the 3 live members still satisfy the quorum, so the cluster stays available. Please try removing the test cases that use a cluster of size 1.

sg.

clement2026 avatar Jul 27 '22 14:07 clement2026

Hmm, maybe we should use clus.WaitLeader()?

Wow, this is an amazing function; it looks like exactly what I need. I'll give it a try.

I think cluster creation should already call it, so I would look into the second suggestion of skipping test cases where the cluster size equals 1.

serathius avatar Jul 27 '22 14:07 serathius

As I went through the code base, I found that clus.WaitLeader() is called in many integration test cases, but not in any e2e test case or during cluster creation. I couldn't find an equivalent of clus.WaitLeader() in either the old e2e test code or the common framework e2e code. 😟

clement2026 avatar Jul 28 '22 07:07 clement2026

Hmm, maybe we should use clus.WaitLeader()?

Wow, this is an amazing function; it looks like exactly what I need. I'll give it a try.

I think cluster creation should already call it, so I would look into the second suggestion of skipping test cases where the cluster size equals 1.

Skipping test cases where the cluster size equals 1 is unlikely to make the tests pass. The test cases failing at the moment are the ones with cluster size > 1.

When the cluster size is 1, the single node accepts the MemberAdd request immediately and responds OK, ignoring the fact that the newly added blackhole peer URL will make the cluster unavailable.

clement2026 avatar Jul 28 '22 07:07 clement2026

I would recommend rebasing, as there was a change that greatly reduces test flakiness: https://github.com/etcd-io/etcd/pull/14283

serathius avatar Jul 28 '22 15:07 serathius

Maybe we need to implement a clus.WaitLeader() function in the e2e cluster and expose it in the Cluster interface. What do you think?

```go
type Cluster interface {
	Members() []Member
	// HERE
	WaitLeader() int
	Client() Client
	Close() error
}

type Member interface {
	Client() Client
	Start() error
	Stop()
}

type Client interface {
	Put(key, value string, opts config.PutOptions) error
	Get(key string, opts config.GetOptions) (*clientv3.GetResponse, error)
	Delete(key string, opts config.DeleteOptions) (*clientv3.DeleteResponse, error)
	Compact(rev int64, opts config.CompactOption) (*clientv3.CompactResponse, error)
	Status() ([]*clientv3.StatusResponse, error)
	HashKV(rev int64) ([]*clientv3.HashKVResponse, error)
	Health() error
	Defragment(opts config.DefragOption) error
	AlarmList() (*clientv3.AlarmResponse, error)
	AlarmDisarm(alarmMember *clientv3.AlarmMember) (*clientv3.AlarmResponse, error)
}
```
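
A rough sketch of what the e2e-side WaitLeader() could do: poll Status on each member's client endpoint until one member reports itself as the leader. The endpoints parameter, package name, and timeouts below are placeholders, not the framework's actual API:

```go
// Sketch only, not the actual framework code. Polls each endpoint's Status
// until some member reports itself as the leader and returns its index.
package e2e

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func waitLeader(ctx context.Context, endpoints []string) (int, error) {
	for {
		for i, ep := range endpoints {
			cli, err := clientv3.New(clientv3.Config{Endpoints: []string{ep}, DialTimeout: time.Second})
			if err != nil {
				continue
			}
			sctx, scancel := context.WithTimeout(ctx, time.Second)
			resp, err := cli.Status(sctx, ep)
			scancel()
			_ = cli.Close()
			// a member is the leader when the reported leader ID matches its own member ID
			if err == nil && resp.Leader == resp.Header.MemberId {
				return i, nil
			}
		}
		select {
		case <-ctx.Done():
			return -1, ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
}
```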

clement2026 avatar Jul 28 '22 16:07 clement2026

Do you need to wait for https://github.com/etcd-io/etcd/pull/14304 to be merged and then update this PR?

ahrtr avatar Aug 08 '22 21:08 ahrtr

Do you need to wait for #14304 to be merged and then update this PR?

Yes. This PR is blocked by #14304. Should we label this PR or close it for now?

clement2026 avatar Aug 09 '22 03:08 clement2026

Do you need to wait for #14304 to be merged and then update this PR?

Yes. This PR is blocked by #14304. Should we label this PR or close it for now?

Once https://github.com/etcd-io/etcd/pull/14304 is merged, you can rebase this PR.

ahrtr avatar Aug 09 '22 03:08 ahrtr

The E2E / test (linux-386-e2e) (pull_request) check failed due to an unhealthy cluster error:

/home/runner/work/etcd/etcd/bin/etcd-22730: {"level":"warn","ts":"2022-08-14T11:45:32.476Z","caller":"etcdserver/server.go:1351","msg":"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"d70245c72180528a","requested-member-add":"{ID:ccf9af529e52be7e RaftAttributes:{PeerURLs:[http://240.0.0.0:65535/] IsLearner:false} Attributes:{Name: ClientURLs:[]}}","error":"etcdserver: unhealthy cluster"}
[14842](https://github.com/etcd-io/etcd/runs/7826140582?check_suite_focus=true#step:5:14843)

The MemberAdd function worked fine in the integration tests but failed in the e2e tests. That's because the integration cluster has `--strict-reconfig-check` disabled by default, while the e2e cluster has `--strict-reconfig-check` enabled by default.

Therefore, in the e2e cluster, MemberAdd has to wait about 5 seconds until the cluster satisfies isConnectedFullySince; otherwise an unhealthy cluster error occurs.

https://github.com/etcd-io/etcd/blob/d05c894e854e1b8545c41ff26ac85c8b76917a7c/server/etcdserver/server.go#L1350-L1358
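
For illustration, the linked check boils down to something like this (a simplified sketch of the idea, not the actual etcd source): with `--strict-reconfig-check` enabled, every voting peer must have been heard from within the last HealthInterval (5s), otherwise MemberAdd is rejected.

```go
// Illustration of the idea behind isConnectedFullySince, not the real code.
package main

import (
	"fmt"
	"time"
)

const healthInterval = 5 * time.Second // etcd's HealthInterval

// connectedFullySince reports whether every peer has been active since the cutoff.
func connectedFullySince(lastActive map[string]time.Time, cutoff time.Time) bool {
	for _, t := range lastActive {
		if t.Before(cutoff) {
			return false
		}
	}
	return true
}

func main() {
	now := time.Now()
	peers := map[string]time.Time{
		"peer-a": now.Add(-1 * time.Second),  // heard from recently
		"peer-b": now.Add(-10 * time.Second), // stale -> MemberAdd would be rejected
	}
	fmt.Println(connectedFullySince(peers, now.Add(-healthInterval))) // false
}
```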

PR #14360 was opened to discuss this.

clement2026 avatar Aug 18 '22 09:08 clement2026

Please resolve the conflict.

ahrtr avatar Sep 05 '22 07:09 ahrtr

Overall note: instead of having 3 different tests, we could merge them into one table test. To me, the scenario in all those tests is almost the same:

  • Create a cluster
  • Call MemberAdd
  • Check whether it was successful or not

We could just add expectError, strictReconfigCheck and waitForQuorum as fields on the test case struct.
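
Something along these lines, perhaps (a sketch of the table-driven layout; the concrete rows, expectations, and setup steps are placeholders, not the framework's actual API):

```go
package e2e

import "testing"

func TestMemberAdd(t *testing.T) {
	tcs := []struct {
		name                string
		clusterSize         int
		strictReconfigCheck bool
		waitForQuorum       bool
		expectError         bool
	}{
		{name: "3 members, strict check, no wait", clusterSize: 3, strictReconfigCheck: true, waitForQuorum: false, expectError: true},
		{name: "3 members, strict check, wait for quorum", clusterSize: 3, strictReconfigCheck: true, waitForQuorum: true, expectError: false},
		{name: "3 members, no strict check", clusterSize: 3, strictReconfigCheck: false, expectError: false},
	}
	for _, tc := range tcs {
		tc := tc
		t.Run(tc.name, func(t *testing.T) {
			// 1. create a cluster of tc.clusterSize, toggling --strict-reconfig-check
			// 2. if tc.waitForQuorum, wait until the members are connected to each other
			// 3. call MemberAdd and check the result against tc.expectError
		})
	}
}
```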

serathius avatar Sep 05 '22 13:09 serathius

Overall note: instead of having 3 different tests, we could merge them into one table test. To me, the scenario in all those tests is almost the same:

  • Create a cluster
  • Call MemberAdd
  • Check whether it was successful or not

We could just add expectError, strictReconfigCheck and waitForQuorum as fields on the test case struct.

I was worried about making a single test function too large and complex. I'll try and see if I can handle this.

clement2026 avatar Sep 05 '22 14:09 clement2026

Overall note: instead of having 3 different tests, we could merge them into one table test. To me, the scenario in all those tests is almost the same:

  • Create a cluster
  • Call MemberAdd
  • Check whether it was successful or not

We could just add expectError, strictReconfigCheck and waitForQuorum as fields on the test case struct.

I followed your suggestion and it works great! I've force pushed. Please take a look.

clement2026 avatar Sep 06 '22 03:09 clement2026