As I promised in https://github.com/etcd-io/etcd/pull/14354, trying to introduce to EtcdProcessCluster functions for changing cluster member list.

As an example rewrote e2e tests from mentioned pull request.

Aug 30 '22 11:08 biosvs

Please resolve the conflict.

Sep 05 '22 06:09 ahrtr

@ahrtr done, rebased on main

Sep 05 '22 09:09 biosvs

I was asked by @ahrtr to do it. In general I'm completely okay with the way my test is currently written. Honestly, idk how to simplify test without changing in common libraries. Maybe it's better to not to change anything at all (then feel free to reject PR, I'm okay with it).

Sep 05 '22 10:09 biosvs

For me problem is making list of members dynamic, which would be only used for this test. Could we maybe check if the test itself can be simpler? For testing endpoints-auto-sync-interval do we really need to start a whole etcd process?

My understanding is that we are just looking for a Resolver state updated log in proxy. I think that just calling a MemberAdd on cluster should be enough.

Sep 05 '22 10:09 serathius

Hmm, can't agree here. Even using logs for making progress in tests make me uncomfortable, honestly. But I definitely would not say that something is working based on logs. I prefer to think that "it doesn't work until you run it explicitly and get expected result".

Sep 05 '22 11:09 biosvs

I do not get time to have a deeper review on this PR, but my immediate feeling is it's a little over complicated.

The gap mentioned in https://github.com/etcd-io/etcd/pull/14354#discussion_r953155400 is that the current test framework has interface MemberAdd, but doesn't have interface method something like MemberLaunch or LaunchMember or StartMember or MemberStart. I think we do need to support dynamically changing the member in the framework.

Please always keep OCP (Open Close principle) in mind: Open for extension, but close for modification. When updating existing framework, please try not to change the existing logic. Instead, please try to make your new code adapt to the existing framework.

Sep 09 '22 06:09 ahrtr

I'm afraid we're stuck with this PR (because we're in a kinda closed loop).

Changed e2e test cannot be rewritten with common or integration framework, because we're testing interaction through different running instances (including updates via network).

The same story is here: https://github.com/etcd-io/etcd/blob/main/tests/e2e/ctl_v3_snapshot_test.go#L258

Both these tests need actual member list changes. Both tests do it via raw commands. I tried to add these commands to e2e framework, but you think it's unnecessary. You're contributors, and if you're saying that it's easer to use raw commands for couple of tests - let it be.

Sep 12 '22 12:09 biosvs

When I tried to enhance the existing e2e test framework, eventually it's similar to this PR. But the interface definition in my commit looks clearer, I also get rid of the nextSeq. Please consider to enhance this PR per https://github.com/ahrtr/etcd/commit/9649fd7538cecaaabdef88aa159129629262da96

Sep 13 '22 06:09 ahrtr

Codecov Report

Merging #14404 (9649fd7) into main (4d57eb8) will decrease coverage by 0.72%. The diff coverage is 48.48%.

:exclamation: Current head 9649fd7 differs from pull request most recent head 76ab474. Consider uploading reports for the commit 76ab474 to get more accurate results

@@            Coverage Diff             @@
##             main   #14404      +/-   ##
==========================================
- Coverage   75.47%   74.75%   -0.73%     
==========================================
  Files         457      457              
  Lines       37183    37207      +24     
==========================================
- Hits        28065    27813     -252     
- Misses       7361     7612     +251     
- Partials     1757     1782      +25

Flag	Coverage Δ
all	`74.75% <48.48%> (-0.73%)`	:arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
client/v3/mock/mockserver/mockserver.go	`0.00% <0.00%> (ø)`
server/etcdserver/raft.go	`88.29% <71.42%> (-1.48%)`	:arrow_down:
etcdctl/ctlv3/command/move_leader_command.go	`73.68% <100.00%> (+2.63%)`	:arrow_up:
server/etcdserver/server.go	`84.63% <100.00%> (-0.93%)`	:arrow_down:
server/proxy/grpcproxy/election.go	`9.09% <0.00%> (-72.73%)`	:arrow_down:
server/proxy/grpcproxy/lock.go	`33.33% <0.00%> (-66.67%)`	:arrow_down:
...ver/proxy/grpcproxy/adapter/lock_client_adapter.go	`33.33% <0.00%> (-66.67%)`	:arrow_down:
...proxy/grpcproxy/adapter/election_client_adapter.go	`6.89% <0.00%> (-62.07%)`	:arrow_down:
etcdutl/etcdutl/migrate_command.go	`21.05% <0.00%> (-53.95%)`	:arrow_down:
...xy/grpcproxy/adapter/maintenance_client_adapter.go	`17.14% <0.00%> (-17.15%)`	:arrow_down:
... and 42 more

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

Sep 13 '22 07:09 codecov-commenter

Maybe I didn't get the idea, but code looks broken for me:

Cluster state is omitted It's necessary to pass "initial-cluster-state existing" when new member is added (I think that was the reason to add SetInitialCluster to EtcdServerProcessConfig).
nextSeq With "epc.Cfg.ClusterSize", you'll try to reuse occupied ports in case of this sequence:

Start cluster of >1 nodes
Remove member (with least ports number)
Add member (it will try to use the same ports as the member with highest ports number)

Sep 13 '22 08:09 biosvs

The commit is just to demonstrate the high level idea instead of a PR ready for review.

Sep 13 '22 08:09 ahrtr

Then I didn't get the idea, sorry.

You got rid of things that was introduced in order to make code workable.

Sep 13 '22 08:09 biosvs

2. nextSeq With "epc.Cfg.ClusterSize", you'll try to reuse occupied ports in case of this sequence:

The initial value for nextSq is cfg.ClusterSize, and incremental by 1 each time calling StartNewProc. I don't understand why it can't be removed and use cfg.ClusterSize or len(epc.Procs) directly. You just need to update cfg.ClusterSize and epc.Procs each time after calling StartNewProc.

Sep 14 '22 02:09 ahrtr

Let's assume that initial configuration is:

base port 2000
2 nodes

Then node A has port 2000 + 15 = 2005, node B has port 2000 + 25 = 2010.

Now imaging that I'm first removing node A and then adding new node C. What will I get?

If I use cfg.ClusterSize or len(epc.Procs), I'll get port for new node = 2000 + 2*5 = 2010, which is overlapping with already running node B. That's why I need some counter that doesn't depend on the current state of cluster. That's why nextSq was introduced.

Sep 15 '22 09:09 biosvs

Let's assume that initial configuration is:
* base port 2000

* 2 nodes
Then node A has port 2000 + 1_5 = 2005, node B has port 2000 + 2_5 = 2010.

Now imaging that I'm first removing node A and then adding new node C. What will I get?

If I use cfg.ClusterSize or len(epc.Procs), I'll get port for new node = 2000 + 2*5 = 2010, which is overlapping with already running node B. That's why I need some counter that doesn't depend on the current state of cluster. That's why nextSq was introduced.

Make sense to me, thanks for the clarification.

Sep 15 '22 09:09 ahrtr

@biosvs Could you resolve the inline comments? Please also rebase this PR.

Oct 12 '22 02:10 ahrtr

Continue to work on this PR in https://github.com/etcd-io/etcd/pull/14589

Oct 14 '22 22:10 ahrtr

etcd
etcd copied to clipboard

Added member interaction into EtcdProcessCluster

Codecov Report

etcd etcd copied to clipboard

Added member interaction into EtcdProcessCluster

Codecov Report

etcd
etcd copied to clipboard