If one DM-master is isolated from a 3-node cluster, every etcd query has a 1/3 chance of getting stuck
What did you do?
- Start a cluster with 3 dm-master nodes.
    tiup playground v9.0.0-beta.1 --dm-master 3 --db 1 --kv 1 --pd 1 --tiflash 0 --without-monitor
- Check which node is the leader.
    curl -m 0.5 -v 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
    # suppose it replies "dm-master-1" = "127.0.0.1:8262"
- Suspend a NON-LEADER that is not dm-master-0.
    kill -STOP $(pgrep -f 'name=dm-master-2')
- Call the API again, several times. Roughly one in every three calls times out.
    curl -m 0.5 -v 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
    curl -m 0.5 -v 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
    curl -m 0.5 -v 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
    curl -m 0.5 -v 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
    curl -m 0.5 -v 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
    curl -m 0.5 -v 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
- Resume that member. The API call now succeeds every time.
    kill -CONT $(pgrep -f 'name=dm-master-2')
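To put a number on the timeout rate instead of eyeballing repeated curl runs, something like the following helper might be useful. It is only a sketch: `count_timeouts` is a hypothetical name, not part of DM or tiup, and it relies on curl's documented exit code 28 ("operation timeout") to tell timeouts apart from other failures.

```shell
# count_timeouts N CMD [ARGS...]
# Hypothetical helper: runs CMD N times and prints how many runs exited
# with status 28, which is curl's exit code for a timed-out operation.
count_timeouts() {
  n=$1
  shift
  timeouts=0
  i=0
  while [ "$i" -lt "$n" ]; do
    rc=0
    "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 28 ]; then
      timeouts=$((timeouts + 1))
    fi
    i=$((i + 1))
  done
  echo "$timeouts"
}

# Example invocation against the playground cluster from the steps above
# (left commented out because it needs a running cluster):
# count_timeouts 30 curl -s -m 0.5 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
```

With one non-leader suspended, the count for 30 calls would be expected to land near 10 if the 1/3 observation holds; after `kill -CONT` it should drop to 0.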
What did you expect to see?
The API call should be able to avoid the suspended member.
What did you see instead?
The API call goes through a round-robin load balancer (entirely unnecessarily), so it times out with a 1/3 chance.
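The 1/3 figure follows directly from strict round-robin over three members with one of them suspended. The toy model below illustrates the arithmetic only; `simulate_round_robin` is a hypothetical function and does not reflect how dm-master's etcd client actually picks endpoints.

```shell
# simulate_round_robin REQUESTS STUCK_INDEX MEMBERS
# Toy model (an assumption, not DM code): request i is routed to member
# (i mod MEMBERS); any request routed to STUCK_INDEX counts as a timeout.
# Prints the number of requests that would time out.
simulate_round_robin() {
  requests=$1
  stuck=$2
  members=$3
  failed=0
  i=0
  while [ "$i" -lt "$requests" ]; do
    target=$((i % members))
    if [ "$target" -eq "$stuck" ]; then
      failed=$((failed + 1))
    fi
    i=$((i + 1))
  done
  echo "$failed"
}

# Six requests over three members with member 2 suspended.
simulate_round_robin 6 2 3
```

In this model, six requests produce two timeouts, matching the one-in-three pattern seen with the six curl calls in the reproduction steps.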
Versions of the cluster
DM version (run dmctl -V or dm-worker -V or dm-master -V):
v9.0.0-beta.1
v7.1.5
current status of DM cluster (execute query-status <task-name> in dmctl)
No response
Note that this was a known issue in PD; see tikv/pd#6577 and tikv/pd#7737 for how they fixed it. Perhaps dm-master should just reuse PD's etcdutil.CreateEtcdClient instead of reinventing its own.
/assign OliverS929