tiflow

If one DM-master is isolated from a 3-node cluster, every etcd query has a 1/3 chance of getting stuck

Open kennytm opened this issue 7 months ago • 3 comments

What did you do?

  1. Start a cluster with 3 dm-master nodes

    tiup playground v9.0.0-beta.1 --dm-master 3 --db 1 --kv 1 --pd 1 --tiflash 0 --without-monitor
    
  2. Check which node is the leader

    curl -m 0.5 -v 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
    # suppose it replies "dm-master-1" = "127.0.0.1:8262"
    
  3. Suspend a NON-LEADER that is not dm-master-0.

    kill -STOP $(pgrep -f 'name=dm-master-2')
    
  4. Call the API again. Roughly one in every three calls times out.

    curl -m 0.5 -v 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
    curl -m 0.5 -v 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
    curl -m 0.5 -v 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
    curl -m 0.5 -v 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
    curl -m 0.5 -v 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
    curl -m 0.5 -v 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'
    
  5. Resume that member. The API call now succeeds 100% of the time.

    kill -CONT $(pgrep -f 'name=dm-master-2')
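
To put a number on the failure rate seen in step 4, a small helper along these lines could repeat a command and count how often it times out. This is only a sketch: the function name `count_timeouts` is made up for illustration, and it relies on curl exiting with status 28 when `-m` expires.

```shell
# Run a command N times and count how many runs exit with status 28,
# which is the exit status curl uses for "operation timed out".
# count_timeouts is a hypothetical helper, not part of tiup or DM.
count_timeouts() {
  total=$1
  shift
  timeouts=0
  i=0
  while [ "$i" -lt "$total" ]; do
    rc=0
    "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 28 ]; then
      timeouts=$((timeouts + 1))
    fi
    i=$((i + 1))
  done
  echo "$timeouts of $total calls timed out"
}
```

For example, `count_timeouts 30 curl -s -m 0.5 'http://127.0.0.1:8261/apis/v1alpha1/members?leader=true'` should report a count near 10 while dm-master-2 is suspended, if the 1/3 rate described in step 4 holds, and 0 after it is resumed.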
    

What did you expect to see?

The API call should be able to avoid the suspended member.

What did you see instead?

The API call goes through a round-robin load balancer (totally unnecessarily), which makes it time out with a 1/3 chance.
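
The 1/3 figure follows directly from strict rotation over three members when one of them never answers. A toy model of that arithmetic, assuming plain round-robin and nothing else (the function name and member indices are illustrative, and this is not dm-master's actual client code):

```shell
# Toy model: requests rotate over 3 members; one member is suspended,
# so every request routed to it gets stuck until the client times out.
simulate_round_robin() {
  requests=$1    # number of requests to issue
  suspended=$2   # index (0-2) of the suspended member
  members=3
  stuck=0
  i=0
  while [ "$i" -lt "$requests" ]; do
    target=$((i % members))
    if [ "$target" -eq "$suspended" ]; then
      stuck=$((stuck + 1))
    fi
    i=$((i + 1))
  done
  echo "$stuck/$requests requests hit the suspended member"
}

simulate_round_robin 6 2
# prints "2/6 requests hit the suspended member"
```

A balancer that skipped unresponsive members would not show this pattern, which is the behaviour the issue asks for.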

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

v9.0.0-beta.1
v7.1.5

current status of DM cluster (execute query-status <task-name> in dmctl)

No response

kennytm · May 14 '25 06:05

Note that this was a known issue in PD; see tikv/pd#6577 and tikv/pd#7737 for how they fixed it. Perhaps dm-master should just reuse PD's etcdutil.CreateEtcdClient instead of reinventing its own.

kennytm · May 14 '25 06:05

/assign OliverS929

D3Hunter · May 14 '25 07:05

@D3Hunter: GitHub didn't allow me to assign the following users: OliverS929.

Note that only pingcap members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information, please see the contributor guide.

In response to this:

/assign OliverS929

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ti-chi-bot[bot] · May 14 '25 07:05