
executor: disable closest replica read if cluster is not balanced

Open glorv opened this issue 2 years ago • 5 comments

What problem does this PR solve?

Issue Number: ref #35926

Problem Summary:

What is changed and how it works?

#35927 introduced a new replica_read type, closest-replica, which dispatches read requests to a store within the same AZ. In this mode, however, if read traffic is not evenly distributed across AZs, it may cause unbalanced load on tikv and hurt overall performance.

This PR alleviates the problem by adding a periodic check when closest-replica is enabled. Every 60s, it checks that all AZs contain both tidb and tikv instances; if not, it disables closest-replica and falls back to leader read. This simple check avoids traffic skew in most cases.
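The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `instance` type, `zonesBalanced`, and `effectiveReadMode` are hypothetical names, the zone is assumed to be available as a plain string label on each instance, and the 60s scheduling (for example via a `time.Ticker`) is left out.

```go
package main

import "fmt"

// instance is a hypothetical, simplified view of a tidb or tikv node:
// only its kind and the zone (AZ) label it carries.
type instance struct {
	kind string // "tidb" or "tikv"
	zone string
}

// zonesBalanced reports whether every zone that hosts any instance hosts
// at least one tidb and at least one tikv instance. An empty cluster view
// is treated as unbalanced (an assumption of this sketch).
func zonesBalanced(instances []instance) bool {
	type counts struct{ tidb, tikv int }
	perZone := make(map[string]*counts)
	for _, in := range instances {
		c, ok := perZone[in.zone]
		if !ok {
			c = &counts{}
			perZone[in.zone] = c
		}
		switch in.kind {
		case "tidb":
			c.tidb++
		case "tikv":
			c.tikv++
		}
	}
	if len(perZone) == 0 {
		return false
	}
	for _, c := range perZone {
		if c.tidb == 0 || c.tikv == 0 {
			return false
		}
	}
	return true
}

// effectiveReadMode falls back to leader read when closest-replica is
// requested but the cluster view is not balanced across zones.
func effectiveReadMode(requested string, instances []instance) string {
	if requested == "closest-replica" && !zonesBalanced(instances) {
		return "leader"
	}
	return requested
}

func main() {
	balanced := []instance{
		{"tidb", "az1"}, {"tikv", "az1"},
		{"tidb", "az2"}, {"tikv", "az2"},
	}
	// az2 has a tikv store but no tidb server.
	skewed := []instance{
		{"tidb", "az1"}, {"tikv", "az1"},
		{"tikv", "az2"},
	}
	fmt.Println(effectiveReadMode("closest-replica", balanced)) // closest-replica
	fmt.Println(effectiveReadMode("closest-replica", skewed))   // leader
}
```

In this sketch, a stale tidb entry (such as one left in etcd after a crash) would make a zone look populated when it is not, which is the kind of misjudgment the first note below describes.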

NOTE: in my benchmark, two problems may reduce the effectiveness of this check:

  1. The check depends on infosync.GetAllServerInfo to fetch all active tidb instances. This information is stored in etcd with a TTL of 10min. Because of #36793, tidb is likely to panic at exit and then cannot delete its own entry from etcd, which may cause misjudgment.
  2. If the app uses long-lived connections, traffic can't be dispatched to a newly started tidb because existing connections can't be moved. So even though the cluster itself is balanced, the traffic is still uneven. This PR can't handle this kind of case.

Check List

Tests

  • [x] Unit test
  • [ ] Integration test
  • [x] Manual test (add detailed scripts or steps below)
  • [ ] No code

Side effects

  • [ ] Performance regression: Consumes more CPU
  • [ ] Performance regression: Consumes more Memory
  • [ ] Breaking backward compatibility

Documentation

  • [ ] Affects user behaviors
  • [ ] Contains syntax changes
  • [ ] Contains variable changes
  • [ ] Contains experimental features
  • [ ] Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

glorv avatar Aug 02 '22 13:08 glorv

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • nolouch
  • qw4990

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment. After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review. Reviewer can cancel approval by submitting a request changes review.

ti-chi-bot avatar Aug 02 '22 13:08 ti-chi-bot

@qw4990 @nolouch PTAL, thank you

glorv avatar Aug 02 '22 13:08 glorv

Code Coverage Details: https://codecov.io/github/pingcap/tidb/commit/0e349a3bc827fbc192fee308c42ad4d5e23b49eb

sre-bot avatar Aug 02 '22 13:08 sre-bot

/build

glorv avatar Aug 05 '22 04:08 glorv

@nolouch @qw4990 Could you please take a look? The related tidb-operator PR is merged now.

glorv avatar Aug 11 '22 06:08 glorv

@qw4990 @winoros PTAL

glorv avatar Aug 29 '22 06:08 glorv

/merge

qw4990 avatar Sep 07 '22 03:09 qw4990

This pull request has been accepted and is ready to merge.

Commit hash: 35741c816977b255b67a37bb25b3ff9725722800

ti-chi-bot avatar Sep 07 '22 03:09 ti-chi-bot

TiDB MergeCI notify

🔴 Bad News! New failing [2] after this pr merged. These new failed integration tests seem to be caused by the current PR, please try to fix these new failed integration tests, thanks!

| CI Name | Result | Duration | Compare with Parent commit |
| --- | --- | --- | --- |
| idc-jenkins-ci-tidb/integration-common-test | 🟥 failed 3, success 14, total 17 | 19 min | New failing |
| idc-jenkins-ci-tidb/integration-ddl-test | 🟥 failed 1, success 5, total 6 | 5 min 18 sec | New failing |
| idc-jenkins-ci/integration-cdc-test | 🟢 all 37 tests passed | 32 min | Existing passed |
| idc-jenkins-ci-tidb/common-test | 🟢 all 11 tests passed | 20 min | Existing passed |
| idc-jenkins-ci-tidb/tics-test | 🟢 all 1 tests passed | 6 min 44 sec | Existing passed |
| idc-jenkins-ci-tidb/sqllogic-test-2 | 🟢 all 28 tests passed | 5 min 32 sec | Existing passed |
| idc-jenkins-ci-tidb/sqllogic-test-1 | 🟢 all 26 tests passed | 4 min 34 sec | Existing passed |
| idc-jenkins-ci-tidb/mybatis-test | 🟢 all 1 tests passed | 3 min 55 sec | Existing passed |
| idc-jenkins-ci-tidb/integration-compatibility-test | 🟢 all 1 tests passed | 3 min 22 sec | Existing passed |
| idc-jenkins-ci-tidb/plugin-test | 🟢 build success, plugin test success | 4 min | Existing passed |

sre-bot avatar Sep 07 '22 04:09 sre-bot

goleak found a leaked goroutine related to br:

[2022-09-07T03:39:52.041Z] goleak: Errors on successful test run: found unexpected goroutines:

[2022-09-07T03:39:52.041Z] [Goroutine 16818 in state select, with go.etcd.io/etcd/client/v3.waitRetryBackoff on top of the stack:

[2022-09-07T03:39:52.041Z] goroutine 16818 [select]:

[2022-09-07T03:39:52.041Z] go.etcd.io/etcd/client/v3.waitRetryBackoff({0x422dcb8, 0xc000bd9c80}, 0x4?, 0xc001e78600?)

[2022-09-07T03:39:52.041Z] 	/go/pkg/mod/go.etcd.io/etcd/client/[email protected]/retry_interceptor.go:302 +0xa5

[2022-09-07T03:39:52.041Z] go.etcd.io/etcd/client/v3.(*Client).unaryClientInterceptor.func1({0x422dc80?, 0xc00243c300?}, {0x3cb51a1, 0x16}, {0x3ba3260, 0xc002318000}, {0x3b5bf20, 0xc002256050}, 0xc000d3c000, 0x3dab8b8, ...)

[2022-09-07T03:39:52.041Z] 	/go/pkg/mod/go.etcd.io/etcd/client/[email protected]/retry_interceptor.go:50 +0x1fa

[2022-09-07T03:39:52.041Z] google.golang.org/grpc.(*ClientConn).Invoke(0x60?, {0x422dc80?, 0xc00243c300?}, {0x3cb51a1?, 0x6?}, {0x3ba3260?, 0xc002318000?}, {0x3b5bf20?, 0xc002256050?}, {0xc00243c360, ...})

[2022-09-07T03:39:52.041Z] 	/go/pkg/mod/google.golang.org/[email protected]/call.go:35 +0x223

[2022-09-07T03:39:52.041Z] go.etcd.io/etcd/api/v3/etcdserverpb.(*kVClient).Range(0xc000fb9fb0, {0x422dc80, 0xc00243c300}, 0xc002318000?, {0xc00243c360, 0x4, 0x6})

[2022-09-07T03:39:52.041Z] 	/go/pkg/mod/go.etcd.io/etcd/api/[email protected]/etcdserverpb/rpc.pb.go:6460 +0xc9

[2022-09-07T03:39:52.041Z] go.etcd.io/etcd/client/v3.(*retryKVClient).Range(0xc0021a3200, {0x422dc80, 0xc00243c300}, 0x691a80?, {0x6408be0, 0x3, 0x3})

[2022-09-07T03:39:52.041Z] 	/go/pkg/mod/go.etcd.io/etcd/client/[email protected]/retry.go:105 +0x133

[2022-09-07T03:39:52.041Z] go.etcd.io/etcd/client/v3.(*kv).Do(0xc001252d80, {_, _}, {0x1, {0xc0017be120, 0x11, 0x18}, {0xc0017be138, 0x11, 0x11}, ...})

[2022-09-07T03:39:52.041Z] 	/go/pkg/mod/go.etcd.io/etcd/client/[email protected]/kv.go:149 +0x1e8

[2022-09-07T03:39:52.041Z] go.etcd.io/etcd/client/v3.(*kv).Get(0x422dc48?, {0x422dc80, 0xc00243c300}, {0x3ca2f12?, 0x2?}, {0xc001502370?, 0x0?, 0x0?})

[2022-09-07T03:39:52.041Z] 	/go/pkg/mod/go.etcd.io/etcd/client/[email protected]/kv.go:119 +0xdc

[2022-09-07T03:39:52.041Z] github.com/pingcap/tidb/domain/infosync.getInfo({0x422dc48?, 0xc000328000?}, 0xc001818e00, {0x3ca2f12, 0x11}, 0x5, 0x30?, {0xc001502370, 0x1, 0x1})

[2022-09-07T03:39:52.041Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/domain/infosync/info.go:938 +0x17e

[2022-09-07T03:39:52.041Z] github.com/pingcap/tidb/domain/infosync.(*InfoSyncer).getAllServerInfo(0xc0025ce1c0, {0x422dc48, 0xc000328000})

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/domain/infosync/info.go:594 +0xc7

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/domain/infosync.GetAllServerInfo({0x422dc48, 0xc000328000})

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/domain/infosync/info.go:341 +0x45

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/infoschema.GetTiDBServerInfo({0xc001f88ad0?, 0x33afad9?})

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/infoschema/tables.go:1647 +0x3c

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/infoschema.GetClusterServerInfo({0x4292410, 0xc001b79b80})

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/infoschema/tables.go:1632 +0xf9

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/executor.fetchClusterConfig({0x4292410, 0xc001b79b80}, 0xc001f88e88, 0xc001f88e88)

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/executor/memtable_reader.go:170 +0x70

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/executor.(*ShowExec).fetchShowClusterConfigs(0xc00115e840, {0x0?, 0x400?})

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/executor/show.go:1253 +0x11e

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/executor.(*ShowExec).fetchAll(0x4231ee0?, {0x422dcb8?, 0xc001bd89f0?})

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/executor/show.go:150 +0x18c

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/executor.(*ShowExec).Next(0xc00115e840, {0x422dcb8, 0xc001bd89f0}, 0xc0007a64b0)

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/executor/show.go:115 +0xc8

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/executor.Next({0x422dcb8, 0xc001bd89f0}, {0x4231ee0, 0xc00115e840}, 0xc0007a64b0)

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/executor/executor.go:324 +0x4f2

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/executor.(*SelectionExec).Next(0xc001eb8410, {0x422dcb8, 0xc001bd89f0}, 0xc0007a6640)

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/executor/executor.go:1560 +0xf7

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/executor.Next({0x422dcb8, 0xc001bd89f0}, {0x4231d20, 0xc001eb8410}, 0xc0007a6640)

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/executor/executor.go:324 +0x4f2

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/executor.(*ExecStmt).next(0xc002926870, {0x422dcb8, 0xc001bd89f0}, {0x4231d20, 0xc001eb8410}, 0xc0006fa800?)

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/executor/adapter.go:937 +0x78

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/executor.(*recordSet).Next(0xc0007a65f0, {0x422dcb8?, 0xc001bd89f0?}, 0xc0007a6640)

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/executor/adapter.go:152 +0xc5

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/session.drainRecordSet({0x422dcb8, 0xc001bd89f0}, 0xc001b79b80, {0x422e540, 0xc001bd93b0}, {0x0?, 0x0?})

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/session/session.go:1284 +0xea

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/session.(*session).ExecRestrictedSQL.func1({0x422dcb8, 0xc001bd8990}, 0xc001b79b80)

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/session/session.go:1940 +0x2f7

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/session.(*session).withRestrictedSQLExecutor(0x38469e0?, {0x422dcb8, 0xc001bd8990}, {0x0, 0x0, 0xc000328000?}, 0xc0013b38c0)

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/session/session.go:1913 +0x2e8

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/session.(*session).ExecRestrictedSQL(0xafef70f690bb78f5?, {0x422dcb8?, 0xc001bd8990?}, {0x0?, 0x0?, 0x0?}, {0x3d082b1?, 0xc000ef7440?}, {0x0, 0x0, ...})

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/session/session.go:1917 +0x8e

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/br/pkg/utils.IsLogBackupEnabled({0x7fa21c329cd8, 0xc001b78780})

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/br/pkg/utils/db.go:72 +0xa2

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/br/pkg/utils.CheckLogBackupEnabled({0x4292410?, 0xc001b78780?})

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/br/pkg/utils/db.go:54 +0x56

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/store/gcworker.(*GCWorker).checkLeader(0xc000d28000, {0x422dcb8, 0xc00265f530})

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/store/gcworker/gc_worker.go:1793 +0x12f

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/store/gcworker.(*GCWorker).tick(0xc0013b3e60?, {0x422dcb8, 0xc00265f530})

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/store/gcworker/gc_worker.go:286 +0x45

[2022-09-07T03:39:52.042Z] github.com/pingcap/tidb/store/gcworker.(*GCWorker).start(0xc000d28000, {0x422dcb8, 0xc00265f530}, 0xc000328008?)

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/store/gcworker/gc_worker.go:229 +0x4e5

[2022-09-07T03:39:52.042Z] created by github.com/pingcap/tidb/store/gcworker.(*GCWorker).Start

[2022-09-07T03:39:52.042Z] 	/home/jenkins/agent/workspace/tidb_ghpr_integration_ddl_test/go/src/github.com/pingcap/tidb/store/gcworker/gc_worker.go:120 +0x118

[2022-09-07T03:39:52.042Z] 

[2022-09-07T03:39:52.042Z] ]

/cc @3pointer https://ci.pingcap.net/blue/organizations/jenkins/tidb_ghpr_integration_ddl_test/detail/tidb_ghpr_integration_ddl_test/11277/pipeline

glorv avatar Sep 07 '22 04:09 glorv