
TiCDC restarts and changefeed lag exceeds 10 min when a network partition is injected between the PD leader and PD followers

Open • Lily2025 opened this issue 1 year ago • 7 comments

What did you do?

1. Run TPC-C with 10 threads and 1000 warehouses.
2. After 10 minutes, simulate network isolation of the PD leader from all PD followers. Fault start time: 2023-06-13 09:01:47.
3. After 10 minutes, recover from the fault. Fault recovery time: 2023-06-13 09:11:48.

What did you expect to see?

Changefeed lag stays below 30s.

What did you see instead?

TiCDC lag exceeded 10 minutes after the fault was injected. [screenshot]

The PD leader switched normally. [screenshot]

Versions of the cluster

git hash: 1e2f277f2e3d9b57b15db9a2a9b2c62832c071ca

Current status of DM cluster (execute query-status <task-name> in dmctl)

No response

Lily2025 • Jun 14 '23

/remove-area dm /area ticdc

Lily2025 • Jun 14 '23

[screenshot]

fubinzh • Jun 14 '23

/severity major

fubinzh • Jun 15 '23

/assign @asddongmen

nongfushanquan • Sep 21 '23

@nongfushanquan: GitHub didn't allow me to assign the following users: asddongmen.

Note that only pingcap members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/assign @asddongmen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ti-chi-bot[bot] • Sep 21 '23

Injected a network partition between the PD leader and PD followers. [screenshot]

Two TiCDC instances crashed. [screenshots]

Lily2025 • Dec 21 '23

Injected a network partition between the TiCDC owner and all other pods; TiCDC restarted. Chaos window: 2024/02/28 19:05:36 to 2024/02/28 19:08:36. [screenshot]

TiCDC logs:

[2024/02/28 19:08:37.410 +08:00] [ERROR] [tso_dispatcher.go:562] ["[tso] update connection contexts failed"] [dc=global] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.200.49.116:2379: i/o timeout""]
[2024/02/28 19:08:37.410 +08:00] [ERROR] [pd.go:228] ["updateTS error"] [txnScope=global] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\tgithub.com/tikv/pd/client@…/tso_dispatcher.go:118\ngithub.com/tikv/pd/client.(*client).GetTS\n\tgithub.com/tikv/pd/client@…/client.go:803\ngithub.com/tikv/client-go/v2/util.InterceptedPDClient.GetTS\n\tgithub.com/tikv/client-go/v2@…/util/pd_interceptor.go:81\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\tgithub.com/tikv/client-go/v2@…/oracle/oracles/pd.go:147\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\tgithub.com/tikv/client-go/v2@…/oracle/oracles/pd.go:226\nsync.(*Map).Range\n\tsync/map.go:476\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\tgithub.com/tikv/client-go/v2@…/oracle/oracles/pd.go:224\nruntime.goexit\n\truntime/asm_amd64.s:1650\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\tgithub.com/tikv/client-go/v2@…/oracle/oracles/pd.go:152\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\tgithub.com/tikv/client-go/v2@…/oracle/oracles/pd.go:226\nsync.(*Map).Range\n\tsync/map.go:476\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\tgithub.com/tikv/client-go/v2@…/oracle/oracles/pd.go:224\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:37.410 +08:00] [INFO] [tso_dispatcher.go:344] ["[tso] exit tso dispatcher"] [dc-location=global]
[2024/02/28 19:08:37.410 +08:00] [INFO] [tso_client.go:139] ["close tso client"]
[2024/02/28 19:08:37.410 +08:00] [INFO] [tso_client.go:150] ["tso client is closed"]
[2024/02/28 19:08:37.410 +08:00] [INFO] [pd_service_discovery.go:664] ["[pd] close pd service discovery client"]
[2024/02/28 19:08:37.410 +08:00] [INFO] [client.go:319] ["[pd] http client closed"] [source=tikv-driver]
[2024/02/28 19:08:37.413 +08:00] [WARN] [upstream.go:299] ["etcd session close failed"] [error="etcdserver: requested lease not found"]
[2024/02/28 19:08:37.413 +08:00] [INFO] [upstream.go:305] ["upstream closed"] [upstreamID=7340490029962833542]
[2024/02/28 19:08:38.370 +08:00] [ERROR] [pd_service_discovery.go:613] ["[pd] failed to update service mode"] [urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"] [error="[PD:client:ErrClientGetClusterInfo]error:rpc error: code = DeadlineExceeded desc = context deadline exceeded target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY: error:rpc error: code = DeadlineExceeded desc = context deadline exceeded target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY"]
[2024/02/28 19:08:38.379 +08:00] [WARN] [server.go:315] ["etcd health check: cannot collect all members"] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="rpc error: code = DeadlineExceeded desc = context deadline exceeded\ngithub.com/tikv/pd/client.(*client).respForErr\n\tgithub.com/tikv/pd/client@…/client.go:1550\ngithub.com/tikv/pd/client.(*client).GetAllMembers\n\tgithub.com/tikv/pd/client@…/client.go:735\ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).CollectMemberEndpoints\n\tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:346\ngithub.com/pingcap/tiflow/cdc/server.(*server).upstreamPDHealthChecker\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:313\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:347\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@…/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:47.414 +08:00] [WARN] [check.go:88] ["check TiKV version failed"] [error="[CDC:ErrGetAllStoresFailed]get stores from pd failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="[CDC:ErrGetAllStoresFailed]get stores from pd failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@…/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@…/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/pkg/version.CheckStoreVersion\n\tgithub.com/pingcap/tiflow/pkg/version/check.go:209\ngithub.com/pingcap/tiflow/pkg/version.CheckClusterVersion\n\tgithub.com/pingcap/tiflow/pkg/version/check.go:83\ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:179\ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:116\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:250\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:333\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:308\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func6\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:372\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@…/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:47.427 +08:00] [INFO] [pd_service_discovery.go:1016] ["[pd] update member urls"] [old-urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd:2379]"] [new-urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"]
[2024/02/28 19:08:47.427 +08:00] [INFO] [pd_service_discovery.go:1043] ["[pd] switch leader"] [new-leader=http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379/] [old-leader=]
[2024/02/28 19:08:47.427 +08:00] [INFO] [pd_service_discovery.go:525] ["[pd] init cluster id"] [cluster-id=7340490029962833542]
[2024/02/28 19:08:47.427 +08:00] [INFO] [client.go:606] ["[pd] changing service mode"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=PD_SVC_MODE]
[2024/02/28 19:08:47.427 +08:00] [INFO] [tso_client.go:231] ["[tso] switch dc tso global allocator serving address"] [dc-location=global] [new-address=http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379/]
[2024/02/28 19:08:47.428 +08:00] [INFO] [tso_dispatcher.go:323] ["[tso] tso dispatcher created"] [dc-location=global]
[2024/02/28 19:08:47.428 +08:00] [INFO] [client.go:654] ["[pd] service mode changed"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=PD_SVC_MODE]
[2024/02/28 19:08:47.429 +08:00] [INFO] [tikv_driver.go:200] ["using API V1."]
[2024/02/28 19:08:47.429 +08:00] [INFO] [tso_dispatcher.go:441] ["[tso] tso stream is not ready"] [dc=global]
[2024/02/28 19:08:48.371 +08:00] [ERROR] [pd_service_discovery.go:613] ["[pd] failed to update service mode"] [urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"] [error="[PD:client:ErrClientGetClusterInfo]error:rpc error: code = DeadlineExceeded desc = context deadline exceeded target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY: error:rpc error: code = DeadlineExceeded desc = context deadline exceeded target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY"]
[2024/02/28 19:08:48.380 +08:00] [WARN] [server.go:315] ["etcd health check: cannot collect all members"] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="rpc error: code = DeadlineExceeded desc = context deadline exceeded\ngithub.com/tikv/pd/client.(*client).respForErr\n\tgithub.com/tikv/pd/client@…/client.go:1550\ngithub.com/tikv/pd/client.(*client).GetAllMembers\n\tgithub.com/tikv/pd/client@…/client.go:735\ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).CollectMemberEndpoints\n\tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:346\ngithub.com/pingcap/tiflow/cdc/server.(*server).upstreamPDHealthChecker\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:313\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:347\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@…/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:57.429 +08:00] [ERROR] [tso_dispatcher.go:202] ["[tso] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2024/02/28 19:08:57.429 +08:00] [ERROR] [tso_dispatcher.go:498] ["[tso] getTS error after processing requests"] [dc-location=global] [stream-addr=http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379/] [error="[PD:client:ErrClientGetTSO]get TSO failed, %v: rpc error: code = Canceled desc = context canceled"]
[2024/02/28 19:08:57.429 +08:00] [ERROR] [capture.go:335] ["reset capture failed"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*pdTSOStream).processRequests\n\tgithub.com/tikv/pd/client@…/tso_stream.go:149\ngithub.com/tikv/pd/client.(*tsoClient).processRequests\n\tgithub.com/tikv/pd/client@…/tso_dispatcher.go:763\ngithub.com/tikv/pd/client.(*tsoClient).handleDispatcher\n\tgithub.com/tikv/pd/client@…/tso_dispatcher.go:488\nruntime.goexit\n\truntime/asm_amd64.s:1650\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\tgithub.com/tikv/pd/client@…/tso_dispatcher.go:104\ngithub.com/tikv/pd/client.(*client).GetTS\n\tgithub.com/tikv/pd/client@…/client.go:803\ngithub.com/pingcap/tiflow/pkg/pdutil.NewClock\n\tgithub.com/pingcap/tiflow/pkg/pdutil/clock.go:62\ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:197\ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:116\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:250\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:333\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:308\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func6\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:372\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@…/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:57.430 +08:00] [INFO] [capture.go:328] ["the capture routine has exited"]
[2024/02/28 19:08:57.430 +08:00] [WARN] [server.go:315] ["etcd health check: cannot collect all members"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).respForErr\n\tgithub.com/tikv/pd/client@…/client.go:1550\ngithub.com/tikv/pd/client.(*client).GetAllMembers\n\tgithub.com/tikv/pd/client@…/client.go:735\ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).CollectMemberEndpoints\n\tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:346\ngithub.com/pingcap/tiflow/cdc/server.(*server).upstreamPDHealthChecker\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:313\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:347\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@…/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:57.430 +08:00] [ERROR] [server.go:298] ["http server error"] [error="[CDC:ErrServeHTTP]serve http error: mux: server closed"] [errorVerbose="[CDC:ErrServeHTTP]serve http error: mux: server closed\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@…/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@…/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/cdc/server.(*server).startStatusHTTP.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:298\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:57.430 +08:00] [WARN] [server.go:139] ["cdc server exits with error"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*pdTSOStream).processRequests\n\tgithub.com/tikv/pd/client@…/tso_stream.go:149\ngithub.com/tikv/pd/client.(*tsoClient).processRequests\n\tgithub.com/tikv/pd/client@…/tso_dispatcher.go:763\ngithub.com/tikv/pd/client.(*tsoClient).handleDispatcher\n\tgithub.com/tikv/pd/client@…/tso_dispatcher.go:488\nruntime.goexit\n\truntime/asm_amd64.s:1650\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\tgithub.com/tikv/pd/client@…/tso_dispatcher.go:104\ngithub.com/tikv/pd/client.(*client).GetTS\n\tgithub.com/tikv/pd/client@…/client.go:803\ngithub.com/pingcap/tiflow/pkg/pdutil.NewClock\n\tgithub.com/pingcap/tiflow/pkg/pdutil/clock.go:62\ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:197\ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:116\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:250\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:333\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:308\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func6\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:372\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@…/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/02/28 19:08:57.430 +08:00] [INFO] [capture.go:707] ["message router closed"] [captureID=a277c9b2-c0b6-4ef0-aa9d-3d51b50cd83f]
[2024/02/28 19:08:57.432 +08:00] [INFO] [server.go:424] ["sort engine manager closed"] [duration=2.032547ms]
[2024/02/28 19:08:57.432 +08:00] [INFO] [pd_service_discovery.go:577] ["[pd] exit member loop due to context canceled"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [resource_manager_client.go:295] ["[resource manager] exit resource token dispatcher"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_dispatcher.go:240] ["exit tso dispatcher loop"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_dispatcher.go:410] ["[tso] stop fetching the pending tso requests due to context canceled"] [dc-location=global]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_dispatcher.go:344] ["[tso] exit tso dispatcher"] [dc-location=global]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_dispatcher.go:186] ["exit tso requests cancel loop"]
[2024/02/28 19:08:57.432 +08:00] [ERROR] [pd_service_discovery.go:613] ["[pd] failed to update service mode"] [urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"] [error="[PD:client:ErrClientGetClusterInfo]error:rpc error: code = Canceled desc = context canceled target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY: error:rpc error: code = Canceled desc = context canceled target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY"]
[2024/02/28 19:08:57.432 +08:00] [ERROR] [pd_service_discovery.go:613] ["[pd] failed to update service mode"] [urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"] [error="[PD:client:ErrClientGetClusterInfo]error:rpc error: code = Canceled desc = context canceled target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY: error:rpc error: code = Canceled desc = context canceled target:tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_client.go:134] ["closing tso client"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_client.go:139] ["close tso client"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_client.go:150] ["tso client is closed"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [pd_service_discovery.go:664] ["[pd] close pd service discovery client"]
[2024/02/28 19:08:59.752 +08:00] [INFO] [helper.go:54] ["init log"] [file=/var/lib/ticdc/log/ticdc.log] [level=info]
[2024/02/28 19:08:59.752 +08:00] [INFO] [tz.go:34] ["Use the timezone of the TiCDC server machine"] [timezoneName=System] [timezone=Asia/Shanghai]
[2024/02/28 19:08:59.752 +08:00] [INFO] [version.go:47] ["Welcome to Change Data Capture (CDC)"] [release-version=v8.0.0-alpha] [git-hash=25ce29c2a1802bbb4cd26008f322728959a91f7a] [git-branch=heads/refs/tags/v8.0.0-alpha] [utc-build-time="2024-02-27 11:37:29"] [go-version="go version go1.21.6 linux/amd64"] [failpoint-build=false]
[2024/02/28 19:08:59.752 +08:00] [INFO] [server.go:125] ["CDC server created"] [pd="[http://tc-pd:2379/]"] [config="{"addr":"0.0.0.0:8301","advertise-addr":"tc-ticdc-1.tc-ticdc-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:8301","log-file":"/var/lib/ticdc/log/ticdc.log","log-level":"info","log":{"file":{"max-size":301,"max-days":0,"max-backups":0},"error-output":"stderr"},"data-dir":"","gc-ttl":86400,"tz":"System","capture-session-ttl":10,"owner-flush-interval":50000000,"processor-flush-interval":50000000,"sorter":{"sort-dir":"/tmp/sorter","cache-size-in-mb":128},"security":{"ca-path":"","cert-path":"","key-path":"","cert-allowed-cn":null,"mtls":false,"client-user-required":false,"client-allowed-user":null},"kv-client":{"enable-multiplexing":true,…

Lily2025 • Feb 29 '24

@asddongmen will check whether this can be addressed by https://github.com/etcd-io/etcd/pull/17465#event-11888619658. If not, I suggest we address it in the long term.

flowbehappy • Apr 08 '24

After https://github.com/pingcap/tiflow/pull/10881 was merged, the checkpointTs lag in the pd-leader-io-hang cases dropped below 120s, which meets the requirement. [screenshot]

asddongmen • Apr 15 '24

@Lily2025: Reopened this issue.

In response to this:

/reopen


ti-chi-bot[bot] • May 24 '24

/remove-type bug /type enhancement

Lily2025 • May 24 '24

closed

Lily2025 • May 28 '24