tiflow icon indicating copy to clipboard operation
tiflow copied to clipboard

changefeed task failed after inject s3 failure last for 1mins

Open Lily2025 opened this issue 11 months ago • 3 comments

What did you do?

1、br restore data 2、create changefeed with redo log config: "memory-quota = 2147483648\n[consistent]\nlevel = "eventual"\nflush-interval = 500\nstorage = "s3://tmp/test-changefeed?access-key=xxxn&secret-access-key=xxx&endpoint=http%3a%2f%2fminio-peer%3a9000""

3、inject s3 failure last for 1mins [2024/03/04 21:00:07.457 +08:00] [INFO] [chaos.go:74] ["Run chaos"] [chaosId="ns=endless-ha-test-ticdc-tps-7170215-1-752,kind=failure,name=pod-failure-dkneisww,spec=&k8s.ChaosIdentifier{Namespace:"endless-ha-test-ticdc-tps-7170215-1-752", Name:"pod-failure-dkneisww", Spec:FailureExperimentSpec{Duration: "", Scheduler: }}"] [name=failure] [selectors="[endless-ha-test-ticdc-tps-7170215-1-752/minio-0]"] [selectorsRetainPolicy(selectors)="[endless-ha-test-ticdc-tps-7170215-1-752/minio-0]"] [targetSelectors="[nil]"] [targetSelectorsRetainPolicy(targetSelectors)="[nil]"] [experimentSpec="FailureExperimentSpec{Duration: "", Scheduler: }"] [2024/03/04 21:01:07.458 +08:00] [INFO] [chaos.go:92] ["Clean chaos"] [name=failure] [chaosId="ns=endless-ha-test-ticdc-tps-7170215-1-752,kind=failure,name=pod-failure-dkneisww,spec=&k8s.ChaosIdentifier{Namespace:"endless-ha-test-ticdc-tps-7170215-1-752", Name:"pod-failure-dkneisww", Spec:FailureExperimentSpec{Duration: "", Scheduler: }}"]

ticdc logs: ticdc-1.tar.gz

What did you expect to see?

changefeed can recover automatically after fault recover

What did you see instead?

changefeed task failed after inject s3 failure last for 1mins

{ "upstream_id": 7342169415863000245, "namespace": "default", "id": "ticdc-task", "sink_uri": "mysql://root:xxxxx@cdc-downstream-tc-tidb.endless-ha-test-ticdc-tps-7170215-1-752:4000", "create_time": "2024-03-04 00:42:24.389", "start_ts": 448131011512107011, "resolved_ts": 0, "target_ts": 0, "checkpoint_tso": 448150164062077038, "checkpoint_time": "2024-03-04 21:00:05.437", "sort_engine": "unified", "state": "failed", "error": null, "error_history": null, "creator_version": "v8.0.0-alpha" }

ticdc logs: [2024/03/04 21:00:08.356 +08:00] [ERROR] [manager.go:269] ["redo: failed to create redo log writer"] [namespace=default] [changefeed=ticdc-task] [duration=1.046949ms] [error="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log"] [errorVerbose="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/pkg/errors.WrapChangefeedUnretryableErr\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:155\ngithub.com/pingcap/tiflow/pkg/redo.glob..func1\n\tgithub.com/pingcap/tiflow/pkg/redo/config.go:195\ngithub.com/pingcap/tiflow/cdc/redo/writer/memory.NewLogWriter\n\tgithub.com/pingcap/tiflow/cdc/redo/writer/memory/mem_log_writer.go:53\ngithub.com/pingcap/tiflow/cdc/redo/writer/factory.NewRedoLogWriter\n\tgithub.com/pingcap/tiflow/cdc/redo/writer/factory/factory.go:53\ngithub.com/pingcap/tiflow/cdc/redo.(*logManager).Run\n\tgithub.com/pingcap/tiflow/cdc/redo/manager.go:267\ngithub.com/pingcap/tiflow/cdc/owner.(*changefeed).initialize.func5\n\tgithub.com/pingcap/tiflow/cdc/owner/changefeed.go:654\nruntime.goexit\n\truntime/asm_amd64.s:1650"] [2024/03/04 21:00:08.357 +08:00] [INFO] [manager.go:590] ["redo manager closed"] [namespace=default] [changefeed=ticdc-task] [2024/03/04 21:00:08.357 +08:00] [INFO] [shared_stream.go:149] ["event feed going to create grpc stream"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [streamID=13] [storeID=1] [addr=tc-tikv-0.tc-tikv-peer.endless-ha-test-ticdc-tps-7170215-1-752.svc:20160] [2024/03/04 21:00:08.357 +08:00] [INFO] [shared_stream.go:149] ["event feed going to create grpc stream"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [streamID=14] [storeID=4] [addr=tc-tikv-2.tc-tikv-peer.endless-ha-test-ticdc-tps-7170215-1-752.svc:20160] [2024/03/04 21:00:08.357 +08:00] [INFO] [changefeed.go:384] ["since the downstream is not a MQ, remove MQ only fields"] [namespace=default] [changefeed=ticdc-task] [2024/03/04 21:00:08.357 +08:00] [INFO] [changefeed.go:384] ["since the downstream is not a MQ, remove MQ only fields"] [namespace=default] [changefeed=ticdc-task] [2024/03/04 21:00:08.357 +08:00] [INFO] [changefeed.go:384] ["since the downstream is not a MQ, remove MQ only fields"] [namespace=default] [changefeed=ticdc-task] [2024/03/04 21:00:08.373 +08:00] [INFO] [tz.go:34] ["Use the timezone of the TiCDC server machine"] [timezoneName=System] [timezone=Asia/Shanghai] [2024/03/04 21:00:08.379 +08:00] [ERROR] [changefeed.go:299] ["changefeed tick failed"] [error="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log"] [errorVerbose="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/pkg/errors.WrapChangefeedUnretryableErr\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:155\ngithub.com/pingcap/tiflow/pkg/redo.glob..func1\n\tgithub.com/pingcap/tiflow/pkg/redo/config.go:195\ngithub.com/pingcap/tiflow/cdc/redo.(*metaManager).preStart\n\tgithub.com/pingcap/tiflow/cdc/redo/meta_manager.go:140\ngithub.com/pingcap/tiflow/cdc/redo.(*metaManager).Run\n\tgithub.com/pingcap/tiflow/cdc/redo/meta_manager.go:170\ngithub.com/pingcap/tiflow/cdc/owner.(*changefeed).initialize.func6\n\tgithub.com/pingcap/tiflow/cdc/owner/changefeed.go:663\nruntime.goexit\n\truntime/asm_amd64.s:1650"] [2024/03/04 21:00:08.379 +08:00] [ERROR] [changefeed.go:306] ["an error occurred in Owner"] [namespace=default] [changefeed=ticdc-task] [error="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log"] [errorVerbose="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/pkg/errors.WrapChangefeedUnretryableErr\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:155\ngithub.com/pingcap/tiflow/pkg/redo.glob..func1\n\tgithub.com/pingcap/tiflow/pkg/redo/config.go:195\ngithub.com/pingcap/tiflow/cdc/redo.(*metaManager).preStart\n\tgithub.com/pingcap/tiflow/cdc/redo/meta_manager.go:140\ngithub.com/pingcap/tiflow/cdc/redo.(*metaManager).Run\n\tgithub.com/pingcap/tiflow/cdc/redo/meta_manager.go:170\ngithub.com/pingcap/tiflow/cdc/owner.(*changefeed).initialize.func6\n\tgithub.com/pingcap/tiflow/cdc/owner/changefeed.go:663\nruntime.goexit\n\truntime/asm_amd64.s:1650"] [2024/03/04 21:00:08.379 +08:00] [INFO] [ddl_puller.go:756] ["close the ddl puller"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [2024/03/04 21:00:08.379 +08:00] [INFO] [ddl_sink.go:281] ["owner ddl sink background loop exits"] [namespace=default] [changefeed=ticdc-task] [error="context canceled"] [2024/03/04 21:00:08.379 +08:00] [INFO] [shared_stream.go:157] ["event feed grpc stream exits"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [streamID=13] [storeID=1] [addr=tc-tikv-0.tc-tikv-peer.endless-ha-test-ticdc-tps-7170215-1-752.svc:20160] [canceled=true] [2024/03/04 21:00:08.379 +08:00] [INFO] [shared_stream.go:157] ["event feed grpc stream exits"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [streamID=14] [storeID=4] [addr=tc-tikv-2.tc-tikv-peer.endless-ha-test-ticdc-tps-7170215-1-752.svc:20160] [canceled=true] [2024/03/04 21:00:08.379 +08:00] [INFO] [shared_client.go:268] ["event feed exits"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [2024/03/04 21:00:08.379 +08:00] [INFO] [multiplexing_puller.go:242] ["MultiplexingPuller exits"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [2024/03/04 21:00:08.380 +08:00] [WARN] [server.go:499] ["topic handler returned error"] [error="[CDC:ErrWorkerPoolHandleCancelled]workerpool handle is cancelled"] [errorVerbose="[CDC:ErrWorkerPoolHandleCancelled]workerpool handle is cancelled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/workerpool.(*defaultEventHandle).GracefulUnregister.func1\n\tgithub.com/pingcap/tiflow/pkg/workerpool/pool_impl.go:230\ngithub.com/pingcap/tiflow/pkg/workerpool.(*defaultEventHandle).GracefulUnregister\n\tgithub.com/pingcap/tiflow/pkg/workerpool/pool_impl.go:252\ngithub.com/pingcap/tiflow/pkg/p2p.(*MessageServer).run.func2\n\tgithub.com/pingcap/tiflow/pkg/p2p/server.go:275\nruntime.goexit\n\truntime/asm_amd64.s:1650"] [2024/03/04 21:00:08.380 +08:00] [INFO] [coordinator.go:261] ["schedulerv3: coordinator closed"] [ownerRev="{"revision":244137}"] [namespace=default] [changefeed=ticdc-task] [2024/03/04 21:00:08.380 +08:00] [INFO] [changefeed.go:780] ["changefeed closed"] [namespace=default] [changefeed=ticdc-task] [isRemoved=false] [2024/03/04 21:00:08.380 +08:00] [WARN] [server.go:297] ["handler not found"] [topic=changefeed/default/ticdc-task/scheduler] [2024/03/04 21:00:08.423 +08:00] [INFO] [processor.go:854] ["Processor drop sort engine successfully"] [namespace=default] [changefeed=ticdc-task] [2024/03/04 21:00:08.423 +08:00] [INFO] [processor.go:860] ["Processor try to close agent"] [namespace=default] [changefeed=ticdc-task] [2024/03/04 21:00:08.423 +08:00] [WARN] [server.go:499] ["topic handler returned error"] [error="[CDC:ErrWorkerPoolHandleCancelled]workerpool handle is cancelled"] errorVerbose="[CDC:ErrWorkerPoolHandleCancelled]workerpool handle is cancelled[ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/workerpool.(*defaultEventHandle).GracefulUnregister.func1\n\tgithub.com/pingcap/tiflow/pkg/workerpool/pool_impl.go:230\ngithub.com/pingcap/tiflow/pkg/workerpool.(*defaultEventHandle).GracefulUnregister\n\tgithub.com/pingcap/tiflow/pkg/workerpool/pool_impl.go:252\ngithub.com/pingcap/tiflow/pkg/p2p.(*MessageServer).run.func2\n\tgithub.com/pingcap/tiflow/pkg/p2p/server.go:275\nruntime.goexit\n\truntime/asm_amd64.s:1650"] [2024/03/04 21:00:08.423 +08:00] [INFO] [processor.go:869] ["Processor closed agent successfully"] [namespace=default] [changefeed=ticdc-task] [2024/03/04 21:00:08.423 +08:00] [INFO] [processor.go:878] ["processor closed"] [namespace=default] [changefeed=ticdc-task] [2024/03/04 21:00:08.423 +08:00] [WARN] [server.go:297] ["handler not found"] [topic=changefeed/default/ticdc-task/agent]

Versions of the cluster

./cdc version Release Version: v8.0.0-alpha Git Commit Hash: 31dd24f902d081e50f16a19d5fb3d7b0fdfae5b5 Git Branch: heads/refs/tags/v8.0.0-alpha UTC Build Time: 2024-03-01 11:38:18 Go Version: go version go1.21.6 linux/amd64 Failpoint Build: false

current status of DM cluster (execute query-status <task-name> in dmctl)

No response

Lily2025 avatar Mar 05 '24 06:03 Lily2025

/remove-area dm /area ticdc

Lily2025 avatar Mar 05 '24 06:03 Lily2025

/type bug /assign sdojjy

Lily2025 avatar Mar 05 '24 06:03 Lily2025

/severity major

fubinzh avatar Mar 07 '24 07:03 fubinzh