
[CDC:ErrTableListenReplicated]A table(xxx) is being replicated by at least two processors

yilongrong opened this issue on Jul 29, 2022

What did you do?

  1. cdc.log reported the following error (a sketch of the duplicate-assignment check that raises it follows step 7 below):

     [changefeed.go:106] ["an error occurred in Owner"] [changefeedID=kafka-tc-yxt-prod] [error="[CDC:ErrTableListenReplicated]A table(12920) is being replicated by at least two processors(aefbf88f-eb9e-4d3e-b1bb-4b1e6a705b0f, 5af5719f-88ed-4711-837f-e64dbfd12e62), please report a bug"] [errorVerbose="[CDC:ErrTableListenReplicated]A table(12920) is being replicated by at least two processors(aefbf88f-eb9e-4d3e-b1bb-4b1e6a705b0f, 5af5719f-88ed-4711-837f-e64dbfd12e62), please report a bug\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:156\ngithub.com/pingcap/ticdc/cdc/owner.(*scheduler).table2CaptureIndex\n\tgithub.com/pingcap/ticdc/cdc/owner/scheduler.go:150\ngithub.com/pingcap/ticdc/cdc/owner.(*scheduler).syncTablesWithCurrentTables\n\tgithub.com/pingcap/ticdc/cdc/owner/scheduler.go:231\ngithub.com/pingcap/ticdc/cdc/owner.(*scheduler).Tick\n\tgithub.com/pingcap/ticdc/cdc/owner/scheduler.go:74\ngithub.com/pingcap/ticdc/cdc/owner.(*changefeed).tick\n\tgithub.com/pingcap/ticdc/cdc/owner/changefeed.go:169\ngithub.com/pingcap/ticdc/cdc/owner.(*changefeed).Tick\n\tgithub.com/pingcap/ticdc/cdc/owner/changefeed.go:105\ngithub.com/pingcap/ticdc/cdc/owner.(*Owner).Tick\n\tgithub.com/pingcap/ticdc/cdc/owner/owner.go:134\ngithub.com/pingcap/ticdc/pkg/orchestrator.(*EtcdWorker).Run\n\tgithub.com/pingcap/ticdc/pkg/orchestrator/etcd_worker.go:187\ngithub.com/pingcap/ticdc/cdc/capture.(*Capture).runEtcdWorker\n\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:271\ngithub.com/pingcap/ticdc/cdc/capture.(*Capture).campaignOwner\n\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:250\ngithub.com/pingcap/ticdc/cdc/capture.(*Capture).run.func2\n\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:175\nruntime.goexit\n\truntime/asm_amd64.s:1371"]

  2. Searching the logs for TableID 12920 shows that the table had just been created by a CREATE TABLE DDL:

     [async_sink.go:137] ["Execute DDL succeeded"] [changefeed=kafka-tc-yxt-prod] [ignored=false] [ddl="{\"StartTs\":434867736247009555,\"CommitTs\":434867736286331009,\"TableInfo\":{\"Schema\":\"yxt\",\"Table\":\"ssp_platform_rank\",\"TableID\":12920,\"ColumnInfo\":[{\"Name\":\"id\",\"Type\":8},{\"Name\":\"user_id\",\"Type\":254},{\"Name\":\"org_id\",\"Type\":254},{\"Name\":\"score\",\"Type\":246},{\"Name\":\"score_type\",\"Type\":1},{\"Name\":\"date_type\",\"Type\":1},{\"Name\":\"create_time\",\"Type\":12},{\"Name\":\"update_time\",\"Type\":12}]},\"PreTableInfo\":null,\"Query\":\"CREATE TABLE IF NOT EXISTS xxx.xxxx (id BIGINT(36) NOT NULL COMMENT 'id',user_id CHAR(36) NOT NULL COMMENT '用户id',org_id CHAR(36) NOT NULL COMMENT 'xxid',score DECIMAL(18,2) NOT NULL COMMENT 'score',score_type TINYINT(4) NOT NULL COMMENT '1x,2xx',date_type TINYINT(4) NOT NULL COMMENT '0x,1x,2x,3x',create_time DATETIME NOT NULL COMMENT 'create_time',update_time DATETIME NOT NULL COMMENT 'update_time',PRIMARY KEY(id)) COMMENT = 'xxx'\",\"Type\":3}"]

  3. Checking the TiCDC logs right before the error, there are two "add a table" operations, which result in two "Add table pipeline" operations:

     [xxx.xxx.5.147] Jul 27, 2022 @ 10:26:09.620 [owner.go:268] ["add a table"] [tableId=12920] [boundaryTs=434867736286331009]
     [xxx.xxx.5.147] Jul 27, 2022 @ 10:26:09.628 [owner.go:268] ["add a table"] [tableId=12920] [boundaryTs=434867736286331009]
     [xxx.xxx.5.143] Jul 27, 2022 @ 10:26:09.633 [processor.go:740] ["Add table pipeline"] [tableID=12920] [changefeed=kafka-tc-yxt-prod] [name=yxt.xxxxx] [replicaInfo=" {\"start-ts\":434867736286331009,\"mark-table-id\":0} "] [globalResolvedTs=434867736286331009]
     [xxx.xxx.5.147] Jul 27, 2022 @ 10:26:09.634 [processor.go:740] ["Add table pipeline"] [tableID=12920] [changefeed=kafka-tc-yxt-prod] [name=yxt.ssp_platform_rank] [replicaInfo=" {\"start-ts\":434867736286331009,\"mark-table-id\":0} "] [globalResolvedTs=434867736286331009]

  4. In the TiCDC logs, both 5.147 and 5.143 tried to lock the range of regionID=535009, and the checkpointTs of the lock equals the boundaryTs of the "add a table" operation:

     [owner.go:268] ["add a table"] [tableId=12920] [boundaryTs=434867736286331009]
     [xxx.xxx.5.147] ticdc [region_range_lock.go:218] ["range locked"] [lockID=856] [regionID=535009] [startKey=7480000000000032ff785f720000000000fa] [endKey=7480000000000032ff785f730000000000fa] [checkpointTs=434867736286331009]
     [xxx.xxx.5.143] ticdc [region_range_lock.go:218] ["range locked"] [lockID=845] [regionID=535009] [startKey=7480000000000032ff785f720000000000fa] [endKey=7480000000000032ff785f730000000000fa] [checkpointTs=434867736286331009]

  5. Searching the TiCDC log for regionID=535009:

     [region_worker.go:249] ["single region event feed disconnected"] [regionID=535009] [requestID=160574] [span="[7480000000000032ff745f720000000000fa, 7480000000000032ff745f730000000000fa)"] [checkpoint=434867736050663602] [error="[CDC:ErrEventFeedEventError]epoch_not_match:<current_regions:<id:9708689 start_key:\"t\\200\\000\\000\\000\\000\\0002\\377t\\000\\000\\000\\000\\000\\000\\000\\370\" end_key:\"t\\200\\000\\000\\000\\000\\0002\\377x\\000\\000\\000\\000\\000\\000\\000\\370\" region_epoch:<conf_ver:24125 version:9350 > peers:<id:9708690 store_id:4 > peers:<id:9708691 store_id:6 > peers:<id:9708692 store_id:5 > > current_regions:<id:535009 start_key:\"t\\200\\000\\000\\000\\000\\0002\\377x\\000\\000\\000\\000\\000\\000\\000\\370\" region_epoch:<conf_ver:24125 version:9350 > peers:<id:9320591 store_id:4 > peers:<id:9658463 store_id:6 > peers:<id:9663081 store_id:5 > > > "]
     [client.go:721] ["creating new stream to store to send request"] [regionID=535009] [requestID=160584] [storeID=4] [addr=10.70.5.43:20160]

  6. Searching the TiKV logs for regionID=535009 shows that the region was being split:

     [peer.rs:3831] ["on split"] [source=ipv4:10.xx.xx.14:43170] [split_keys="key 7480000000000032FF7800000000000000F8"] [peer_id=9320591] [region_id=535009]
     [pd.rs:749] ["try to batch split region"] [task=batch_split] [region="id: 535009 start_key: 7480000000000032FF7400000000000000F8 region_epoch { conf_ver: 24125 version: 9349 } peers { id: 9320591 store_id: 4 } peers { id: 9658463 store_id: 6 } peers { id: 9663081 store_id: 5 }"] [new_region_ids="[new_region_id: 9708689 new_peer_ids: 9708690 new_peer_ids: 9708691 new_peer_ids: 9708692]"] [region_id=535009]
     [peer.rs:3831] ["on split"] [source=ipv4:10.xx.xx.14:43186] [split_keys="key 7480000000000032FF7800000000000000F8"] [peer_id=9320591] [region_id=535009]

  7. The corresponding CDC errors are reported in tikv.log; the region is then deregistered and registered again:

     [delegate.rs:331] ["cdc met region error"] [error="Request(message: \"EpochNotMatch batchsplit\" epoch_not_match { current_regions { id: 9708689 start_key: 7480000000000032FF7400000000000000F8 end_key: 7480000000000032FF7800000000000000F8 region_epoch { conf_ver: 24125 version: 9350 } peers { id: 9708690 store_id: 4 } peers { id: 9708691 store_id: 6 } peers { id: 9708692 store_id: 5 } } current_regions { id: 535009 start_key: 7480000000000032FF7800000000000000F8 region_epoch { conf_ver: 24125 version: 9350 } peers { id: 9320591 store_id: 4 } peers { id: 9658463 store_id: 6 } peers { id: 9663081 store_id: 5 } } })"] [region_id=535009]
     [endpoint.rs:386] ["cdc deregister"] [deregister="Deregister { deregister: \"delegate\", region_id: 535009, observe_id: ObserveID(470582), err: Request(message: \"EpochNotMatch batchsplit\" epoch_not_match { current_regions { id: 9708689 start_key: 7480000000000032FF7400000000000000F8 end_key: 7480000000000032FF7800000000000000F8 region_epoch { conf_ver: 24125 version: 9350 } peers { id: 9708690 store_id: 4 } peers { id: 9708691 store_id: 6 } peers { id: 9708692 store_id: 5 } } current_regions { id: 535009 start_key: 7480000000000032FF7800000000000000F8 region_epoch { conf_ver: 24125 version: 9350 } peers { id: 9320591 store_id: 4 } peers { id: 9658463 store_id: 6 } peers { id: 9663081 store_id: 5 } } }) }"]
     [endpoint.rs:382] ["deregister observe region"] [observe_id=ObserveID(470579)] [region_id=535009] [store_id=Some(4)]
     [endpoint.rs:309] ["register observe region"] [region="id: 535009 start_key: 7480000000000032FF7800000000000000F8 region_epoch { conf_ver: 24125 version: 9350 } peers { id: 9320591 store_id: 4 } peers { id: 9658463 store_id: 6 } peers { id: 9663081 store_id: 5 }"]
     [endpoint.rs:233] ["Resolver initialized"] [pending_data_index=0] [snapshot_index=46313446] [observe_id=ObserveID(470583)] [region=535009]
     [endpoint.rs:523] ["cdc register region"] [downstream_id=DownstreamID(210251)] [req_id=160584] [conn_id=ConnID(4181)] [region_id=535009]
     [endpoint.rs:523] ["cdc register region"] [downstream_id=DownstreamID(210250)] [req_id=188637] [conn_id=ConnID(4180)] [region_id=535009]
     [endpoint.rs:1300] ["cdc resolver initialized and schedule resolver ready"] [takes=0ns] [observe_id=ObserveID(470589)] [lock_count=0] [resolved_ts=0] [downstream_id=DownstreamID(210250)] [conn_id=ConnID(4180)] [region_id=535009]
     [delegate.rs:385] ["cdc region is ready"] [region_id=535009]
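For context on the error in step 1: the stack trace points at (*scheduler).table2CaptureIndex in cdc/owner/scheduler.go. Below is a minimal, self-contained Go sketch of the kind of table-to-capture index such a check builds; the types and signature are simplified assumptions for illustration, not the actual tiflow implementation.

```go
package main

import "fmt"

// Simplified stand-ins for tiflow's model types (assumed for illustration).
type tableID = int64
type captureID = string

// table2CaptureIndex builds a table -> capture index from every capture's
// task status and rejects the state when the same table is claimed by two
// different captures. A duplicate assignment like this is what surfaces
// as ErrTableListenReplicated in the owner's log.
func table2CaptureIndex(tablesByCapture map[captureID][]tableID) (map[tableID]captureID, error) {
	index := make(map[tableID]captureID)
	for capture, tables := range tablesByCapture {
		for _, tbl := range tables {
			if other, ok := index[tbl]; ok && other != capture {
				return nil, fmt.Errorf(
					"A table(%d) is being replicated by at least two processors(%s, %s)",
					tbl, other, capture)
			}
			index[tbl] = capture
		}
	}
	return index, nil
}

func main() {
	// The state reported in this issue: table 12920 claimed by two captures.
	state := map[captureID][]tableID{
		"aefbf88f-eb9e-4d3e-b1bb-4b1e6a705b0f": {12920},
		"5af5719f-88ed-4711-837f-e64dbfd12e62": {12920},
	}
	if _, err := table2CaptureIndex(state); err != nil {
		fmt.Println(err)
	}
}
```

In other words, the owner bails out as soon as one table shows up under two captures, which is exactly the state the duplicated "add a table" operations above produce.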

What did you expect to see?

TiCDC should report no error.

What did you see instead?

[changefeed.go:106] ["an error occurred in Owner"] [changefeedID=kafka-tc-yxt-prod] [error="[CDC:ErrTableListenReplicated]A table(12920) is being replicated by at least two processors(aefbf88f-eb9e-4d3e-b1bb-4b1e6a705b0f, 5af5719f-88ed-4711-837f-e64dbfd12e62), please report a bug"] [errorVerbose="[CDC:ErrTableListenReplicated]A table(12920) is being replicated by at least two processors(aefbf88f-eb9e-4d3e-b1bb-4b1e6a705b0f, 5af5719f-88ed-4711-837f-e64dbfd12e62), please report a bug\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:156\ngithub.com/pingcap/ticdc/cdc/owner.(*scheduler).table2CaptureIndex\n\tgithub.com/pingcap/ticdc/cdc/owner/scheduler.go:150\ngithub.com/pingcap/ticdc/cdc/owner.(*scheduler).syncTablesWithCurrentTables\n\tgithub.com/pingcap/ticdc/cdc/owner/scheduler.go:231\ngithub.com/pingcap/ticdc/cdc/owner.(*scheduler).Tick\n\tgithub.com/pingcap/ticdc/cdc/owner/scheduler.go:74\ngithub.com/pingcap/ticdc/cdc/owner.(*changefeed).tick\n\tgithub.com/pingcap/ticdc/cdc/owner/changefeed.go:169\ngithub.com/pingcap/ticdc/cdc/owner.(*changefeed).Tick\n\tgithub.com/pingcap/ticdc/cdc/owner/changefeed.go:105\ngithub.com/pingcap/ticdc/cdc/owner.(*Owner).Tick\n\tgithub.com/pingcap/ticdc/cdc/owner/owner.go:134\ngithub.com/pingcap/ticdc/pkg/orchestrator.(*EtcdWorker).Run\n\tgithub.com/pingcap/ticdc/pkg/orchestrator/etcd_worker.go:187\ngithub.com/pingcap/ticdc/cdc/capture.(*Capture).runEtcdWorker\n\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:271\ngithub.com/pingcap/ticdc/cdc/capture.(*Capture).campaignOwner\n\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:250\ngithub.com/pingcap/ticdc/cdc/capture.(*Capture).run.func2\n\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:175\nruntime.goexit\n\truntime/asm_amd64.s:1371"]

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

v5.2.1

Upstream TiKV version (execute tikv-server --version):

v5.2.1

TiCDC version (execute cdc version):

v5.2.1

yilongrong, Jul 29 '22

/assign @overvenus @3AceShowHand

nongfushanquan, Jul 29 '22

The culprit is that there are two "add a table" operations issued almost simultaneously (only 8 ms apart):

[xxx.xxx.5.147] Jul 27, 2022 @ 10:26:09.620 [owner.go:268] ["add a table"] [tableId=12920] [boundaryTs=434867736286331009]
[xxx.xxx.5.147] Jul 27, 2022 @ 10:26:09.628 [owner.go:268] ["add a table"] [tableId=12920] [boundaryTs=434867736286331009]

I suspect this is caused by a race condition in this LBYL (look-before-you-leap) check:

https://github.com/pingcap/tiflow/blob/81c22b1c1b2041e2806160d8c7e1105a70815ff5/cdc/model/owner.go#L263-L267

The question is whether there will really be 2 goroutines running (*TaskStatus).AddTable on the same TaskStatus instance simultaneously.
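To make that concrete, here is a minimal, runnable Go sketch of the look-before-you-leap pattern; the type and field names are simplified stand-ins rather than the code at the link above, and a WaitGroup barrier is used only to force the unlucky interleaving deterministically:

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative sketch only (simplified names, not the tiflow code): the
// hazard in a look-before-you-leap AddTable is that the existence check
// and the insertion are not one atomic step, so two concurrent callers
// can both pass the check and both add the same table.
type taskStatus struct {
	mu     sync.Mutex
	tables map[int64]struct{}
	adds   int // how many times the "add" branch actually ran
}

// addTable takes a barrier so the unlucky interleaving is reproduced
// deterministically; in the real scheduler the gap is just goroutine
// scheduling between the check and the write.
func (ts *taskStatus) addTable(id int64, checked *sync.WaitGroup) {
	ts.mu.Lock()
	_, exists := ts.tables[id] // step 1: look
	ts.mu.Unlock()

	checked.Done() // both callers have looked before either one leaps
	checked.Wait()

	if exists {
		return
	}
	ts.mu.Lock()
	ts.tables[id] = struct{}{} // step 2: leap
	ts.adds++
	ts.mu.Unlock()
}

func main() {
	ts := &taskStatus{tables: map[int64]struct{}{}}
	var checked, done sync.WaitGroup
	checked.Add(2)
	done.Add(2)
	for i := 0; i < 2; i++ {
		go func() {
			defer done.Done()
			ts.addTable(12920, &checked) // the table ID from this report
		}()
	}
	done.Wait()
	fmt.Println("add branch ran", ts.adds, "time(s)") // prints 2
}
```

With both callers passing the existence check before either writes, the add branch runs twice, which would match the two back-to-back "add a table" log lines quoted above.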

kennytm, Aug 09 '22

(*TaskStatus).AddTable can be called by multiple different callers, so the issue could indeed be caused by the code snippet above.
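For completeness, a hypothetical variant of the sketch above that closes the window by making the check and the insert one critical section; this is only a generic mitigation, not necessarily the fix adopted in tiflow:

```go
// Hypothetical mitigation, reusing the taskStatus type from the sketch above:
// check and insert happen under the same lock, so a concurrent second caller
// sees the table as already tracked and backs off instead of adding it again.
func (ts *taskStatus) addTableOnce(id int64) bool {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	if _, exists := ts.tables[id]; exists {
		return false // duplicate "add a table" rejected
	}
	ts.tables[id] = struct{}{}
	ts.adds++
	return true
}
```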

3AceShowHand, Aug 15 '22

/severity moderate

fubinzh, Jun 13 '23