drop index issued after add index finished on the primary cluster was retried repeatedly by TiCDC because add index was still running on the secondary; after a while the changefeed status became failed
What did you do?
1. Restore data for the primary and secondary clusters.
2. Create changefeeds and set the BDR role for the primary and secondary.
3. Run sysbench on both the primary and secondary.
4. Add an index on the primary, then drop that index once the add index finished.
5. Inject a network partition between one of the TiKV pods and all other pods, lasting 3 minutes, then recover. Chaos recover time: 2024-02-29 17:29:24.
ticdc logs: endless-ha-test-ticdc-tps-7080582-1-976-tc-ticdc-0.tar.gz
What did you expect to see?
The changefeed status is normal after the fault recovers, the DDL syncs successfully, and the changefeed lag returns to normal.
What did you see instead?
1. The drop index DDL was retried repeatedly even though it was queueing on the secondary, because an add index DDL was still running.
primary:
secondary:
2. After some time, the changefeed status became failed.
changefeed status:
Versions of the cluster
```
./cdc version
Release Version: v8.0.0-alpha
Git Commit Hash: 25ce29c2a1802bbb4cd26008f322728959a91f7a
Git Branch: heads/refs/tags/v8.0.0-alpha
UTC Build Time: 2024-02-27 11:37:29
Go Version: go version go1.21.6 linux/amd64
Failpoint Build: false
```
current status of DM cluster (execute `query-status <task-name>` in dmctl)
No response
/remove-area dm /area ticdc
/assign sdojjy
From the TiCDC view, it's not a bug. In this case, we found that the `add index` DDL hangs on the TiDB side, and the `drop index` DDL must wait for the `add index` DDL.
TiCDC sets `readTimeout` to `2m` by default, so the MySQL connection reports `invalid connection` after waiting for 2m, and the changefeed fails after backing off 20 times.
```
[mysql] 2024/03/01 12:23:11 packets.go:37: read tcp 192.168.13.151:54646->10.2.12.46:30386: i/o timeout
```
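The mechanism can be shown with a minimal Go sketch, assuming the downstream is reached through go-sql-driver/mysql (the driver that produced the log line above); the table `sbtest1`, the index `idx_c`, and the DSN are hypothetical stand-ins for the sysbench setup.

```go
// Minimal sketch of the failure mechanism; names below are illustrative.
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// readTimeout mirrors TiCDC's 2m default for the MySQL sink.
	dsn := "root:@tcp(127.0.0.1:4000)/test?readTimeout=2m"
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// While an ADD INDEX is still running downstream, this DROP INDEX sits
	// in the DDL queue; after 2 minutes the driver gives up with an i/o
	// timeout and marks the connection invalid, which is the error the
	// changefeed keeps retrying on.
	if _, err := db.Exec("ALTER TABLE sbtest1 DROP INDEX idx_c"); err != nil {
		log.Printf("DDL failed, sink retries: %v", err)
	}
}
```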
Users can increase the read timeout value in the changefeed config to work around this issue:
```toml
[sink.mysql-config]
# use a value larger than the 2m default, for example:
read-timeout = "5m"
```
This issue happens in the following scenario:
1. The downstream is adding an index, which takes a long time to complete.
2. The upstream drops the index and the DDL is synced to the downstream, where the drop index is retried repeatedly.
3. The changefeed may fail after retrying 20 times.
This might result in changefeed failure if an add index is followed by another DDL (not only drop index) and the add index is not able to finish in the downstream within 20 retries. /severity moderate
I suggest that, after a DDL retry has failed because of a timeout, we check whether there is any running DDL job in the downstream. If so, instead of scheduling the 3rd retry, we poll until the DDL job queue is empty.
Polling after a retry failure allows us to proceed after the "successful" retry of ADD INDEX, while preventing the DROP INDEX from retrying 20 times.
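A minimal sketch of that check, under the assumption that the downstream DDL queue can be inspected with `ADMIN SHOW DDL JOBS`; `hasRunningDDL` and `waitDownstreamDDLQueueEmpty` are hypothetical helpers, not TiCDC's actual retry code:

```go
// Hypothetical sketch of the proposed check; not TiCDC's actual retry logic.
package mysqlsink

import (
	"context"
	"database/sql"
	"time"
)

// hasRunningDDL reports whether the downstream TiDB still has a DDL job in
// the "running" state (e.g. the long ADD INDEX).
func hasRunningDDL(ctx context.Context, db *sql.DB) (bool, error) {
	rows, err := db.QueryContext(ctx,
		"ADMIN SHOW DDL JOBS WHERE state = 'running'")
	if err != nil {
		return false, err
	}
	defer rows.Close()
	return rows.Next(), rows.Err()
}

// waitDownstreamDDLQueueEmpty polls until no DDL job is running downstream,
// instead of spending the changefeed's limited retry budget on a DDL that
// cannot be scheduled yet.
func waitDownstreamDDLQueueEmpty(ctx context.Context, db *sql.DB) error {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		running, err := hasRunningDDL(ctx, db)
		if err != nil || !running {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```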
This should be considered an optimization and does not necessarily work in all cases, including:
- when `ADMIN SHOW DDL;` cannot be executed (e.g. a permission issue)
- when the downstream is not TiDB (i.e. MySQL / MariaDB, unless they also have an equivalent of `ADMIN SHOW DDL`)