tiflow
ticdc lag reached more than 12min when run ha_ticdc(owner)_to_ticdcProcessor(all)_network_partition
What did you do?
1. Run TPC-C with 10 threads and warehouse 1000.
2. After 10 minutes, inject a fault: the TiCDC owner is network-isolated from all other TiCDC processors. Fault start time: 2023-06-13 08:00:03
3. After 10 minutes, recover the fault. Fault recover time: 2023-06-13 08:10:04
What did you expect to see?
changefeed lag is less than 30s
What did you see instead?
changefeed lag reached more than 12min after injecting the fault
Versions of the cluster
git hash : 1e2f277f2e3d9b57b15db9a2a9b2c62832c071ca
current status of DM cluster (execute query-status <task-name> in dmctl)
No response
/remove-area dm /area ticdc
@fubinzh: The label(s) severity/moderete cannot be applied, because the repository doesn't have them.
In response to this:
/severity moderete
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
/severity moderate
The owner and the processors keep themselves alive by watching keys in etcd. Since the network between TiCDC and PD is normal, the TiCDC nodes do not restart after a network partition among TiCDC nodes. However, the owner collects table replication progress through P2P messages, so after a network partition the owner stops receiving progress reports and the changefeed lag increases.
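The stall can be sketched as follows. This is a hypothetical Go sketch, not TiCDC's actual code: the names `ownerState`, `lastCheckpoint`, and `globalCheckpoint` are illustrative. The owner keeps the last checkpoint each processor reported over P2P, and the changefeed's global checkpoint is the minimum of those values, so one partitioned processor pins the minimum while the others keep advancing:

```go
package main

import "fmt"

// ownerState models the owner's view of processor progress
// (illustrative only; field names are not from TiCDC).
type ownerState struct {
	// last checkpoint TS received from each processor, keyed by capture ID
	lastCheckpoint map[string]uint64
}

// globalCheckpoint returns the minimum checkpoint across all processors;
// the changefeed cannot advance past this value.
func (o *ownerState) globalCheckpoint() uint64 {
	var min uint64
	first := true
	for _, ts := range o.lastCheckpoint {
		if first || ts < min {
			min = ts
			first = false
		}
	}
	return min
}

func main() {
	o := &ownerState{lastCheckpoint: map[string]uint64{
		"capture-1": 440000000,
		"capture-2": 440000500,
	}}
	fmt.Println(o.globalCheckpoint()) // 440000000

	// capture-2 keeps reporting, but capture-1 is partitioned from the
	// owner, so its stale entry pins the global checkpoint in place.
	o.lastCheckpoint["capture-2"] = 440009000
	fmt.Println(o.globalCheckpoint()) // still 440000000
}
```

Because lag is measured against the global checkpoint, it grows for the full duration of the partition even though the partitioned processor may still be replicating locally.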
By design. I suggest we address it in the long term.
TiCDC functions as a distributed system. The owner collects table progress data from the processors through direct communication and uses it to advance a table's barrierTs, which signals to the processors that data preceding the barrierTs is ready to be flushed downstream. So, if there is a network partition between the owner and any processor, the changefeed cannot advance.
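The processor side of this contract can be sketched too. Again a hypothetical illustration rather than TiCDC's real code (`event` and `flushReady` are invented names): a processor may only flush events whose commit TS is at or below the barrierTs last broadcast by the owner, so if the owner is unreachable the barrier stays put and pending events accumulate:

```go
package main

import "fmt"

// event is a minimal stand-in for a replicated row change.
type event struct{ commitTs uint64 }

// flushReady splits pending events into those safe to flush under the
// current barrier and those that must wait for the barrier to advance.
func flushReady(pending []event, barrierTs uint64) (flush, wait []event) {
	for _, e := range pending {
		if e.commitTs <= barrierTs {
			flush = append(flush, e)
		} else {
			wait = append(wait, e)
		}
	}
	return
}

func main() {
	pending := []event{{100}, {200}, {300}}
	// The barrier last advanced to 200 before the partition; until the
	// owner is reachable again, the event at 300 can never be flushed.
	flush, wait := flushReady(pending, 200)
	fmt.Println(len(flush), len(wait)) // 2 1
}
```

This is why the lag recovers only after the partition heals: once the owner receives fresh progress again, it can move the barrier forward and the backlog drains.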