tiflow
ticdc lag reached more than 12min when run ha_ticdc(owner)_to_ticdcProcessor(all)_network_partition
What did you do?
1. Run TPC-C with 10 threads and warehouse 1000.
2. After 10 minutes, inject a fault: the TiCDC owner is network-isolated from all other TiCDC processors. Fault start time: 2023-06-13 08:00:03
3. After 10 minutes, recover the fault. Fault recover time: 2023-06-13 08:10:04
What did you expect to see?
changefeed lag is less than 30s
What did you see instead?
changefeed lag reached more than 12min after injecting the fault
Versions of the cluster
git hash : 1e2f277f2e3d9b57b15db9a2a9b2c62832c071ca
current status of DM cluster (execute query-status <task-name> in dmctl)
No response
/remove-area dm /area ticdc
@fubinzh: The label(s) severity/moderete cannot be applied, because the repository doesn't have them.
In response to this:
/severity moderete
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
/severity moderate
The owner and the processors keep themselves alive by watching keys in etcd. Since the network between TiCDC and PD is normal, the TiCDC nodes do not restart after a network partition among TiCDC nodes. However, the owner collects table replication progress through P2P messages, so after a network partition the owner stops receiving progress reports and the changefeed lag increases.
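The stall can be sketched as follows. This is a hypothetical Go sketch, not TiCDC's actual code: the names `ownerState`, `lastCheckpoint`, and `globalCheckpoint` are illustrative. The owner keeps the last checkpoint each processor reported over P2P, and the changefeed's global checkpoint is the minimum of those values, so one partitioned processor pins the minimum while the others keep advancing:

```go
package main

import "fmt"

// ownerState models the owner's view of processor progress
// (illustrative only; field names are not from TiCDC).
type ownerState struct {
	// last checkpoint TS received from each processor, keyed by capture ID
	lastCheckpoint map[string]uint64
}

// globalCheckpoint returns the minimum checkpoint across all processors;
// the changefeed cannot advance past this value.
func (o *ownerState) globalCheckpoint() uint64 {
	var min uint64
	first := true
	for _, ts := range o.lastCheckpoint {
		if first || ts < min {
			min = ts
			first = false
		}
	}
	return min
}

func main() {
	o := &ownerState{lastCheckpoint: map[string]uint64{
		"capture-1": 440000000,
		"capture-2": 440000500,
	}}
	fmt.Println(o.globalCheckpoint()) // 440000000

	// capture-2 keeps reporting, but capture-1 is partitioned from the
	// owner, so its stale entry pins the global checkpoint in place.
	o.lastCheckpoint["capture-2"] = 440009000
	fmt.Println(o.globalCheckpoint()) // still 440000000
}
```

Because lag is measured against the global checkpoint, it grows for the full duration of the partition even though the partitioned processor may still be replicating locally.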
By design. I suggest we address it in the long term.
TiCDC functions as a distributed system. The owner collects table progress data from the processors through direct communication and uses it to advance a table's barrierTs, which signals to the processors that data preceding the barrierTs is ready to be flushed downstream. So, if there is a network partition between the owner and any processor, the changefeed cannot advance.
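The processor side of this contract can be sketched too. Again a hypothetical illustration rather than TiCDC's real code (`event` and `flushReady` are invented names): a processor may only flush events whose commit TS is at or below the barrierTs last broadcast by the owner, so if the owner is unreachable the barrier stays put and pending events accumulate:

```go
package main

import "fmt"

// event is a minimal stand-in for a replicated row change.
type event struct{ commitTs uint64 }

// flushReady splits pending events into those safe to flush under the
// current barrier and those that must wait for the barrier to advance.
func flushReady(pending []event, barrierTs uint64) (flush, wait []event) {
	for _, e := range pending {
		if e.commitTs <= barrierTs {
			flush = append(flush, e)
		} else {
			wait = append(wait, e)
		}
	}
	return
}

func main() {
	pending := []event{{100}, {200}, {300}}
	// The barrier last advanced to 200 before the partition; until the
	// owner is reachable again, the event at 300 can never be flushed.
	flush, wait := flushReady(pending, 200)
	fmt.Println(len(flush), len(wait)) // 2 1
}
```

This is why the lag recovers only after the partition heals: once the owner receives fresh progress again, it can move the barrier forward and the backlog drains.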