ticdc lag reached more than 12min when run ha_ticdc(owner)_to_ticdcProcessor(all)_network_partition

Open Lily2025 opened this issue 1 year ago • 7 comments

What did you do?

1. Run TPC-C with 10 threads and 1000 warehouses.
2. After 10 minutes, simulate a network partition that isolates the TiCDC owner from all other TiCDC processors. Fault start time: 2023-06-13 08:00:03
3. After 10 minutes, recover from the fault. Fault recovery time: 2023-06-13 08:10:04

What did you expect to see?

changefeed lag is less than 30s

What did you see instead?

changefeed lag reached more than 12min after the fault was injected

image

Versions of the cluster

git hash : 1e2f277f2e3d9b57b15db9a2a9b2c62832c071ca

current status of DM cluster (execute query-status <task-name> in dmctl)

No response

Lily2025 avatar Jun 14 '23 08:06 Lily2025

/remove-area dm /area ticdc

Lily2025 avatar Jun 14 '23 08:06 Lily2025

@fubinzh: The label(s) severity/moderete cannot be applied, because the repository doesn't have them.

In response to this:

/severity moderete

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ti-chi-bot[bot] avatar Jun 14 '23 09:06 ti-chi-bot[bot]

/severity moderate

fubinzh avatar Jun 14 '23 09:06 fubinzh

The owner and the processors keep themselves alive by watching keys in etcd. Because the network between TiCDC and PD is still healthy, no TiCDC node restarts after a network partition among the TiCDC nodes. However, the owner collects table replication progress through P2P messages. During the partition, the owner cannot receive that progress, so the changefeed lag increases.

nongfushanquan avatar Jun 15 '23 03:06 nongfushanquan

By design. I suggest we address it in the long term.

flowbehappy avatar Apr 08 '24 09:04 flowbehappy

TiCDC is a distributed system. The owner collects table progress from the processors through direct communication and uses it to advance a table's barrierTs. The barrierTs signals to the processor that data preceding it is ready to be flushed downstream. So, if there is a network partition between the owner and any processor, the changefeed cannot advance.
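The barrier mechanism described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual TiCDC implementation: `advanceBarrier` and `flushable` are hypothetical helpers, timestamps are plain integers, and an unreachable processor is modeled with an explicit `reachable` flag.

```go
package main

import "fmt"

// advanceBarrier returns the next barrierTs. Assumption: the owner may only
// advance the barrier to the minimum progress across all processors, and it
// must hold the current value if any processor cannot be reached.
func advanceBarrier(current uint64, progress map[string]uint64, reachable map[string]bool) uint64 {
	if len(progress) == 0 {
		return current
	}
	lowest := ^uint64(0)
	for name, ts := range progress {
		if !reachable[name] {
			return current // no fresh progress from this processor
		}
		if ts < lowest {
			lowest = ts
		}
	}
	if lowest < current {
		return current // the barrier never moves backwards
	}
	return lowest
}

// flushable returns the commit timestamps a processor may flush downstream:
// only those at or below the barrier.
func flushable(commitTs []uint64, barrierTs uint64) []uint64 {
	var out []uint64
	for _, ts := range commitTs {
		if ts <= barrierTs {
			out = append(out, ts)
		}
	}
	return out
}

func main() {
	progress := map[string]uint64{"p1": 120, "p2": 110}
	events := []uint64{100, 105, 115, 125}

	// Healthy network: the barrier follows the slowest processor.
	barrier := advanceBarrier(100, progress, map[string]bool{"p1": true, "p2": true})
	fmt.Println(barrier, flushable(events, barrier))

	// p2 is partitioned from the owner: p1 keeps progressing, but the
	// barrier, and therefore the flushable set, stays where it was.
	progress["p1"] = 200
	barrier = advanceBarrier(barrier, progress, map[string]bool{"p1": true, "p2": false})
	fmt.Println(barrier, flushable(events, barrier))
}
```

The point of the sketch is that a single unreachable processor pins the barrier for the whole changefeed, regardless of how far the other processors have progressed.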

asddongmen avatar Apr 30 '24 03:04 asddongmen