Tracking issue for reducing high latency
The status quo and the goals of this tracking issue:
| Scenario | Status quo (2022/03/02) | Goal |
|---|---|---|
| Planned outage | | |
| TiCDC rolling upgrade | 2-8 mins | <= 10s |
| TiKV/TiDB rolling upgrade | <= 1 min | <= 30s |
| PD rolling upgrade | <= 1 min | <= 1 min |
| TiKV scale out/in | <= 1 min | <= 30s |
| TiCDC scale out/in | 2-8 mins | <= 5s |
| PD scale out/in | <= 1 min | <= 1 min |
| TiDB scale out/in | <= 1 min | <= 30s |
| Unplanned outage (fewer than ⅓ of the nodes, outage lasts 5 mins) | | |
| TiKV/TiDB outage (power down, disk failure, etc.) | 2-4 mins | <= 2 min |
| TiCDC outage (power down, disk failure, etc.) | 2-8 mins | 2-8 mins |
| PD outage (power down, disk failure, etc.) | 2-8 mins | 2-8 mins |
| No outage | | |
| Delay | <= 10s | <= 2s |
| Delay spikes (99%) | <= 1 min | <= 5s |
Extreme high latency (>30 min)
- [x] https://github.com/pingcap/tiflow/issues/4516
Latency spike
- [x] https://github.com/pingcap/tiflow/issues/4756
- [x] https://github.com/pingcap/tiflow/issues/4761
- [x] https://github.com/pingcap/tiflow/issues/4762
- [x] https://github.com/pingcap/tiflow/issues/3529
- [x] https://github.com/tikv/tikv/issues/12166
- [ ] #7277
Two-phase scheduling
Two-phase scheduling aims to eliminate the high latency spikes (up to several minutes) caused by moving a table from one TiCDC node to another; a minimal illustrative sketch follows the checklist below.
- [x] Cleanup scheduler v1 code https://github.com/pingcap/tiflow/pull/5362
- [x] Cleanup schedulerv2 interfaces in owner and processor https://github.com/pingcap/tiflow/pull/5396
- [x] Sketch two-phase scheduler (task based scheduling, move table task, add table task, remove table task) https://github.com/pingcap/tiflow/pull/5450
- [x] Implement two-phase replication_set state transition https://github.com/pingcap/tiflow/pull/5450
- [x] Implement replicationManager https://github.com/pingcap/tiflow/pull/5562
- [x] Implement two-phase scheduler coordinator https://github.com/pingcap/tiflow/pull/5710
- [x] Implement new APIs of two-phase in table executor and table pipeline https://github.com/pingcap/tiflow/pull/5593
- [x] Implement two-phase processor agent https://github.com/pingcap/tiflow/pull/5593
- [x] Implement two-phase balance scheduler https://github.com/pingcap/tiflow/pull/5676
- [x] Implement two-phase API scheduler (move table and rebalance) https://github.com/pingcap/tiflow/pull/5711
- [x] Checkpoint management refactor (per table based) https://github.com/pingcap/tiflow/pull/5709
- [x] Support rebalance when adding a new TiCDC node https://github.com/pingcap/tiflow/pull/5760
- [x] Support drain capture API https://github.com/pingcap/tiflow/pull/5852
- [x] Add metrics https://github.com/pingcap/tiflow/pull/5823
- [x] https://github.com/pingcap/tiflow/pull/5844
- [x] https://github.com/pingcap/tiflow/pull/6196
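The sketch below is only an illustration of the two-phase idea, not the actual tiflow code; the state names, the `Table` struct, and the `moveTable` helper are hypothetical simplifications. It shows why the move no longer causes a long spike: the destination capture prepares the table while the source keeps replicating, and the switch-over is committed only after the prepare phase finishes.

```go
package main

import "fmt"

// ReplicationState is a simplified stand-in for the per-table state used by
// the two-phase scheduler. Names are illustrative, not the real tiflow types.
type ReplicationState int

const (
	Absent      ReplicationState = iota // table not assigned to any capture
	Preparing                           // destination catches up, source still replicates
	Commit                              // source stops, destination takes over
	Replicating                         // steady state: exactly one capture replicates the table
)

// Table tracks which capture replicates a table and which one is preparing.
type Table struct {
	ID        int64
	State     ReplicationState
	Primary   string // capture currently replicating the table
	Secondary string // capture preparing to take over (during a move)
}

// moveTable sketches the two phases of moving a table between captures.
// Phase 1 (prepare): the destination initializes its table pipeline and
// catches up while the source keeps emitting rows, so the changefeed
// checkpoint keeps advancing. Phase 2 (commit): only now does the source
// stop the table, and the destination switches to replicating.
func moveTable(t *Table, dest string) {
	// Phase 1: prepare on the destination without touching the source.
	t.Secondary = dest
	t.State = Preparing
	fmt.Printf("table %d: preparing on %s, %s keeps replicating\n", t.ID, dest, t.Primary)

	// ... wait until the destination reports the table is prepared ...

	// Phase 2: commit the switch-over.
	t.State = Commit
	t.Primary, t.Secondary = t.Secondary, ""
	t.State = Replicating
	fmt.Printf("table %d: now replicated by %s\n", t.ID, t.Primary)
}

func main() {
	t := &Table{ID: 42, State: Replicating, Primary: "capture-1"}
	moveTable(t, "capture-2")
}
```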
Graceful shutdown and upgrade
- [x] https://github.com/pingcap/tiflow/pull/6097
- [x] Resign ownership during graceful shutdown https://github.com/pingcap/tiflow/pull/6110
- [x] Move out all tables during graceful shutdown (see the sketch after this list) https://github.com/pingcap/tiflow/pull/6110
- [x] https://github.com/pingcap/tiup/pull/1972
- [x] https://github.com/pingcap/tidb-operator/issues/4623
- [x] https://github.com/pingcap/tidb-operator/pull/4647
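As a rough illustration of the shutdown order tracked above (not the actual tiflow implementation), the sketch below assumes hypothetical `resignOwner`, `requestDrain`, and `exit` helpers: the capture first gives up ownership, then asks the owner to drain all of its tables to other captures, and exits only once it owns no tables, so a rolling restart or upgrade does not interrupt replication abruptly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// gracefulShutdown sketches the sequence tracked in this section:
// 1. resign ownership so another capture can become the owner immediately,
// 2. drain all tables to other captures (the "drain capture" API),
// 3. exit only after this capture owns no tables.
// All helpers are hypothetical stand-ins for the real tiflow internals.
func gracefulShutdown(ctx context.Context, c *capture) error {
	if c.isOwner {
		if err := c.resignOwner(ctx); err != nil {
			return err
		}
	}
	// Ask the owner to move every table off this capture, then poll until
	// the capture is empty or the shutdown deadline expires.
	if err := c.requestDrain(ctx); err != nil {
		return err
	}
	for c.tableCount() > 0 {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
	return c.exit(ctx)
}

// capture is a minimal stand-in for a TiCDC capture process.
type capture struct {
	isOwner bool
	tables  int
}

func (c *capture) resignOwner(context.Context) error  { c.isOwner = false; return nil }
func (c *capture) requestDrain(context.Context) error { return nil }
func (c *capture) exit(context.Context) error         { fmt.Println("capture exited"); return nil }

// tableCount simulates the owner moving tables off this capture one at a time.
func (c *capture) tableCount() int {
	if c.tables > 0 {
		c.tables--
	}
	return c.tables
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = gracefulShutdown(ctx, &capture{isOwner: true, tables: 3})
}
```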
Cross-version graceful upgrade
- [x] #6450
`tiup cluster stop test-cluster` takes a long time and seems to trigger a graceful shutdown. Is this expected?
The command above hard-stops the whole cluster; I don't think it should perform a graceful shutdown.
With the default settings in v6.5.0, changefeed replication lag is less than 2s in both the normal scenario and planned rolling restarts/upgrades. Except for the large-table scenario, I think we have met the requirements described in this issue. Let's track the large-table scenario in https://github.com/pingcap/tiflow/issues/7720.