
Tracking issue for reducing high latency

Open · overvenus opened this issue 3 years ago · 1 comment

The status quo and the goal of the tracking issue.

| Scenario | Status quo (2022/03/02) | Goal |
| --- | --- | --- |
| **Planned outage** | | |
| TiCDC rolling upgrade | 2-8 mins | <= 10s |
| TiKV/TiDB rolling upgrade | <= 1min | <= 30s |
| PD rolling upgrade | <= 1min | <= 1min |
| TiKV scale out/in | <= 1min | <= 30s |
| TiCDC scale out/in | 2-8 mins | <= 5s |
| PD scale out/in | <= 1min | <= 1min |
| TiDB scale out/in | <= 1min | <= 30s |
| **Unplanned outage** (less than ⅓ of total nodes, outage lasts 5 mins) | | |
| TiKV/TiDB outage (power down, disk failure, etc.) | 2-4 mins | <= 2min |
| TiCDC outage (power down, disk failure, etc.) | 2-8 mins | 2-8 mins |
| PD outage (power down, disk failure, etc.) | 2-8 mins | 2-8 mins |
| **No outage** | | |
| Delay | <= 10s | <= 2s |
| Delay spikes (p99) | <= 1 min | <= 5s |

Extremely high latency (>30 min)

  • [x] https://github.com/pingcap/tiflow/issues/4516

Latency spike

  • [x] https://github.com/pingcap/tiflow/issues/4756
  • [x] https://github.com/pingcap/tiflow/issues/4761
  • [x] https://github.com/pingcap/tiflow/issues/4762
  • [x] https://github.com/pingcap/tiflow/issues/3529
  • [x] https://github.com/tikv/tikv/issues/12166
  • [ ] #7277

Two-phase scheduling

Two-phase scheduling aims to eliminate the high latency spikes (up to minutes) caused by table moves, i.e. moving a table from one TiCDC node to another.
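
As a rough illustration of the idea, here is a minimal Go sketch of a two-phase table move. All names (`Capture`, `Prepare`, `Commit`) are hypothetical and do not reflect tiflow's actual APIs or state names; the point is that the destination prepares its table pipeline while the source keeps replicating, so the table is paused only for the short commit window instead of a full cold start.

```go
package main

import "fmt"

// Capture is a hypothetical TiCDC node in this sketch.
type Capture struct {
	name     string
	serving  map[int64]bool // tables this capture is actively replicating
	prepared map[int64]bool // tables loaded but not yet emitting to the sink
}

func newCapture(name string) *Capture {
	return &Capture{name: name, serving: map[int64]bool{}, prepared: map[int64]bool{}}
}

// Prepare is phase one: the destination initializes the table pipeline
// while the source keeps replicating, so replication lag stays low.
func (c *Capture) Prepare(tableID int64) { c.prepared[tableID] = true }

// Commit is phase two: the source stops the table and the destination,
// already prepared, starts emitting immediately.
func Commit(src, dst *Capture, tableID int64) {
	delete(src.serving, tableID)
	delete(dst.prepared, tableID)
	dst.serving[tableID] = true
}

func main() {
	src, dst := newCapture("capture-1"), newCapture("capture-2")
	src.serving[42] = true

	dst.Prepare(42)      // phase 1: overlaps with ongoing replication
	Commit(src, dst, 42) // phase 2: fast handoff

	fmt.Println(src.serving[42], dst.serving[42]) // false true
}
```

In the single-phase scheme, the table is stopped on the source before the destination even begins initializing, which is where the minute-level spikes come from.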

  • [x] Cleanup scheduler v1 code https://github.com/pingcap/tiflow/pull/5362
  • [x] Cleanup schedulerv2 interfaces in owner and processor https://github.com/pingcap/tiflow/pull/5396
  • [x] Sketch two-phase scheduler (task based scheduling, move table task, add table task, remove table task) https://github.com/pingcap/tiflow/pull/5450
  • [x] Implement two-phase replication_set state transition https://github.com/pingcap/tiflow/pull/5450
  • [x] Implement replicationManager https://github.com/pingcap/tiflow/pull/5562
  • [x] Implement two-phase scheduler coordinator https://github.com/pingcap/tiflow/pull/5710
  • [x] Implement new two-phase APIs in table executor and table pipeline https://github.com/pingcap/tiflow/pull/5593
  • [x] Implement two-phase processor agent https://github.com/pingcap/tiflow/pull/5593
  • [x] Implement two-phase balance scheduler https://github.com/pingcap/tiflow/pull/5676
  • [x] Implement two-phase API scheduler (move table and rebalance) https://github.com/pingcap/tiflow/pull/5711
  • [x] Checkpoint management refactor (per table based) https://github.com/pingcap/tiflow/pull/5709
  • [x] Support rebalance when adding a new TiCDC node https://github.com/pingcap/tiflow/pull/5760
  • [x] Support drain capture API https://github.com/pingcap/tiflow/pull/5852
  • [x] Add metrics https://github.com/pingcap/tiflow/pull/5823
  • [x] https://github.com/pingcap/tiflow/pull/5844
  • [x] https://github.com/pingcap/tiflow/pull/6196

Graceful shutdown and upgrade

  • [x] https://github.com/pingcap/tiflow/pull/6097
  • [x] Resign ownership during graceful shutdown https://github.com/pingcap/tiflow/pull/6110
  • [x] Move out all tables during graceful shutdown https://github.com/pingcap/tiflow/pull/6110
  • [x] https://github.com/pingcap/tiup/pull/1972
  • [x] https://github.com/pingcap/tidb-operator/issues/4623
  • [x] https://github.com/pingcap/tidb-operator/pull/4647
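
The two key steps tracked above (resign ownership, then move out all tables) can be sketched as an ordered shutdown sequence. This is an illustrative Go sketch only; the type and method names are hypothetical, not tiflow internals.

```go
package main

import "fmt"

// node is a hypothetical TiCDC process in this sketch.
type node struct {
	isOwner bool
	tables  []int64
}

// shutdownSteps returns, in order, what a node does on a graceful stop:
// resign ownership first so another node can coordinate the drain, then
// move every table to a peer so replication never cold-starts, then exit.
func (n *node) shutdownSteps() []string {
	var steps []string
	if n.isOwner {
		n.isOwner = false
		steps = append(steps, "resign ownership")
	}
	for range n.tables {
		steps = append(steps, "move table to another capture")
	}
	n.tables = nil
	steps = append(steps, "exit process")
	return steps
}

func main() {
	n := &node{isOwner: true, tables: []int64{1, 2}}
	for _, s := range n.shutdownSteps() {
		fmt.Println(s)
	}
}
```

Ordering matters: if the process exited while still owning tables, downstream captures would have to cold-start them, which is exactly the minutes-long pause the rolling-upgrade goal rules out.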

Cross-version graceful upgrade

  • [x] #6450

overvenus avatar Mar 03 '22 07:03 overvenus

`tiup cluster stop test-cluster` takes a long time and appears to trigger a graceful shutdown. Is this expected?

The command above hard-stops the whole cluster; I don't think it should perform a graceful shutdown.

3AceShowHand avatar Jul 12 '22 13:07 3AceShowHand

With the default settings in v6.5.0, changefeed replication lag is less than 2s in both the normal scenario and planned rolling restarts/upgrades. Except for the large-table scenario, I think we have met the requirements described in this issue. Let's track the large-table scenario in https://github.com/pingcap/tiflow/issues/7720.

overvenus avatar Nov 25 '22 08:11 overvenus