Tracking issue for reducing high latency
The status quo and the goals of this tracking issue:
| Scenario | Status quo (2022/03/02) | Goal |
|---|---|---|
| Planned outage | | |
| TiCDC rolling upgrade | 2-8 mins | <= 10s |
| TiKV/TiDB rolling upgrade | <= 1 min | <= 30s |
| PD rolling upgrade | <= 1 min | <= 1 min |
| TiKV scale out/in | <= 1 min | <= 30s |
| TiCDC scale out/in | 2-8 mins | <= 5s |
| PD scale out/in | <= 1 min | <= 1 min |
| TiDB scale out/in | <= 1 min | <= 30s |
| Unplanned outage (fewer than ⅓ of the nodes, outage lasts 5 mins) | | |
| TiKV/TiDB outage (power down, disk failure, etc.) | 2-4 mins | <= 2 min |
| TiCDC outage (power down, disk failure, etc.) | 2-8 mins | 2-8 mins |
| PD outage (power down, disk failure, etc.) | 2-8 mins | 2-8 mins |
| No outage | | |
| Delay | <= 10s | <= 2s |
| Delay spikes (99%) | <= 1 min | <= 5s |
Extreme high latency (>30 min)
- [x] https://github.com/pingcap/tiflow/issues/4516
Latency spike
- [x] https://github.com/pingcap/tiflow/issues/4756
- [x] https://github.com/pingcap/tiflow/issues/4761
- [x] https://github.com/pingcap/tiflow/issues/4762
- [x] https://github.com/pingcap/tiflow/issues/3529
- [x] https://github.com/tikv/tikv/issues/12166
- [ ] #7277
Two-phase scheduling
Two-phase scheduling aims to eliminate the high latency spikes (up to several minutes) caused by moving a table from one TiCDC node to another; a minimal illustrative sketch follows the checklist below.
- [x] Cleanup scheduler v1 code https://github.com/pingcap/tiflow/pull/5362
- [x] Cleanup schedulerv2 interfaces in owner and processor https://github.com/pingcap/tiflow/pull/5396
- [x] Sketch two-phase scheduler (task based scheduling, move table task, add table task, remove table task) https://github.com/pingcap/tiflow/pull/5450
- [x] Implement two-phase replication_set state transition https://github.com/pingcap/tiflow/pull/5450
- [x] Implement replicationManager https://github.com/pingcap/tiflow/pull/5562
- [x] Implement two-phase scheduler coordinator https://github.com/pingcap/tiflow/pull/5710
- [x] Implement new APIs of two-phase in table executor and table pipeline https://github.com/pingcap/tiflow/pull/5593
- [x] Implement two-phase processor agent https://github.com/pingcap/tiflow/pull/5593
- [x] Implement two-phase balance scheduler https://github.com/pingcap/tiflow/pull/5676
- [x] Implement two-phase API scheduler (move table and rebalance) https://github.com/pingcap/tiflow/pull/5711
- [x] Checkpoint management refactor (per table based) https://github.com/pingcap/tiflow/pull/5709
- [x] Support rebalance when adding a new TiCDC node https://github.com/pingcap/tiflow/pull/5760
- [x] Support drain capture API https://github.com/pingcap/tiflow/pull/5852
- [x] Add metrics https://github.com/pingcap/tiflow/pull/5823
- [x] https://github.com/pingcap/tiflow/pull/5844
- [x] https://github.com/pingcap/tiflow/pull/6196
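The sketch below is only an illustration of the two-phase idea, not the actual tiflow code; the state names, the `Table` struct, and the `moveTable` helper are hypothetical simplifications. It shows why the move no longer causes a long spike: the destination capture prepares the table while the source keeps replicating, and the switch-over is committed only after the prepare phase finishes.

```go
package main

import "fmt"

// ReplicationState is a simplified stand-in for the per-table state used by
// the two-phase scheduler. Names are illustrative, not the real tiflow types.
type ReplicationState int

const (
	Absent      ReplicationState = iota // table not assigned to any capture
	Preparing                           // destination catches up, source still replicates
	Commit                              // source stops, destination takes over
	Replicating                         // steady state: exactly one capture replicates the table
)

// Table tracks which capture replicates a table and which one is preparing.
type Table struct {
	ID        int64
	State     ReplicationState
	Primary   string // capture currently replicating the table
	Secondary string // capture preparing to take over (during a move)
}

// moveTable sketches the two phases of moving a table between captures.
// Phase 1 (prepare): the destination initializes its table pipeline and
// catches up while the source keeps emitting rows, so the changefeed
// checkpoint keeps advancing. Phase 2 (commit): only now does the source
// stop the table, and the destination switches to replicating.
func moveTable(t *Table, dest string) {
	// Phase 1: prepare on the destination without touching the source.
	t.Secondary = dest
	t.State = Preparing
	fmt.Printf("table %d: preparing on %s, %s keeps replicating\n", t.ID, dest, t.Primary)

	// ... wait until the destination reports the table is prepared ...

	// Phase 2: commit the switch-over.
	t.State = Commit
	t.Primary, t.Secondary = t.Secondary, ""
	t.State = Replicating
	fmt.Printf("table %d: now replicated by %s\n", t.ID, t.Primary)
}

func main() {
	t := &Table{ID: 42, State: Replicating, Primary: "capture-1"}
	moveTable(t, "capture-2")
}
```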
Graceful shutdown and upgrade
- [x] https://github.com/pingcap/tiflow/pull/6097
- [x] Resign ownership during graceful shutdown https://github.com/pingcap/tiflow/pull/6110
- [x] Move out all tables during graceful shutdown (see the sketch after this list) https://github.com/pingcap/tiflow/pull/6110
- [x] https://github.com/pingcap/tiup/pull/1972
- [x] https://github.com/pingcap/tidb-operator/issues/4623
- [x] https://github.com/pingcap/tidb-operator/pull/4647
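As a rough illustration of the shutdown order tracked above (not the actual tiflow implementation), the sketch below assumes hypothetical `resignOwner`, `requestDrain`, and `exit` helpers: the capture first gives up ownership, then asks the owner to drain all of its tables to other captures, and exits only once it owns no tables, so a rolling restart or upgrade does not interrupt replication abruptly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// gracefulShutdown sketches the sequence tracked in this section:
// 1. resign ownership so another capture can become the owner immediately,
// 2. drain all tables to other captures (the "drain capture" API),
// 3. exit only after this capture owns no tables.
// All helpers are hypothetical stand-ins for the real tiflow internals.
func gracefulShutdown(ctx context.Context, c *capture) error {
	if c.isOwner {
		if err := c.resignOwner(ctx); err != nil {
			return err
		}
	}
	// Ask the owner to move every table off this capture, then poll until
	// the capture is empty or the shutdown deadline expires.
	if err := c.requestDrain(ctx); err != nil {
		return err
	}
	for c.tableCount() > 0 {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
	return c.exit(ctx)
}

// capture is a minimal stand-in for a TiCDC capture process.
type capture struct {
	isOwner bool
	tables  int
}

func (c *capture) resignOwner(context.Context) error  { c.isOwner = false; return nil }
func (c *capture) requestDrain(context.Context) error { return nil }
func (c *capture) exit(context.Context) error         { fmt.Println("capture exited"); return nil }

// tableCount simulates the owner moving tables off this capture one at a time.
func (c *capture) tableCount() int {
	if c.tables > 0 {
		c.tables--
	}
	return c.tables
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = gracefulShutdown(ctx, &capture{isOwner: true, tables: 3})
}
```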
Cross-version graceful upgrade
- [x] #6450
`tiup cluster stop test-cluster` takes a long time and seems to trigger a graceful shutdown. Is this expected?
The command above hard-stops the whole cluster; I don't think it should perform a graceful shutdown.
With the default settings in v6.5.0, changefeed replication lag is less than 2s in both the normal scenario and planned rolling restarts/upgrades. Except for the large-table scenario, I think we have met the requirements described in this issue. Let's track the large-table scenario in https://github.com/pingcap/tiflow/issues/7720.