flink-remote-shuffle

Support standby ShuffleManager

Open TanYuxin-tyx opened this issue 3 years ago • 0 comments

Motivation

Currently, when ShuffleManager fails, its high availability depends on support from external services. In essence, ShuffleManager is a single point of failure.

We can introduce one or more standby ShuffleManagers to solve this problem. When the active ShuffleManager fails, a standby ShuffleManager should automatically switch to active mode, which would improve stability and availability.
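
To make the intended behavior concrete, here is a minimal sketch of one way a standby could take over, assuming a lease-based election. It is not the flink-remote-shuffle implementation: `LeaseStore`, `ManagerNode`, and `tick` are hypothetical names, and the in-memory store only stands in for whatever coordination service a real deployment would use.

```java
import java.util.Optional;

public class StandbyFailoverSketch {

    /** Hypothetical in-memory lease store; a real setup would use an external coordination service. */
    static final class LeaseStore {
        private String holder;         // id of the current active manager, or null if none
        private long expiresAtMillis;  // time at which the current lease lapses

        /** Grants or renews the lease if it is free, expired, or already held by the candidate. */
        synchronized boolean tryAcquire(String candidate, long nowMillis, long ttlMillis) {
            boolean free = holder == null || nowMillis >= expiresAtMillis;
            if (free || holder.equals(candidate)) {
                holder = candidate;
                expiresAtMillis = nowMillis + ttlMillis;
                return true;
            }
            return false;
        }

        /** Returns the holder of a lease that is still valid at the given time, if any. */
        synchronized Optional<String> activeHolder(long nowMillis) {
            return (holder != null && nowMillis < expiresAtMillis)
                    ? Optional.of(holder)
                    : Optional.empty();
        }
    }

    /** Hypothetical manager process that is either active or standby. */
    static final class ManagerNode {
        final String id;
        private boolean active;

        ManagerNode(String id) {
            this.id = id;
        }

        /** One heartbeat: the active node renews its lease, a standby tries to take over. */
        void tick(LeaseStore store, long nowMillis, long ttlMillis) {
            active = store.tryAcquire(id, nowMillis, ttlMillis);
        }

        boolean isActive() {
            return active;
        }
    }

    public static void main(String[] args) {
        LeaseStore store = new LeaseStore();
        ManagerNode a = new ManagerNode("manager-a");
        ManagerNode b = new ManagerNode("manager-b");
        long ttl = 1000;

        a.tick(store, 0, ttl);    // manager-a acquires the lease and becomes active
        b.tick(store, 0, ttl);    // manager-b stays standby
        System.out.println("t=0    active: " + store.activeHolder(0).orElse("none"));

        // manager-a fails here and stops renewing its lease.
        b.tick(store, 500, ttl);  // lease still valid, manager-b remains standby
        System.out.println("t=500  b active: " + b.isActive());

        b.tick(store, 1500, ttl); // lease has expired, manager-b takes over
        System.out.println("t=1500 active: " + store.activeHolder(1500).orElse("none"));
    }
}
```

Time is passed in explicitly so the failover can be traced deterministically. The lease TTL bounds how long the cluster goes without an active manager after a failure, so it trades failover speed against tolerance for slow heartbeats.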

Changes

Introduce a standby ShuffleManager to improve stability and high availability.

Test

Unit tests, E2E tests, and manual testing on a cluster.

TanYuxin-tyx, Feb 21 '22 08:02