flink-remote-shuffle
Support standby ShuffleManager
Motivation
Currently, when ShuffleManager fails, its high availability depends on external services to bring it back. In essence, ShuffleManager is a single point of failure.
We can introduce one or more standby ShuffleManagers to solve this problem. When the active ShuffleManager fails, a standby ShuffleManager should automatically switch to active mode, which improves stability and availability.
Changes
Introduce standby ShuffleManagers that take over automatically when the active ShuffleManager fails, improving stability and high availability.
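To illustrate the active/standby idea, here is a minimal in-memory sketch. It is not the code in this change: `LeaderSlot`, `ManagerCandidate`, and the manager IDs are hypothetical names invented for the example, and a real deployment would presumably use a distributed coordination service rather than a shared in-process object.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a distributed leader lock; at most one holder at a time.
final class LeaderSlot {
    private final AtomicReference<String> leader = new AtomicReference<>(null);

    // Atomically claim leadership if nobody holds it.
    boolean tryAcquire(String candidateId) {
        return leader.compareAndSet(null, candidateId);
    }

    // Give up leadership, but only if this candidate is the current holder.
    void release(String candidateId) {
        leader.compareAndSet(candidateId, null);
    }

    String currentLeader() {
        return leader.get();
    }
}

// Hypothetical manager instance that is either active or standby.
final class ManagerCandidate {
    private final String id;
    private final LeaderSlot slot;
    private volatile boolean active;

    ManagerCandidate(String id, LeaderSlot slot) {
        this.id = id;
        this.slot = slot;
    }

    // Called periodically: a standby becomes active as soon as the slot is free.
    boolean contend() {
        if (!active) {
            active = slot.tryAcquire(id);
        }
        return active;
    }

    // Simulates a crash of this instance: leadership is released.
    void fail() {
        if (active) {
            active = false;
            slot.release(id);
        }
    }

    boolean isActive() {
        return active;
    }
}

final class FailoverSketch {
    public static void main(String[] args) {
        LeaderSlot slot = new LeaderSlot();
        ManagerCandidate first = new ManagerCandidate("manager-1", slot);
        ManagerCandidate second = new ManagerCandidate("manager-2", slot);

        first.contend();   // first instance becomes active
        second.contend();  // second instance stays standby
        first.fail();      // active instance goes down
        second.contend();  // standby takes over on its next attempt

        System.out.println("leader after failover: " + slot.currentLeader());
    }
}
```

In this sketch the standby only takes over on its next `contend()` call, so the polling interval bounds the failover time. A real implementation would also need to detect a crashed leader that never releases the slot, for example through session timeouts.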
Test
Unit tests, E2E tests, and manual testing on a cluster.