[FEATURE] Support a quick stateful upgrade for shuffle-server
I think a quick stateful upgrade for the shuffle-server is valuable, especially for a large Uniffle cluster. The current graceful decommission mechanism, driven by the exclude-node file on the coordinator side, works, but it takes too long. In our internal environment it takes about one week for a 400+ node cluster, which is a heavy burden for the ops team.
So it's time to get this feature into the next or a future Uniffle release. The whole write/read process should be considered in detail; I made an earlier attempt in https://github.com/apache/incubator-uniffle/pull/308
The following problems need to be considered.
For the reading client
If the client is reading data from a shuffle-server and that server goes offline, the client should fail fast; that is fine for the normal case.
But when the server is upgrading, the reader should be able to tolerate waiting for a while once it receives the signal from the server, e.g. a QUICK_STATEFUL_UPGRADE state.
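A minimal sketch of what the reader-side tolerance could look like. Everything here (the StatusCode value, the ReadResponse type, the readShuffleData stub) is a hypothetical placeholder for illustration, not the current Uniffle API:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a reader retry loop that tolerates a server that is in a
// quick stateful upgrade instead of failing fast as it would for a dead server.
public class UpgradeAwareReader {

  enum StatusCode { SUCCESS, QUICK_STATEFUL_UPGRADE, ERROR }

  static class ReadResponse {
    final StatusCode status;
    final byte[] data;
    ReadResponse(StatusCode status, byte[] data) {
      this.status = status;
      this.data = data;
    }
  }

  // Stand-in for the real gRPC client, only to keep the sketch self-contained.
  interface ShuffleServerStub {
    ReadResponse readShuffleData(int partitionId);
  }

  private static final int MAX_UPGRADE_RETRIES = 10;
  private static final long RETRY_INTERVAL_MS = 3000;

  byte[] readWithUpgradeTolerance(ShuffleServerStub server, int partitionId)
      throws InterruptedException {
    for (int attempt = 0; attempt <= MAX_UPGRADE_RETRIES; attempt++) {
      ReadResponse response = server.readShuffleData(partitionId);
      switch (response.status) {
        case SUCCESS:
          return response.data;
        case QUICK_STATEFUL_UPGRADE:
          // The server told us it is restarting for an upgrade: wait and retry
          // instead of failing fast.
          TimeUnit.MILLISECONDS.sleep(RETRY_INTERVAL_MS);
          break;
        default:
          throw new RuntimeException("Read failed on partition " + partitionId);
      }
    }
    throw new RuntimeException("Server did not come back after upgrade");
  }
}
```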
For the writing client
When the client is writing and the shuffle-server is upgrading, the client should accept the signal from the shuffle-server and then request a new server to store the partitioned data. This mechanism is called HARD_SPLIT in Celeborn. I think we could also achieve it by extending dynamic assignment to a single partition rather than the whole stage as #308 did.
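A rough sketch of how the writer-side reassignment could work. The method names (sendShuffleData, reassignPartition) and types below are invented for illustration and do not exist in Uniffle today:

```java
// Hypothetical sketch of writer-side partition reassignment when the current
// shuffle-server signals that it is upgrading (HARD_SPLIT-like behavior).
public class UpgradeAwareWriter {

  enum SendStatus { SUCCESS, SERVER_UPGRADING, ERROR }

  interface ShuffleServerStub {
    SendStatus sendShuffleData(int partitionId, byte[] block);
  }

  interface CoordinatorStub {
    // Ask the coordinator for a replacement server for a single partition,
    // instead of re-registering the whole stage as #308 did.
    ShuffleServerStub reassignPartition(int shuffleId, int partitionId);
  }

  private final CoordinatorStub coordinator;
  private ShuffleServerStub currentServer;

  UpgradeAwareWriter(CoordinatorStub coordinator, ShuffleServerStub initialServer) {
    this.coordinator = coordinator;
    this.currentServer = initialServer;
  }

  void send(int shuffleId, int partitionId, byte[] block) {
    SendStatus status = currentServer.sendShuffleData(partitionId, block);
    if (status == SendStatus.SERVER_UPGRADING) {
      // Switch this single partition to a freshly assigned server and resend.
      currentServer = coordinator.reassignPartition(shuffleId, partitionId);
      status = currentServer.sendShuffleData(partitionId, block);
    }
    if (status != SendStatus.SUCCESS) {
      throw new RuntimeException("Failed to send block for partition " + partitionId);
    }
  }
}
```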
Shuffle-server's shuffle data in memory
Before shutdown, all shuffle data held in memory should be persisted to durable storage such as local files. The time cost depends on local file I/O speed.
For example, with 100 GB in memory and 4 SSDs at 500 MB/s each (about 2 GB/s aggregate), the flush takes roughly 50 s, which I think is acceptable. For HDDs it will take longer, and more disks are needed if you want to speed it up.
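A back-of-the-envelope sketch of the pre-shutdown flush, assuming a hypothetical in-memory partition buffer map and one flush thread per local disk; the real Uniffle buffer manager and flush path are different:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical pre-shutdown flush: drain every in-memory partition buffer to a
// local file, with one flush thread per disk. With 100 GB in memory and 4 SSDs
// at ~500 MB/s each (~2 GB/s aggregate), this takes roughly 50 seconds.
public class PreShutdownFlusher {

  private final Map<Integer, byte[]> inMemoryPartitions = new ConcurrentHashMap<>();
  private final List<Path> localDisks = new ArrayList<>();

  PreShutdownFlusher(List<String> diskDirs) {
    for (String dir : diskDirs) {
      localDisks.add(Paths.get(dir));
    }
  }

  void flushAllBeforeShutdown() throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(localDisks.size());
    int i = 0;
    for (Map.Entry<Integer, byte[]> entry : inMemoryPartitions.entrySet()) {
      // Round-robin partitions over the local disks so all spindles are busy.
      Path disk = localDisks.get(i++ % localDisks.size());
      pool.submit(() -> writeToDisk(disk, entry.getKey(), entry.getValue()));
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.MINUTES);
  }

  private void writeToDisk(Path disk, int partitionId, byte[] data) {
    Path target = disk.resolve("partition-" + partitionId + ".data");
    try (FileOutputStream out = new FileOutputStream(target.toFile())) {
      out.write(data);
    } catch (IOException e) {
      throw new RuntimeException("Flush failed for partition " + partitionId, e);
    }
  }
}
```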
Shuffle-server's metadata
Currently, the blockIds and the partition-to-disk mapping are kept in the server's memory; they should also be recovered when the server restarts.
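A minimal sketch of checkpointing that metadata on shutdown and reloading it on restart. The field names and file layout are invented for illustration; the server's real internal structures are richer than two plain maps:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical metadata checkpoint: dump the blockId set and the
// partition -> local-file mapping before shutdown, reload them on restart.
// This only illustrates the save/restore shape, not the real server metadata.
public class MetaCheckpoint implements Serializable {
  private static final long serialVersionUID = 1L;

  Map<Integer, Set<Long>> blockIdsPerPartition = new HashMap<>();
  Map<Integer, String> partitionToLocalFile = new HashMap<>();

  void save(Path checkpointFile) throws IOException {
    try (ObjectOutputStream out =
        new ObjectOutputStream(Files.newOutputStream(checkpointFile))) {
      out.writeObject(this);
    }
  }

  static MetaCheckpoint load(Path checkpointFile)
      throws IOException, ClassNotFoundException {
    try (ObjectInputStream in =
        new ObjectInputStream(Files.newInputStream(checkpointFile))) {
      return (MetaCheckpoint) in.readObject();
    }
  }
}
```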
Are you interested in this feature? @summaryzb
cc @jerqi @smallzhongfeng @leixm @xianjingfeng @zhengchenyu @advancedxy
I'm in, please assign this to me.