
[FEATURE] Support stateful upgrade for shuffle-server more quickly

Open zuston opened this issue 2 years ago • 2 comments

I think a quick stateful upgrade for shuffle-server is meaningful, especially for a large Uniffle cluster. The current graceful decommission mechanism, which relies on the exclude-node file on the coordinator side, is useful but takes too much time. In our internal environment it takes 1 week for a cluster of 400+ nodes, which is a burden for ops people.

So it's time to include this feature in the next or a future Uniffle version. The whole write and read process should be considered in detail, although I made an earlier attempt in https://github.com/apache/incubator-uniffle/pull/308

The following problems need to be considered:

For reading client

If a client is reading data from a shuffle-server and that shuffle-server goes offline, the client should fail fast. This is fine for the normal process.

But when the server is upgrading, the reader client should be able to tolerate waiting for a while once it receives a signal from the server, such as a QUICK_STATEFUL_UPGRADE state.
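A minimal sketch of this reader behaviour, assuming the server can report one of three states. The class, enum, and method names here are hypothetical and are not part of the current Uniffle client API; only the QUICK_STATEFUL_UPGRADE name comes from the proposal above.

```java
import java.util.function.Supplier;

// Hypothetical sketch: not the actual Uniffle client code.
class ReaderRetrySketch {
    // Assumed server states; only QUICK_STATEFUL_UPGRADE is named in the proposal.
    enum ServerState { ACTIVE, QUICK_STATEFUL_UPGRADE, OFFLINE }

    /**
     * Returns true if the read may proceed, false if the reader should fail.
     * An OFFLINE server fails fast; an upgrading server is polled again
     * up to maxWaits times, sleeping waitMillis between polls.
     */
    static boolean awaitServer(Supplier<ServerState> probe, int maxWaits, long waitMillis) {
        for (int attempt = 0; ; attempt++) {
            ServerState state = probe.get();
            if (state == ServerState.ACTIVE) {
                return true;   // server is serving reads again
            }
            if (state == ServerState.OFFLINE) {
                return false;  // normal process: fail fast
            }
            // state == QUICK_STATEFUL_UPGRADE: tolerate a bounded wait
            if (attempt >= maxWaits) {
                return false;  // waited long enough, give up
            }
            try {
                Thread.sleep(waitMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The bound on waiting (`maxWaits` and `waitMillis`) is an assumption; in practice it would probably be tied to how long a server restart is expected to take.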

For writing client

When a client is writing and the shuffle-server is upgrading, the client should receive the signal from the shuffle-server and then request a new server to store the partition's data. This mechanism is called HARD_SPLIT in Celeborn. I think we could also achieve it by extending dynamic assignment to work on a single partition rather than on the whole stage, as #308 did.
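The per-partition reassignment could look roughly like the sketch below. All names are hypothetical, and the queue of spare servers is a stand-in assumption for asking the coordinator for a new assignment.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not the actual Uniffle writer or coordinator API.
class PartitionReassignSketch {
    private final Map<Integer, String> partitionToServer = new HashMap<>();
    // Stand-in for requesting a fresh server from the coordinator.
    private final Deque<String> spareServers = new ArrayDeque<>();

    PartitionReassignSketch(Map<Integer, String> initial, List<String> spares) {
        partitionToServer.putAll(initial);
        spareServers.addAll(spares);
    }

    String serverFor(int partition) {
        return partitionToServer.get(partition);
    }

    /**
     * Called when a write for this partition is answered with an "upgrading"
     * signal. Only this partition is moved; other partitions on the same
     * server keep their assignment until they receive the signal themselves.
     */
    String onUpgradeSignal(int partition) {
        String replacement = spareServers.poll();
        if (replacement == null) {
            throw new IllegalStateException("no spare server available");
        }
        partitionToServer.put(partition, replacement);
        return replacement;
    }
}
```

Moving one partition at a time keeps the change local to the writer that saw the signal, which is the difference from reassigning the whole stage.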

Shuffle-server's shuffle data in memory

Before shutdown, all shuffle data in memory should be flushed to persistent storage such as localfile. The time this takes depends on the localfile IO speed.

For example, if the memory size is 100G and the shuffle-server has 4 SSDs (500M/s each), flushing takes about 50s, which I think is acceptable. For HDDs the time will be longer, and more disks are needed to speed it up.
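The estimate above is just memory size divided by aggregate disk throughput. A small sketch of that arithmetic, assuming 100G is treated as 100,000 MB and that writes are spread evenly across all disks (the class and method names are made up for illustration):

```java
// Hypothetical helper: back-of-the-envelope flush time, not a benchmark.
class FlushTimeEstimate {
    /**
     * Seconds needed to flush memoryMb of data, assuming every disk
     * sustains mbPerSecPerDisk and the load is balanced across disks.
     */
    static double seconds(double memoryMb, int disks, double mbPerSecPerDisk) {
        return memoryMb / (disks * mbPerSecPerDisk);
    }
}
```

With the numbers from the example, `FlushTimeEstimate.seconds(100_000, 4, 500)` gives 50.0. Real flush times will be higher if the disks are also serving reads or if the data is not evenly distributed.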

Shuffle-server's meta data

Currently, the block IDs and the partition-to-disk mapping are stored in the server's memory, so they also need to be recovered on restart.
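One possible shape for this, sketched with plain Java serialization: write a snapshot of the metadata to local disk before shutdown and load it on startup. The class, its fields, and the on-disk format are all assumptions for illustration; plain maps and lists stand in for whatever structures the server actually uses.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;

// Hypothetical sketch: not the actual shuffle-server metadata layout.
class ServerMetaSnapshot implements Serializable {
    private static final long serialVersionUID = 1L;

    // partition id -> local disk path holding that partition's data
    final HashMap<Integer, String> partitionToDisk = new HashMap<>();
    // partition id -> block IDs already accepted for that partition
    final HashMap<Integer, ArrayList<Long>> partitionToBlockIds = new HashMap<>();

    /** Writes the snapshot to the given file before the server shuts down. */
    void save(Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the snapshot back when the upgraded server starts. */
    static ServerMetaSnapshot load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (ServerMetaSnapshot) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Java serialization is used here only to keep the sketch short; a versioned format would be safer when the old and new server binaries differ.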

zuston avatar Nov 10 '23 03:11 zuston

Are you interested in this feature? @summaryzb

cc @jerqi @smallzhongfeng @leixm @xianjingfeng @zhengchenyu @advancedxy

zuston avatar Nov 10 '23 03:11 zuston

I'm in. Please assign this to me.

summaryzb avatar Nov 13 '23 13:11 summaryzb