[FEATURE] Support stage recompute for Spark clients
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Search before asking
- [X] I have searched in the issues and found no similar issues.
Describe the feature
Once a read of Uniffle's shuffle data fails, the Spark client should have the chance to recompute the whole stage.
Motivation
In a distributed cluster with a large number of shuffle servers, it's common for some nodes to go down, for example:
- node crash/maintenance due to hardware failures or security patches
- pod eviction when deployed in a Kubernetes environment
- VM/spot instance eviction when deployed in a cloud environment
Uniffle already provides a mechanism to overcome this issue: the quorum protocol. However, it requires multiple replicas of the same shuffle data, which increases network traffic and memory pressure on the shuffle servers, and end-to-end performance may degrade due to the replication.
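To illustrate why the quorum protocol tolerates single-node failures but costs extra replication, here is a toy sketch of quorum-style reads. This is not Uniffle's actual implementation; the replica count, quorum size, and all names are illustrative assumptions.

```java
// Toy sketch of a quorum read, NOT Uniffle's actual code.
// Assumption: each block is written to REPLICAS servers, and a read
// succeeds as long as READ_QUORUM replicas are still reachable.
public class QuorumSketch {
    static final int REPLICAS = 3;     // hypothetical replica count
    static final int READ_QUORUM = 2;  // hypothetical read quorum

    // servers[i] == null simulates a crashed/evicted shuffle server.
    static boolean readSucceeds(String[] servers) {
        int alive = 0;
        for (String s : servers) {
            if (s != null) alive++;
        }
        return alive >= READ_QUORUM;
    }

    public static void main(String[] args) {
        String[] healthy = {"blockA", "blockA", "blockA"};
        String[] oneDown = {"blockA", null, "blockA"};
        String[] twoDown = {"blockA", null, null};
        System.out.println(readSucceeds(healthy)); // survives
        System.out.println(readSucceeds(oneDown)); // survives one failure
        System.out.println(readSucceeds(twoDown)); // read fails
    }
}
```

The trade-off is visible even in this toy: tolerating one lost node requires shipping and buffering every block three times, which is exactly the network and memory overhead the stage-recompute approach aims to avoid.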
I'd like to provide a new way to tolerate these (rare) node failures. If the whole stage can be recomputed, the Spark application becomes resilient to shuffle server node failures without paying the replication cost.
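The intended control flow can be sketched as: a shuffle read fails, the failure is surfaced to the scheduler, and the scheduler recomputes the whole stage (rewriting the shuffle data) before the read is retried. The following toy simulation shows that loop; it is not Spark or Uniffle code, and all names in it are hypothetical. In real Spark, the client would throw a FetchFailedException so the DAGScheduler resubmits the parent stage.

```java
// Toy simulation of the proposed stage-recompute flow.
// NOT actual Spark/Uniffle code; all names are hypothetical.
import java.util.concurrent.atomic.AtomicInteger;

public class StageRecomputeSketch {
    static final AtomicInteger attempts = new AtomicInteger();

    // Simulates reading shuffle data: the first attempt fails (as if the
    // shuffle server died); after the stage is "recomputed" the data is
    // available again and the read succeeds.
    static String readShuffleData() throws Exception {
        if (attempts.incrementAndGet() == 1) {
            throw new Exception("shuffle server unreachable");
        }
        return "data";
    }

    // On a fetch failure, surface it to the scheduler, which recomputes
    // the whole stage and then retries the read.
    static String readWithStageRecompute() {
        while (true) {
            try {
                return readShuffleData();
            } catch (Exception e) {
                // In real Spark this would be a FetchFailedException that
                // makes the DAGScheduler resubmit the parent stage.
                System.out.println("fetch failed, recomputing stage: "
                        + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(readWithStageRecompute());
    }
}
```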
Describe the solution
TBD.
A design doc will be added later.
Additional context
No response
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
I wrote up a design doc for this issue: https://docs.google.com/document/d/1OGswqDDQ52rpw5Lat-FpEGDfX1T6EFXYLC7L0G5O4WE/edit?usp=sharing
@zuston @xianjingfeng @jerqi would you mind doing a design review?
@advancedxy I have some comments in the design doc. Please grant me 'comment' permission.
Changed the default permission from viewer to commenter. Please refresh and check whether you have comment permission now.
I thought viewers already had comment permission.
@YutingWang98 Uniffle is a remote shuffle service. We have supported stage retry. Maybe you'd be interested in it.
What is remaining on this particular issue Spark side and are there any docs on how this is enabled?
Wait for the pr https://github.com/apache/incubator-uniffle/pull/1129