Bo Yang comments

Results: 48 comments by Bo Yang

Can Rss have stage retry when one server is down?

You are right that the server must have been down and was removed from ZooKeeper after losing its heartbeat with ZooKeeper. The current RSS implementation assigns a static list of servers in the...

Can Rss have stage retry when one server is down?

> "spark.shuffle.rss.maxServerCount" and "spark.shuffle.rss.minServerCount" > Thank you for the suggestions, @hiboyang! Does this mean the shuffle data written to the server will be doubled if I set 'spark.shuffle.rss.replicas' to...

Can Rss have stage retry when one server is down?

> Hi, @hiboyang. If 'spark.shuffle.rss.replicas' does write double the amount of data to the server, we unfortunately won't be able to use it for large jobs with 400+ TB of shuffle data....

How to gracefully retention one RSS server?

This is a good feature, but it is not implemented in RSS yet. You are welcome to contribute it.

Can Rss have stage retry when one server is down?

Hi @YutingWang98, you are right! This is a limitation of the current RSS. Maybe we could brainstorm ideas sometime on how to improve this.

How to gracefully retention one RSS server?

There is a high-level design doc in this repo (https://github.com/uber/RemoteShuffleService/blob/master/docs/server-high-level-design.md) which describes the concepts of how RSS works. I no longer work at Uber, thus not actively working...

Can Rss have stage retry when one server is down?

@mayurdb, this is pretty cool, thanks for sharing how you fixed this issue inside Uber! Is it possible to share the patch you made to the Spark code?

spark 3.0

Yeah, these look related to different Maven versions. @altaetran, do you get any errors/exceptions when "unable to get any tasks completed during a test spark run"?

Using remote shuffle service with Spark operator

Oops, I just found that I replied using my other GitHub account. That @datapunchorg is still me.

close writers during end of stage using onStageCompleted

@tuckerh99, thanks for the PR! Curious whether you run your Spark production workload with shuffle files stored on S3? If so, how is the performance?