Bo Yang

Results 48 comments of Bo Yang

You are right that the server should be down and removed from ZooKeeper after losing its heartbeat with ZooKeeper. The current RSS implementation assigns a static list of servers in the...

> "spark.shuffle.rss.maxServerCount" and "spark.shuffle.rss.minServerCount" > Thank you for the suggestions @hiboyang ! Does this mean the shuffle data written to the server will be doubled if I set 'spark.shuffle.rss.replicas' to...

> Hi, @hiboyang. If 'spark.shuffle.rss.replicas' does write double the amount of data to the server, we won't be able to use this for large jobs with 400+ TB of shuffle data, unfortunately....

This is a good feature, but it is not implemented in RSS yet. You are welcome to contribute it.

Hi @YutingWang98, you are right! This is a limitation of the current RSS. Maybe we could brainstorm ideas sometime on how to improve this.

There is a high-level design doc in this repo (https://github.com/uber/RemoteShuffleService/blob/master/docs/server-high-level-design.md) which describes the concepts of how RSS works. I do not work at Uber any more, thus not actively working...

@mayurdb, this is pretty cool, thanks for sharing how you fixed this issue inside Uber! Is it possible to share the patch you made to the Spark code?

Yeah, these look related to different Maven versions. @altaetran, do you get any errors/exceptions when "unable to get any tasks completed during a test spark run"?

Oops, I just realized I replied using my other GitHub account. That @datapunchorg is still me.

@tuckerh99, thanks for the PR! I am curious whether you run your Spark production workload with shuffle files stored on S3. If so, how is the performance?