Hi @Lobo2008, this is good testing! It looks like RSS needs to support this scenario where rootDir becomes unavailable. The RSS client should mark that server as failed in that case and...
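For illustration, here is a minimal sketch of what "mark that server as failed" could look like on the client side. This is not existing RSS code; the class and method names are hypothetical.

```java
// Hypothetical client-side tracker: once a server's rootDir is found to be
// unavailable, the client records it and skips it for subsequent writes.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class FailedServerTracker {
    private final Set<String> failedServers = ConcurrentHashMap.newKeySet();

    // Called when a write to this server fails with a disk/rootDir error.
    void markFailed(String serverId) {
        failedServers.add(serverId);
    }

    // Checked before picking a server for the next write.
    boolean isHealthy(String serverId) {
        return !failedServers.contains(serverId);
    }
}
```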
By the way @Lobo2008, I want to double-check: would you expand `+details` for the first block of exceptions to see whether there are more clues?
Thanks @Lobo2008 for the debugging info! I checked the source code again. The [code](https://github.com/uber/RemoteShuffleService/blob/7220c23694e0175e01719621707680a2718173cf/src/main/java/com/uber/rss/clients/ReplicatedWriteClient.java#L145) in RSS is supposed to try another server if it hits an error with one server, including disk...
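To make the intended behavior concrete, here is a stripped-down sketch of that try-next-replica pattern. The names below are illustrative, not the actual RSS API.

```java
import java.util.List;

// Hypothetical replica writer: try each replica in order and return on the
// first success; if every replica fails, rethrow the last error.
class ReplicaFailoverWriter {
    interface WriteClient {
        void write(byte[] record);
    }

    private final List<WriteClient> replicas;

    ReplicaFailoverWriter(List<WriteClient> replicas) {
        this.replicas = replicas; // assumes at least one replica
    }

    void write(byte[] record) {
        RuntimeException lastError = null;
        for (WriteClient replica : replicas) {
            try {
                replica.write(record);     // succeeded on this replica
                return;
            } catch (RuntimeException e) { // e.g. disk error on that server
                lastError = e;             // remember it, try the next replica
            }
        }
        throw lastError;                   // all replicas failed
    }
}
```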
Yes, if that replicas setting does not work for you, another option is the `spark.shuffle.rss.excludeHosts` setting, which you can use to exclude the server with the bad disk.
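For example, set it on the Spark config; the hostname here is a placeholder, and I am assuming multiple hosts would be comma-separated (please verify against the config parsing code):

```java
import org.apache.spark.SparkConf;

public class ExcludeHostsExample {
    public static void main(String[] args) {
        // Exclude the shuffle server whose disk is failing.
        SparkConf conf = new SparkConf()
            .set("spark.shuffle.rss.excludeHosts", "bad-disk-host-1.example.com");
    }
}
```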
Hi @Lobo2008, if I remember correctly, yes, the 501GB will be kept for 36 hours according to DEFAULT_APP_FILE_RETENTION_MILLIS (default 36h). The reason is that a Spark application needs shuffle files from previous stages...
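Roughly, the retention idea looks like the following sketch: delete an application's shuffle directory only once it is older than the retention window. This is a simplified illustration, not the actual RSS cleanup code.

```java
import java.io.File;

class RetentionSketch {
    // Mirrors the 36h default mentioned above.
    static final long DEFAULT_APP_FILE_RETENTION_MILLIS = 36L * 60 * 60 * 1000;

    static void cleanup(File rootDir) {
        long now = System.currentTimeMillis();
        File[] appDirs = rootDir.listFiles();
        if (appDirs == null) {
            return;
        }
        for (File appDir : appDirs) {
            // Keep files inside the retention window, because later stages
            // may still need shuffle output from earlier stages.
            if (now - appDir.lastModified() > DEFAULT_APP_FILE_RETENTION_MILLIS) {
                deleteRecursively(appDir);
            }
        }
    }

    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File c : children) {
                deleteRecursively(c);
            }
        }
        f.delete();
    }
}
```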
Hi @Lobo2008, you are right. It could track the stage dependency and clean up stage shuffle files selectively. Need someone to work on this :)
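Purely as a sketch of the idea (nothing like this exists in RSS today): track how many downstream stages still depend on a stage's shuffle output, and delete that stage's files once the count reaches zero.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical selective cleanup: reference-count consumers per producer stage.
class StageShuffleCleanup {
    private final Map<Integer, Integer> pendingConsumers = new ConcurrentHashMap<>();

    // Called when a downstream stage is registered against a producer stage.
    void registerDependency(int producerStage) {
        pendingConsumers.merge(producerStage, 1, Integer::sum);
    }

    // Called when a downstream stage finishes reading the producer's output.
    void consumerFinished(int producerStage) {
        Integer remaining = pendingConsumers.computeIfPresent(
            producerStage, (stage, n) -> n - 1);
        if (remaining != null && remaining <= 0) {
            pendingConsumers.remove(producerStage);
            deleteShuffleFiles(producerStage); // safe: no remaining consumers
        }
    }

    private void deleteShuffleFiles(int stageId) {
        // Delete the shuffle files for stageId here.
    }
}
```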
RSS cannot use multiple disks so far, since it can only be configured with one directory. Again, this part could be changed as well, with contributions welcome. If disk is...
Previously, RSS was not tested much with Spark 3.1/3.2 and Adaptive Query Execution (AQE). The code looks like it has a bug. Would love to see someone debug further there.
Yeah, agree it is confusing here. Spark 3.1 and 3.2 have slight differences in their shuffle APIs, thus we need to change Remote Shuffle Service accordingly. I used to work on...
I see. In that case, you could change 2.4.3 in pom.xml to a Spark 3 version. You will get some compile errors, and you could start from there. I tried to...
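As a starting point, the change would look roughly like this in pom.xml; the exact property/dependency layout in the RSS pom may differ, and note that Spark 3.x requires the Scala 2.12 artifacts:

```xml
<!-- Illustrative fragment only; adjust to how the RSS pom declares Spark. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>3.1.2</version> <!-- was 2.4.3 -->
</dependency>
```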