Hi @Lobo2008, this is good testing! It looks like RSS needs to support this scenario where rootDir becomes unavailable. The RSS client should mark that server as failed in that case and...
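For illustration, here is a minimal sketch of what "mark that server as failed" could look like on the client side. This is not existing RSS code; the class and method names are hypothetical.

```java
// Hypothetical client-side tracker: once a server's rootDir is found to be
// unavailable, the client records it and skips it for subsequent writes.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class FailedServerTracker {
    private final Set<String> failedServers = ConcurrentHashMap.newKeySet();

    // Called when a write to this server fails with a disk/rootDir error.
    void markFailed(String serverId) {
        failedServers.add(serverId);
    }

    // Checked before picking a server for the next write.
    boolean isHealthy(String serverId) {
        return !failedServers.contains(serverId);
    }
}
```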
By the way @Lobo2008, I want to double-check: would you expand `+details` for the first block of exceptions to see whether there are more clues?
Thanks @Lobo2008 for the debugging info! I checked the source code again. The [code](https://github.com/uber/RemoteShuffleService/blob/7220c23694e0175e01719621707680a2718173cf/src/main/java/com/uber/rss/clients/ReplicatedWriteClient.java#L145) in RSS is supposed to try another server if it hits an error with one server, including disk...
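To make the intended behavior concrete, here is a stripped-down sketch of that try-next-replica pattern. The names below are illustrative, not the actual RSS API.

```java
import java.util.List;

// Hypothetical replica writer: try each replica in order and return on the
// first success; if every replica fails, rethrow the last error.
class ReplicaFailoverWriter {
    interface WriteClient {
        void write(byte[] record);
    }

    private final List<WriteClient> replicas;

    ReplicaFailoverWriter(List<WriteClient> replicas) {
        this.replicas = replicas; // assumes at least one replica
    }

    void write(byte[] record) {
        RuntimeException lastError = null;
        for (WriteClient replica : replicas) {
            try {
                replica.write(record);     // succeeded on this replica
                return;
            } catch (RuntimeException e) { // e.g. disk error on that server
                lastError = e;             // remember it, try the next replica
            }
        }
        throw lastError;                   // all replicas failed
    }
}
```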
Yes, if that replicas setting does not work for you, another option is the `spark.shuffle.rss.excludeHosts` setting, which you can use to exclude the server with the bad disk.
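For example, set it on the Spark config; the hostname here is a placeholder, and I am assuming multiple hosts would be comma-separated (please verify against the config parsing code):

```java
import org.apache.spark.SparkConf;

public class ExcludeHostsExample {
    public static void main(String[] args) {
        // Exclude the shuffle server whose disk is failing.
        SparkConf conf = new SparkConf()
            .set("spark.shuffle.rss.excludeHosts", "bad-disk-host-1.example.com");
    }
}
```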
Hi @Lobo2008, if I remember correctly, yes, the 501GB will be kept for 36 hours according to DEFAULT_APP_FILE_RETENTION_MILLIS (default 36h). The reason is that a Spark application needs shuffle files from previous stages...
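Roughly, the retention idea looks like the following sketch: delete an application's shuffle directory only once it is older than the retention window. This is a simplified illustration, not the actual RSS cleanup code.

```java
import java.io.File;

class RetentionSketch {
    // Mirrors the 36h default mentioned above.
    static final long DEFAULT_APP_FILE_RETENTION_MILLIS = 36L * 60 * 60 * 1000;

    static void cleanup(File rootDir) {
        long now = System.currentTimeMillis();
        File[] appDirs = rootDir.listFiles();
        if (appDirs == null) {
            return;
        }
        for (File appDir : appDirs) {
            // Keep files inside the retention window, because later stages
            // may still need shuffle output from earlier stages.
            if (now - appDir.lastModified() > DEFAULT_APP_FILE_RETENTION_MILLIS) {
                deleteRecursively(appDir);
            }
        }
    }

    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File c : children) {
                deleteRecursively(c);
            }
        }
        f.delete();
    }
}
```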
Hi @Lobo2008, you are right. It could track the stage dependency and clean up stage shuffle files selectively. Need someone to work on this :)
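Purely as a sketch of the idea (nothing like this exists in RSS today): track how many downstream stages still depend on a stage's shuffle output, and delete that stage's files once the count reaches zero.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical selective cleanup: reference-count consumers per producer stage.
class StageShuffleCleanup {
    private final Map<Integer, Integer> pendingConsumers = new ConcurrentHashMap<>();

    // Called when a downstream stage is registered against a producer stage.
    void registerDependency(int producerStage) {
        pendingConsumers.merge(producerStage, 1, Integer::sum);
    }

    // Called when a downstream stage finishes reading the producer's output.
    void consumerFinished(int producerStage) {
        Integer remaining = pendingConsumers.computeIfPresent(
            producerStage, (stage, n) -> n - 1);
        if (remaining != null && remaining <= 0) {
            pendingConsumers.remove(producerStage);
            deleteShuffleFiles(producerStage); // safe: no remaining consumers
        }
    }

    private void deleteShuffleFiles(int stageId) {
        // Delete the shuffle files for stageId here.
    }
}
```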
RSS cannot use multiple disks so far, since it can only be configured with one directory. Again, this part could be changed as well, with contributions welcome. If disk is...
Previously, RSS was not tested much with Spark 3.1/3.2 and Adaptive Query Execution (AQE). The code looks like it has a bug. Would love to see someone debug further there.
Yeah, agree it is confusing here. Spark 3.1 and 3.2 have slight differences in their shuffle APIs, thus we need to change Remote Shuffle Service accordingly. I used to work on...
I see. In that case, you could change 2.4.3 in pom.xml to a Spark 3 version. You will get some compile errors, and you could start from there. I tried to...
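As a starting point, the change would look roughly like this in pom.xml; the exact property/dependency layout in the RSS pom may differ, and note that Spark 3.x requires the Scala 2.12 artifacts:

```xml
<!-- Illustrative fragment only; adjust to how the RSS pom declares Spark. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>3.1.2</version> <!-- was 2.4.3 -->
</dependency>
```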