
What may cause RssInvalidServerVersionException?

Open Lobo2008 opened this issue 2 years ago • 2 comments

Hi, I am wondering:

Q1. Will RssInvalidServerVersionException occur when RSS-i crashes for some reason and is immediately restarted by a shell script while some applications are still using it? The clients still store the former RSS-i version, but the version of the newly registered RSS-i has already changed.

# Could this other exception also have the same cause?
org.apache.spark.shuffle.FetchFailedException: Detected server restart, current server: Server{rss04.xxx:12203, 1675897753258, rss04xxx:/data/}, previous server: Server{rss04.xxxx:12203, 1675895945858, rss04xxx:/data/}
	at org.apache.spark.shuffle.RssShuffleManager$$anon$2.resolveConnection(RssShuffleManager.scala:220)
	at com.uber.rss.clients.ServerConnectionCacheUpdateRefresher.refreshConnection(ServerConnectionCacheUpdateRefresher.java:49)
	at com.uber.rss.clients.ServerIdAwareSyncWriteClient.connectImpl(ServerIdAwareSyncWriteClient.java:133)
	at ...

Q2. What may cause this exception?

org.apache.spark.shuffle.FetchFailedException: Cannot fetch shuffle 0 partition 362 due to RssAggregateException (RssShuffleStageNotStartedException (Shuffle not started: DataBlockSocketReadClient 274 [/10.2xxx44973 -> /10.20xxx:12212 (1xxxx28)])
com.uber.rss.exceptions.RssShuffleStageNotStartedException: Shuffle not started: DataBlockSocketReadClient 274 [/10.2xxxx:44973 -> /10.2xxx12212 (10.xxxx)]
	at com.uber.rss.clients.ClientBase.checkOKResponseStatus(ClientBase.java:291)
	at com.uber.rss.clients.ClientBase.readResponseStatus(ClientBase.java:275)
	at ...

Lobo2008 commented on Feb 08 '23 12:02

Q1: You are right. This happens because the server restarted after the client had already connected to the earlier server instance. Ideally this should not be an issue. Maybe we can remove this check, @hiboyang?
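To illustrate the kind of check being discussed, here is a minimal sketch of a client-side restart detector. It assumes the second field in `Server{...}` from the stack trace is a per-start version that changes on every restart. `ServerRestartDetector`, `ServerRestartedException`, and `check` are hypothetical names for illustration, not the actual RSS client code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names do not come from the RSS code base.
final class ServerRestartDetector {
    /** Thrown when a known server reports a version different from the cached one. */
    static final class ServerRestartedException extends RuntimeException {
        ServerRestartedException(String message) {
            super(message);
        }
    }

    // Version recorded on first contact with each server address ("host:port").
    private final Map<String, Long> knownVersions = new ConcurrentHashMap<>();

    /**
     * Records the version on first contact. On later contacts, fails if the
     * server reports a different version (assumed here to mean it restarted).
     */
    void check(String address, long reportedVersion) {
        Long previous = knownVersions.putIfAbsent(address, reportedVersion);
        if (previous != null && previous != reportedVersion) {
            throw new ServerRestartedException(
                "Detected server restart, current version: " + reportedVersion
                    + ", previous version: " + previous);
        }
    }
}
```

Under this assumption, an application that cached the old version before the crash would fail on its next connection to the restarted server, which matches the scenario described in Q1.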

Q2: That basically means the server you are trying to connect to has not yet received the shuffle data for the corresponding partition (identified using appId, appAttemptId, and shuffleId). Is this also happening when the server restarts?
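A rough sketch of how a server could produce a "Shuffle not started" error follows. It assumes the server tracks started shuffle stages in memory, keyed by the three identifiers mentioned above. `ShuffleStageRegistry`, `StageKey`, `onWriterConnected`, and `checkReadable` are hypothetical names, and the real server may track this state differently.

```java
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names do not come from the RSS code base.
final class ShuffleStageRegistry {
    /** Identifies one shuffle stage by appId, appAttemptId, and shuffleId. */
    static final class StageKey {
        final String appId;
        final String appAttemptId;
        final int shuffleId;

        StageKey(String appId, String appAttemptId, int shuffleId) {
            this.appId = appId;
            this.appAttemptId = appAttemptId;
            this.shuffleId = shuffleId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof StageKey)) return false;
            StageKey k = (StageKey) o;
            return shuffleId == k.shuffleId
                && appId.equals(k.appId)
                && appAttemptId.equals(k.appAttemptId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(appId, appAttemptId, shuffleId);
        }

        @Override
        public String toString() {
            return appId + "/" + appAttemptId + "/" + shuffleId;
        }
    }

    static final class ShuffleStageNotStartedException extends RuntimeException {
        ShuffleStageNotStartedException(String message) {
            super(message);
        }
    }

    // Stages for which at least one writer has connected to this server.
    private final Set<StageKey> started = ConcurrentHashMap.newKeySet();

    /** Called when a writer first sends data for a stage. */
    void onWriterConnected(String appId, String appAttemptId, int shuffleId) {
        started.add(new StageKey(appId, appAttemptId, shuffleId));
    }

    /** Called before serving a read; fails if no writer has reached this server. */
    void checkReadable(String appId, String appAttemptId, int shuffleId) {
        StageKey key = new StageKey(appId, appAttemptId, shuffleId);
        if (!started.contains(key)) {
            throw new ShuffleStageNotStartedException("Shuffle not started: " + key);
        }
    }
}
```

If such state were held only in memory, a restart would empty it, so a reader arriving afterwards would see the stage as not started. That is one possible link between Q1 and Q2, but the thread does not confirm it.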

mayurdb commented on Mar 20 '23 06:03

Previously, RSS did not handle server restarts well, which is why those checks were added. I feel we could remove them now.

hiboyang commented on May 03 '23 16:05