Fix: Improve shuffle unregister timeouts
What changes were proposed in this pull request?
Use the spark.rss.client.unregister.request.timeout.sec timeout for the individual gRPC calls and introduce an overall spark.rss.client.unregister.timeout.sec timeout.
Why are the changes needed?
When unregistering with many shuffle servers and a relatively small task pool, some requests are executed sequentially, but the current per-request timeout is applied to all requests together.
Example: 100 shuffle servers, 10 threads: completion of all tasks currently times out after 10 seconds, giving each request effectively only 1 second (even though the per-request timeout is 10s).
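The arithmetic in the example above can be sketched as follows (a hypothetical illustration; the class and method names are not from the Uniffle codebase): with a fixed thread pool, the requests run in sequential waves, so the overall budget has to cover waves times the per-request timeout.

```java
public class UnregisterTimeoutBudget {
    /** Number of sequential waves when taskCount tasks share a fixed pool. */
    static int waves(int taskCount, int poolSize) {
        return (taskCount + poolSize - 1) / poolSize; // ceiling division
    }

    public static void main(String[] args) {
        int servers = 100, threads = 10;
        int overallTimeoutSec = 10;    // current behaviour: per-request timeout applied to the whole batch
        int perRequestTimeoutSec = 10;

        int w = waves(servers, threads);                   // 100 requests in waves of 10
        int effectivePerWaveSec = overallTimeoutSec / w;   // budget each wave actually gets
        int requiredOverallSec = w * perRequestTimeoutSec; // budget the batch would really need

        System.out.println(w + " waves, " + effectivePerWaveSec
            + "s per wave, need " + requiredOverallSec + "s overall");
    }
}
```

With the numbers from the example this prints `10 waves, 1s per wave, need 100s overall`, which is exactly the mismatch the PR fixes.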
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Not tested.
Codecov Report
Attention: Patch coverage is 39.39394% with 40 lines in your changes missing coverage. Please review.
Project coverage is 53.39%. Comparing base (6f6d35a) to head (4a84b98). Report is 34 commits behind head on master.
Additional details and impacted files
```diff
@@             Coverage Diff              @@
##             master    #1738      +/-   ##
============================================
- Coverage     54.86%   53.39%   -1.47%
+ Complexity     2358     2201     -157
============================================
  Files           368      354      -14
  Lines         16379    15230    -1149
  Branches       1504     1386     -118
============================================
- Hits           8986     8132     -854
+ Misses         6862     6617     -245
+ Partials        531      481      -50
```
Test Results
2 433 files ±0, 2 433 suites ±0, 4h 59m 58s :stopwatch: -1m 28s
936 tests +2: 935 :white_check_mark: +2, 1 :zzz: ±0, 0 :x: ±0
10 856 runs +28: 10 842 :white_check_mark: +28, 14 :zzz: ±0, 0 :x: ±0
Results for commit 3c99f121. ± Comparison against base commit 42933f47.
This PR is more reasonable. But I'm worried about the new total timeout being too large, which will cause the driver to take too long to stop, won't it?
Right, but I would say this is a consequence of the configuration (timeout per shuffle service and concurrency).
Users can increase concurrency easily. These threads are cheap as they primarily wait for the response.
> which will cause the driver to take too long to stop
Only if individual requests take very long or time out...
Yeah, should we add an upper bound on the total timeout for this situation? E.g. total timeout <= 3 min, maybe something like that?
Alright, let's introduce spark.rss.client.unregister.timeout.sec, which replaces what spark.rss.client.unregister.request.timeout.sec did, and have spark.rss.client.unregister.request.timeout.sec apply to an individual request (as the name implies).
Let's further log a warning if spark.rss.client.unregister.timeout.sec is too small.
Unfortunately, configuring the relevant values is only supported by Spark clients, so the warning will be pointless for other clients.
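A minimal sketch of how the two timeouts discussed above could interact (the config names are from the PR, but the class, the helper, and the warning condition are assumptions, not the actual implementation): the per-request value bounds each gRPC call, the overall value bounds the whole unregister phase, and a warning fires when the overall budget cannot cover all sequential waves.

```java
public class UnregisterTimeouts {
    // spark.rss.client.unregister.request.timeout.sec (bounds each gRPC call)
    final long requestTimeoutSec;
    // spark.rss.client.unregister.timeout.sec (bounds the whole unregister phase)
    final long overallTimeoutSec;

    UnregisterTimeouts(long requestTimeoutSec, long overallTimeoutSec) {
        this.requestTimeoutSec = requestTimeoutSec;
        this.overallTimeoutSec = overallTimeoutSec;
    }

    /** True if the overall budget is too small for every sequential wave
     *  of requests to use its full per-request timeout. */
    boolean overallTooSmall(int serverCount, int threadCount) {
        long waves = (serverCount + threadCount - 1) / threadCount;
        return overallTimeoutSec < waves * requestTimeoutSec;
    }

    public static void main(String[] args) {
        UnregisterTimeouts t = new UnregisterTimeouts(10, 10);
        if (t.overallTooSmall(100, 10)) {
            System.out.println("WARN: unregister timeout of " + t.overallTimeoutSec
                + "s may be too small for 100 servers on 10 threads");
        }
    }
}
```

With 100 servers, 10 threads, and both timeouts at 10s, the sketch warns, since 10 waves would need up to 100s in the worst case.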
So I am not sure why we are seeing those interrupt exceptions. Either they indicate timeouts, or the interrupts have some other cause.
Also, the reasons are still unclear. We need to know what caused this.
@rickyma I have found the root cause:
https://github.com/apache/incubator-uniffle/blob/eb164a8fa79cf70dcdbb13774686663cf11bc8af/client/src/main/java/org/apache/uniffle/client/impl/ShuffleWriteClientImpl.java#L977-L995
Method ThreadUtils.executeTasks is called with a timeout of unregisterRequestTimeSec, which is clearly in seconds, whereas ThreadUtils.executeTasks expects milliseconds. Therefore, executeTasks returns after 10 ms (with the default value of 10) and the still-running tasks are interrupted.
Fixed in #1766.
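The unit bug can be shown in miniature (a hedged sketch: only ThreadUtils.executeTasks and the timeout name come from the source; the stand-in method here is illustrative): a value in seconds was passed to an API expecting milliseconds, so the fix is an explicit unit conversion.

```java
import java.util.concurrent.TimeUnit;

public class TimeoutUnitFix {
    /** Stand-in for ThreadUtils.executeTasks: awaits task completion
     *  for timeoutMs milliseconds before interrupting stragglers. */
    static void executeTasks(Runnable task, long timeoutMs) {
        // ... submit task, then awaitTermination(timeoutMs, TimeUnit.MILLISECONDS) ...
    }

    public static void main(String[] args) {
        long unregisterRequestTimeoutSec = 10;

        // Buggy call: 10 is interpreted as 10 ms, so tasks are interrupted almost immediately.
        executeTasks(() -> {}, unregisterRequestTimeoutSec);

        // Fixed call: convert seconds to milliseconds explicitly.
        executeTasks(() -> {}, TimeUnit.SECONDS.toMillis(unregisterRequestTimeoutSec));
    }
}
```

TimeUnit.SECONDS.toMillis makes the intended unit visible at the call site, which is why second/millisecond mix-ups like this one are usually fixed with an explicit conversion rather than a bare multiplication.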
There are still conflicts left to be resolved.
Any progress? ping ~ @EnricoMi @zuston
LGTM. PTAL @rickyma
LGTM
@zuston ready for approval?
Merged. Thanks @EnricoMi @rickyma