
Fix: Improve shuffle unregister timeouts

EnricoMi opened this pull request 1 year ago • 8 comments

What changes were proposed in this pull request?

Use the spark.rss.client.unregister.request.timeout.sec timeout for the individual gRPC calls and introduce an overall spark.rss.client.unregister.timeout.sec.

Why are the changes needed?

When unregistering with many shuffle servers and a relatively small task pool, some requests are executed sequentially, but the current per-request timeout is applied to all requests together.

Example: 100 shuffle servers and 10 threads. Completion of all tasks currently times out after 10 seconds, giving each request only 1 second (even though the per-request timeout is 10 s).
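The arithmetic above can be sketched as follows (the numbers are the ones from the example; this is an illustration, not uniffle code):

```java
public class TimeoutBudget {
    public static void main(String[] args) {
        int servers = 100;            // shuffle servers to unregister from
        int threads = 10;             // task pool size
        long perRequestTimeoutSec = 10;

        // When the pool is smaller than the server count, each thread
        // processes several requests sequentially:
        int requestsPerThread = (servers + threads - 1) / threads;

        // If the per-request timeout is applied to the whole batch, each
        // sequential request effectively gets only a fraction of it:
        long effectivePerRequestSec = perRequestTimeoutSec / requestsPerThread;

        System.out.println("requests per thread: " + requestsPerThread);       // 10
        System.out.println("effective budget: " + effectivePerRequestSec + "s"); // 1s
    }
}
```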

Fix: # (issue)

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Not tested.

EnricoMi avatar May 23 '24 09:05 EnricoMi

Codecov Report

Attention: Patch coverage is 39.39394%, with 40 lines in your changes missing coverage. Please review.

Project coverage is 53.39%. Comparing base (6f6d35a) to head (4a84b98). Report is 34 commits behind head on master.

Files Patch % Lines
...he/uniffle/client/impl/ShuffleWriteClientImpl.java 51.11% 19 Missing and 3 partials :warning:
...ffle/client/impl/grpc/ShuffleServerGrpcClient.java 0.00% 12 Missing :warning:
...nt/request/RssUnregisterShuffleByAppIdRequest.java 0.00% 3 Missing :warning:
...le/client/request/RssUnregisterShuffleRequest.java 0.00% 3 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1738      +/-   ##
============================================
- Coverage     54.86%   53.39%   -1.47%     
+ Complexity     2358     2201     -157     
============================================
  Files           368      354      -14     
  Lines         16379    15230    -1149     
  Branches       1504     1386     -118     
============================================
- Hits           8986     8132     -854     
+ Misses         6862     6617     -245     
+ Partials        531      481      -50     


codecov-commenter avatar May 23 '24 09:05 codecov-commenter

Test Results

 2 433 files  ± 0   2 433 suites  ±0   4h 59m 58s :stopwatch: - 1m 28s
   936 tests  + 2     935 :white_check_mark: + 2    1 :zzz: ±0   0 :x: ±0
10 856 runs   +28  10 842 :white_check_mark: +28   14 :zzz: ±0   0 :x: ±0

Results for commit 3c99f121. ± Comparison against base commit 42933f47.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar May 23 '24 10:05 github-actions[bot]

This PR is more reasonable. But I'm worried that the new total timeout could be too large, causing the driver to take too long to stop. Could that happen?

rickyma avatar May 23 '24 10:05 rickyma

This PR is more reasonable. But I'm worried that the new total timeout could be too large, causing the driver to take too long to stop. Could that happen?

Right, but I would say this is a consequence of the configuration (timeout per shuffle service and concurrency).

Users can increase concurrency easily. These threads are cheap as they primarily wait for the response.

EnricoMi avatar May 23 '24 10:05 EnricoMi

causing the driver to take too long to stop

only if individual requests take very long or timeout...

EnricoMi avatar May 23 '24 10:05 EnricoMi

only if individual requests take very long or timeout...

Yeah, should we add an upper bound on the total timeout for this situation? E.g. total timeout <= 3 min, something like that?

rickyma avatar May 23 '24 11:05 rickyma

Alright, let's introduce spark.rss.client.unregister.timeout.sec, which takes over the role spark.rss.client.unregister.request.timeout.sec currently plays, and have spark.rss.client.unregister.request.timeout.sec apply to an individual request (as its name implies).

Let's further log a warning if spark.rss.client.unregister.timeout.sec is too small.

Unfortunately, configuring the relevant values is only supported by Spark clients, so the warning will be pointless for other clients.
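A minimal sketch of what such a warning check could look like; the method and parameter names here are hypothetical, not the actual uniffle implementation:

```java
public class UnregisterTimeouts {
    /**
     * Returns true when the overall unregister budget cannot cover the
     * worst case, i.e. every sequential request on a thread hitting its
     * per-request timeout.
     */
    static boolean overallTimeoutTooSmall(
            long requestTimeoutSec, long overallTimeoutSec, int servers, int poolThreads) {
        long requestsPerThread = (servers + poolThreads - 1) / poolThreads;
        long worstCaseSec = requestTimeoutSec * requestsPerThread;
        return overallTimeoutSec < worstCaseSec;
    }

    public static void main(String[] args) {
        // 100 servers, 10 threads, 10s per request -> worst case 100s overall
        System.out.println(overallTimeoutTooSmall(10, 10, 100, 10));  // true -> warn
        System.out.println(overallTimeoutTooSmall(10, 100, 100, 10)); // false
    }
}
```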

EnricoMi avatar May 23 '24 13:05 EnricoMi

So I am not sure why we are seeing those interrupt exceptions. Either they indicate timeouts, or something else is causing the interrupt.

Also, the reasons are still unclear. We need to know what caused this.

rickyma avatar May 23 '24 14:05 rickyma

So I am not sure why we are seeing those interrupt exceptions. Either they indicate timeouts, or something else is causing the interrupt.

Also, the reasons are still unclear. We need to know what caused this.

@rickyma I have found the root cause:

https://github.com/apache/incubator-uniffle/blob/eb164a8fa79cf70dcdbb13774686663cf11bc8af/client/src/main/java/org/apache/uniffle/client/impl/ShuffleWriteClientImpl.java#L977-L995

Method ThreadUtils.executeTasks is called with a timeout of unregisterRequestTimeSec, which is clearly in seconds, whereas ThreadUtils.executeTasks expects milliseconds. As a result, executeTasks returns after 10 ms (the default) and kills the ongoing tasks.

Fixed in #1766.
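The unit mismatch can be illustrated with a standalone sketch (this is not the actual ShuffleWriteClientImpl code, just the conversion it was missing):

```java
import java.util.concurrent.TimeUnit;

public class TimeoutUnits {
    public static void main(String[] args) {
        long unregisterRequestTimeSec = 10;  // config value, in seconds

        // Buggy call: the seconds value is passed where milliseconds are
        // expected, so the task pool gets only 10 ms before interruption.
        long buggyTimeoutMs = unregisterRequestTimeSec;

        // Fix: convert explicitly before passing the value on.
        long fixedTimeoutMs = TimeUnit.SECONDS.toMillis(unregisterRequestTimeSec);

        System.out.println(buggyTimeoutMs);  // 10
        System.out.println(fixedTimeoutMs);  // 10000
    }
}
```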

EnricoMi avatar Jun 04 '24 07:06 EnricoMi

There are still conflicts left to be resolved.

rickyma avatar Jun 04 '24 12:06 rickyma

Any progress? ping ~ @EnricoMi @zuston

rickyma avatar Jun 11 '24 08:06 rickyma

Any progress? ping ~ @EnricoMi @zuston

LGTM. PTAL @rickyma

zuston avatar Jun 11 '24 11:06 zuston

LGTM

rickyma avatar Jun 11 '24 11:06 rickyma

@zuston ready for approval?

EnricoMi avatar Jun 12 '24 05:06 EnricoMi

Merged. Thanks @EnricoMi @rickyma

zuston avatar Jun 12 '24 05:06 zuston