incubator-uniffle icon indicating copy to clipboard operation
incubator-uniffle copied to clipboard

Introduce startup-silent-period mechanism to avoid partial assignments

Open zuston opened this issue 3 years ago • 4 comments

What changes were proposed in this pull request?

Introduce startup-silent-period mechanism to avoid partial assignments

Why are the changes needed?

When changing some coordinator's conf and then restart, coordinator will accept client getAssignment request immediately, but it will serve for jobs request based on the partial registered shuffle-servers, which will make some jobs gotten not enough required shuffle-servers and then slow the running speed.

I think we should make coordinator wait for more than one shuffle-server heartbeat interval before serving for client. During out-of-service, requests from client will fallback to slave coordinator.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

UTs

zuston avatar Sep 28 '22 05:09 zuston

PTAL @jerqi

zuston avatar Sep 28 '22 05:09 zuston

Codecov Report

Merging #247 (63af987) into master (c89f95c) will increase coverage by 0.02%. The diff coverage is 66.66%.

@@             Coverage Diff              @@
##             master     #247      +/-   ##
============================================
+ Coverage     59.16%   59.18%   +0.02%     
- Complexity     1340     1343       +3     
============================================
  Files           163      163              
  Lines          8810     8837      +27     
  Branches        833      835       +2     
============================================
+ Hits           5212     5230      +18     
- Misses         3332     3339       +7     
- Partials        266      268       +2     
Impacted Files Coverage Δ
...he/uniffle/coordinator/CoordinatorGrpcService.java 2.31% <0.00%> (-0.03%) :arrow_down:
...ache/uniffle/coordinator/SimpleClusterManager.java 86.61% <53.33%> (-4.46%) :arrow_down:
...rg/apache/uniffle/coordinator/CoordinatorConf.java 97.26% <100.00%> (+0.20%) :arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov-commenter avatar Sep 28 '22 06:09 codecov-commenter

Is startup-silent-period a good name? What's name of similar mechanism of Yarn?

jerqi avatar Sep 29 '22 03:09 jerqi

Is startup-silent-period a good name? What's name of similar mechanism of Yarn?

No such config in Yarn. Safe mode is for HDFS. I think it's not proper in Uniffle scenarios

Do u have any ideas?

zuston avatar Sep 29 '22 03:09 zuston

Is startup-silent-period a good name? What's name of similar mechanism of Yarn?

No such config in Yarn. Safe mode is for HDFS. I think it's not proper in Uniffle scenarios

Do u have any ideas?

It's ok for Uniffle. I don't have another good ideas.

jerqi avatar Oct 09 '22 06:10 jerqi

It's ok for Uniffle. I don't have another good ideas.

Got it. I will rename to safemode.

zuston avatar Oct 09 '22 06:10 zuston

It's ok for Uniffle. I don't have another good ideas.

Got it. I will rename to safemode.

I means that startup-silent-period is ok.

jerqi avatar Oct 09 '22 07:10 jerqi

OK. Do u have any other advice? @jerqi

zuston avatar Oct 09 '22 08:10 zuston

Updated.

  1. Add the doc of these configs
  2. Explain why make the startup-slient-period stop default.

zuston avatar Oct 10 '22 02:10 zuston

Error:  Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.998 s <<< FAILURE! - in org.apache.uniffle.test.CoordinatorGrpcTest
Error:  rpcMetricsTest  Time elapsed: 1.035 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <0.0> but was: <1.0>

CI failed. https://github.com/apache/incubator-uniffle/actions/runs/3216468616/jobs/5258387024 #244

Rerun it.

zuston avatar Oct 10 '22 02:10 zuston

Unrelated flaky test failed, I will rerun the flaky test

jerqi avatar Oct 10 '22 02:10 jerqi