incubator-uniffle
incubator-uniffle copied to clipboard
Introduce startup-silent-period mechanism to avoid partial assignments
What changes were proposed in this pull request?
Introduce startup-silent-period mechanism to avoid partial assignments
Why are the changes needed?
When changing some coordinator's conf and then restart, coordinator will accept client getAssignment request immediately, but it will serve for jobs request based on the partial registered shuffle-servers, which will make some jobs gotten not enough required shuffle-servers and then slow the running speed.
I think we should make coordinator wait for more than one shuffle-server heartbeat interval before serving for client. During out-of-service, requests from client will fallback to slave coordinator.
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
UTs
PTAL @jerqi
Codecov Report
Merging #247 (63af987) into master (c89f95c) will increase coverage by
0.02%. The diff coverage is66.66%.
@@ Coverage Diff @@
## master #247 +/- ##
============================================
+ Coverage 59.16% 59.18% +0.02%
- Complexity 1340 1343 +3
============================================
Files 163 163
Lines 8810 8837 +27
Branches 833 835 +2
============================================
+ Hits 5212 5230 +18
- Misses 3332 3339 +7
- Partials 266 268 +2
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...he/uniffle/coordinator/CoordinatorGrpcService.java | 2.31% <0.00%> (-0.03%) |
:arrow_down: |
| ...ache/uniffle/coordinator/SimpleClusterManager.java | 86.61% <53.33%> (-4.46%) |
:arrow_down: |
| ...rg/apache/uniffle/coordinator/CoordinatorConf.java | 97.26% <100.00%> (+0.20%) |
:arrow_up: |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
Is startup-silent-period a good name? What's name of similar mechanism of Yarn?
Is
startup-silent-perioda good name? What's name of similar mechanism of Yarn?
No such config in Yarn. Safe mode is for HDFS. I think it's not proper in Uniffle scenarios
Do u have any ideas?
Is
startup-silent-perioda good name? What's name of similar mechanism of Yarn?No such config in Yarn.
Safe modeis for HDFS. I think it's not proper in Uniffle scenariosDo u have any ideas?
It's ok for Uniffle. I don't have another good ideas.
It's ok for Uniffle. I don't have another good ideas.
Got it. I will rename to safemode.
It's ok for Uniffle. I don't have another good ideas.
Got it. I will rename to safemode.
I means that startup-silent-period is ok.
OK. Do u have any other advice? @jerqi
Updated.
- Add the doc of these configs
- Explain why make the startup-slient-period stop default.
Error: Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.998 s <<< FAILURE! - in org.apache.uniffle.test.CoordinatorGrpcTest
Error: rpcMetricsTest Time elapsed: 1.035 s <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <0.0> but was: <1.0>
CI failed. https://github.com/apache/incubator-uniffle/actions/runs/3216468616/jobs/5258387024 #244
Rerun it.
Unrelated flaky test failed, I will rerun the flaky test