incubator-uniffle
incubator-uniffle copied to clipboard
[#1398] fix(mr,tez): Make attempId computable and move it to taskAttemptId in BlockId layout.
What changes were proposed in this pull request?
Before this PR, in MR and TEZ engine:
- attemptId is in sequenceNo of BlockId instead of taskAttemptId.
- taskAttemptId is long which is not necessary instead of int.
- attempId is fixed 6 bit.
After this PR:
- attemptId is in taskAttemptId. This is more reasonable.
- taskAttemptId is changed to int.
- attempId is calculated from max num of allowed failures and whether speculative execution is enabled.
Why are the changes needed?
Fix: #1398
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing UT and integrated tests.
Codecov Report
Attention: Patch coverage is 79.10448%
with 14 lines
in your changes are missing coverage. Please review.
Project coverage is 54.88%. Comparing base (
dd67774
) to head (bc5585d
). Report is 4 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #1418 +/- ##
============================================
+ Coverage 54.01% 54.88% +0.87%
- Complexity 2863 2868 +5
============================================
Files 438 418 -20
Lines 24850 22552 -2298
Branches 2114 2120 +6
============================================
- Hits 13423 12378 -1045
+ Misses 10586 9406 -1180
+ Partials 841 768 -73
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@jerqi Could you please provide suggestions on areas that need improvement?
cc @zhengchenyu could you help review this ?
Sine https://github.com/apache/incubator-uniffle/pull/1529 is merged into master, I think we should review this PR? After this PR, the blockid calculation for spark, mr, tez will remain consistent. Then we will reduce the probability of overflow problems.
@qijiale76 Can you reconstruct the code according to https://github.com/apache/incubator-uniffle/pull/1529?
@qijiale76 Do you want to push this forward?
@qijiale76 Do you want to push this forward?
Yes, I’ll reconstruct the code next week.
Test Results
2 657 files +10 2 657 suites +10 5h 30m 55s :stopwatch: + 4m 6s 946 tests ± 0 944 :white_check_mark: ± 0 1 :zzz: ±0 0 :x: ±0 1 :fire: ±0 11 799 runs +20 11 783 :white_check_mark: +20 15 :zzz: ±0 0 :x: ±0 1 :fire: ±0
For more details on these errors, see this check.
Results for commit a543fd2f. ± Comparison against base commit e0a2e3db.
This pull request removes 2 and adds 2 tests. Note that renamed tests count towards both.
org.apache.uniffle.shuffle.manager.RssShuffleManagerBaseTest ‑ testGetAttemptIdBits
org.apache.uniffle.shuffle.manager.RssShuffleManagerBaseTest ‑ testGetMaxAttemptNo
org.apache.uniffle.client.ClientUtilsTest ‑ testGetMaxAttemptNo
org.apache.uniffle.client.ClientUtilsTest ‑ testGetNumberOfSignificantBits
:recycle: This comment has been updated with latest results.
@EnricoMi Thanks for your very helpful review. I have updated this PR based on your suggestions and by referring to Spark's implementation. Could you please review the latest code again?
@EnricoMi, thank you very much for your reviews. I have reverted taskAttemptIds
to long
in the MR, Tez, and Spark engine clients. Could you please review this PR again?
@jerqi I still think taskattemptid should be int. Since we've limited taskattemptid to no more than 32 bits, int is enough. For spark, taskattemptid has changed, now does not comes from TaskContext::taskAttemptId.
See the code: https://github.com/apache/incubator-uniffle/blob/f7c6d2da237bd487d3cd0e21231108df90559cbe/common/src/main/java/org/apache/uniffle/common/util/BlockIdLayout.java#L69
Code paths that work fine with long
taskAttemptIds should show that via using long
, paths that are limited to int
ids should indicate that using int
. Then you know what the code supports and which code has limitations.
LGTM+1