[CELEBORN-1982] Slot Selection Perf Improvements
What changes were proposed in this pull request?
After profiling to see where the hotspots are for slot selection, we identified 2 main areas:
- iter.remove (link) is a major hotspot, especially if partitionIdList is massive - since it is an ArrayList and we are removing from the begining - resulting in O(n) deletion costs.
haveDiskis computed per partitionId, iterated across all workers. We precompute this and store it as a field inWorkerInfo.
See the below flamegraph for the hotspot of iter.remove (oop_disjoint_arraycopy) after running a benchmark.
Below is what we actually observed in production which matches with the above observation from the benchmark:
Why are the changes needed?
speed up slot selection performance in the case of large partitionIds
Does this PR introduce any user-facing change?
no
How was this patch tested?
After applying the above changes, we can see the hotspot is removed in the flamegraph:
Benchmarks: Without changes:
# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection
# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration 1: 2060198.745 ±(99.9%) 306976.270 us/op
# Warmup Iteration 2: 1137534.950 ±(99.9%) 72065.776 us/op
# Warmup Iteration 3: 1032434.221 ±(99.9%) 59585.256 us/op
# Warmup Iteration 4: 903621.382 ±(99.9%) 41542.172 us/op
# Warmup Iteration 5: 921816.398 ±(99.9%) 44025.884 us/op
Iteration 1: 853276.360 ±(99.9%) 13285.688 us/op
Iteration 2: 865183.111 ±(99.9%) 9691.856 us/op
Iteration 3: 909971.254 ±(99.9%) 10201.037 us/op
Iteration 4: 874154.240 ±(99.9%) 11287.538 us/op
Iteration 5: 907655.363 ±(99.9%) 11893.789 us/op
Result "org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
882048.066 ±(99.9%) 98360.936 us/op [Average]
(min, avg, max) = (853276.360, 882048.066, 909971.254), stdev = 25544.023
CI (99.9%): [783687.130, 980409.001] (assumes normal distribution)
# Run complete. Total time: 00:05:43
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
SlotsAllocatorBenchmark.benchmarkSlotSelection avgt 5 882048.066 ± 98360.936 us/op
Process finished with exit code 0
With changes:
# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection
# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration 1: 305437.719 ±(99.9%) 81860.733 us/op
# Warmup Iteration 2: 137498.811 ±(99.9%) 7669.102 us/op
# Warmup Iteration 3: 129355.869 ±(99.9%) 5030.972 us/op
# Warmup Iteration 4: 135311.734 ±(99.9%) 6964.080 us/op
# Warmup Iteration 5: 131013.323 ±(99.9%) 8560.232 us/op
Iteration 1: 133695.396 ±(99.9%) 3713.684 us/op
Iteration 2: 143735.961 ±(99.9%) 5858.078 us/op
Iteration 3: 135619.704 ±(99.9%) 5257.352 us/op
Iteration 4: 128806.160 ±(99.9%) 4541.790 us/op
Iteration 5: 134179.546 ±(99.9%) 5137.425 us/op
Result "org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
135207.354 ±(99.9%) 20845.544 us/op [Average]
(min, avg, max) = (128806.160, 135207.354, 143735.961), stdev = 5413.522
CI (99.9%): [114361.809, 156052.898] (assumes normal distribution)
# Run complete. Total time: 00:05:29
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
SlotsAllocatorBenchmark.benchmarkSlotSelection avgt 5 135207.354 ± 20845.544 us/op
Process finished with exit code 0
882048.066 us/ops without changes vs 135207.354 us/op with changes. That is about 6.5x improvement.
Quality Gate passed
Issues
0 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code
Closing due to conflicts after manual updates to package versions .Will regenerate
OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting @dependabot ignore this major version or @dependabot ignore this minor version.
If you change your mind, just re-open this PR and I'll resolve any conflicts on it.