celeborn [CELEBORN-1982] Slot Selection Perf Improvements

What changes were proposed in this pull request?

After profiling to see where the hotspots are for slot selection, we identified 2 main areas:

iter.remove (link) is a major hotspot, especially if partitionIdList is massive - since it is an ArrayList and we are removing from the begining - resulting in O(n) deletion costs.
haveDisk is computed per partitionId, iterated across all workers. We precompute this and store it as a field in WorkerInfo.

See the below flamegraph for the hotspot of iter.remove (oop_disjoint_arraycopy) after running a benchmark.

Screenshot 2025-04-24 at 12 58 34 AM

Below is what we actually observed in production which matches with the above observation from the benchmark: realprodflamegraph

Why are the changes needed?

speed up slot selection performance in the case of large partitionIds

Does this PR introduce any user-facing change?

no

How was this patch tested?

After applying the above changes, we can see the hotspot is removed in the flamegraph: Screenshot 2025-04-24 at 12 53 24 AM

Benchmarks: Without changes:

# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection

# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration   1: 2060198.745 ±(99.9%) 306976.270 us/op
# Warmup Iteration   2: 1137534.950 ±(99.9%) 72065.776 us/op
# Warmup Iteration   3: 1032434.221 ±(99.9%) 59585.256 us/op
# Warmup Iteration   4: 903621.382 ±(99.9%) 41542.172 us/op
# Warmup Iteration   5: 921816.398 ±(99.9%) 44025.884 us/op
Iteration   1: 853276.360 ±(99.9%) 13285.688 us/op
Iteration   2: 865183.111 ±(99.9%) 9691.856 us/op
Iteration   3: 909971.254 ±(99.9%) 10201.037 us/op
Iteration   4: 874154.240 ±(99.9%) 11287.538 us/op
Iteration   5: 907655.363 ±(99.9%) 11893.789 us/op


Result "org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
  882048.066 ±(99.9%) 98360.936 us/op [Average]
  (min, avg, max) = (853276.360, 882048.066, 909971.254), stdev = 25544.023
  CI (99.9%): [783687.130, 980409.001] (assumes normal distribution)


# Run complete. Total time: 00:05:43

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                       Mode  Cnt       Score       Error  Units
SlotsAllocatorBenchmark.benchmarkSlotSelection  avgt    5  882048.066 ± 98360.936  us/op

Process finished with exit code 0

With changes:

# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection

# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration   1: 305437.719 ±(99.9%) 81860.733 us/op
# Warmup Iteration   2: 137498.811 ±(99.9%) 7669.102 us/op
# Warmup Iteration   3: 129355.869 ±(99.9%) 5030.972 us/op
# Warmup Iteration   4: 135311.734 ±(99.9%) 6964.080 us/op
# Warmup Iteration   5: 131013.323 ±(99.9%) 8560.232 us/op
Iteration   1: 133695.396 ±(99.9%) 3713.684 us/op
Iteration   2: 143735.961 ±(99.9%) 5858.078 us/op
Iteration   3: 135619.704 ±(99.9%) 5257.352 us/op
Iteration   4: 128806.160 ±(99.9%) 4541.790 us/op
Iteration   5: 134179.546 ±(99.9%) 5137.425 us/op


Result "org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
  135207.354 ±(99.9%) 20845.544 us/op [Average]
  (min, avg, max) = (128806.160, 135207.354, 143735.961), stdev = 5413.522
  CI (99.9%): [114361.809, 156052.898] (assumes normal distribution)


# Run complete. Total time: 00:05:29

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                       Mode  Cnt       Score       Error  Units
SlotsAllocatorBenchmark.benchmarkSlotSelection  avgt    5  135207.354 ± 20845.544  us/op

Process finished with exit code 0

882048.066 us/ops without changes vs 135207.354 us/op with changes. That is about 6.5x improvement.

Apr 24 '25 07:04 akpatnam25

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Apr 25 '25 10:04 sonarqubecloud[bot]

Closing due to conflicts after manual updates to package versions .Will regenerate

Sep 22 '25 11:09 arun-s-aot

OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting @dependabot ignore this major version or @dependabot ignore this minor version.

If you change your mind, just re-open this PR and I'll resolve any conflicts on it.

Sep 22 '25 11:09 dependabot[bot]