
[CELEBORN-2191] Added factor of number of disks per group when calculating allocation ratio in LoadAwareSlot allocation

Open · AmandeepSingh285 opened this pull request · 5 comments

What changes were proposed in this pull request?

Updated Celeborn's load-aware slot allocation algorithm. The current implementation does not account for groups that contain fewer disks than others, which results in a higher number of slots being allocated per disk in those smaller groups.

```java
int groupSizeSize = (int) Math.ceil(usableDisks.size() / (double) diskGroupCount);
for (int i = 0; i < diskGroupCount; i++) {
  List<DiskInfo> diskList = new ArrayList<>();
  if (startIndex >= usableDisks.size()) {
    continue;
  }
  if (startIndex + groupSizeSize <= diskCount) {
    diskList.addAll(usableDisks.subList(startIndex, startIndex + groupSizeSize));
    startIndex += groupSizeSize;
  } else {
    diskList.addAll(usableDisks.subList(startIndex, usableDisks.size()));
    startIndex = usableDisks.size();
  }
  diskGroups.add(diskList);
}
```
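As a concrete illustration (a minimal sketch of the grouping behavior, not the actual Celeborn code, with hypothetical names): with 9 usable disks and 5 groups, `groupSizeSize = ceil(9 / 5) = 2`, so the loop produces group sizes `[2, 2, 2, 2, 1]` and the slowest (last) group ends up smaller than the others.

```java
import java.util.Arrays;

// Sketch of the grouping arithmetic only: how many disks land in each group
// when disks are taken in fixed-size chunks of ceil(diskCount / groupCount).
public class DiskGroupingSketch {
    // Returns the size of each group produced by the chunking loop above.
    static int[] groupSizes(int diskCount, int diskGroupCount) {
        int groupSize = (int) Math.ceil(diskCount / (double) diskGroupCount);
        int[] sizes = new int[diskGroupCount];
        int startIndex = 0;
        for (int i = 0; i < diskGroupCount; i++) {
            if (startIndex >= diskCount) {
                continue; // no disks left; this group stays empty
            }
            int end = Math.min(startIndex + groupSize, diskCount);
            sizes[i] = end - startIndex;
            startIndex = end;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 9 disks into 5 groups: the last (slowest) group gets only 1 disk.
        System.out.println(Arrays.toString(groupSizes(9, 5))); // [2, 2, 2, 2, 1]
    }
}
```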

Due to the above disk placement strategy, the slowest group can contain fewer disks than the faster groups. As a result, each disk in the slower group may be allocated more slots than each disk in the faster groups.

This PR uses a decaying weighted average that factors in the number of workers per group when calculating the allocationRatio value, so that slots are allocated to disks with both disk speed and group size taken into account.
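The summary does not spell out the exact formula, but the reported ratios are consistent with scaling each group's gradient weight by `diskCountInGroup^diskCountWeight` before normalizing. The sketch below reproduces the reported numbers under that assumption; it uses hypothetical helper names and is not the actual patch.

```java
// One plausible reading of the proposed change, consistent with the numbers
// in the PR summary (an assumption, NOT the exact Celeborn patch).
public class DiskCountWeightSketch {
    // Group 0 is the fastest group. Each group's speed weight
    // (1 + gradient)^(n - 1 - i) is scaled by diskCount^diskCountWeight,
    // then the weights are normalized into allocation ratios.
    static double[] allocationRatios(double gradient, int[] disksPerGroup,
                                     double diskCountWeight) {
        int n = disksPerGroup.length;
        double[] ratios = new double[n];
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double speedWeight = Math.pow(1 + gradient, n - 1 - i);
            ratios[i] = speedWeight * Math.pow(disksPerGroup[i], diskCountWeight);
            sum += ratios[i];
        }
        for (int i = 0; i < n; i++) {
            ratios[i] /= sum;
        }
        return ratios;
    }

    public static void main(String[] args) {
        // 9 workers in 5 groups -> [2, 2, 2, 2, 1]; gradient 0.1, weight 1.
        double[] r = allocationRatios(0.1, new int[] {2, 2, 2, 2, 1}, 1.0);
        // Close to the proposed-implementation ratios reported in the summary:
        // [0.2612, 0.2373, 0.2159, 0.1963, 0.0891]
        for (double v : r) {
            System.out.printf("%.4f%n", v);
        }
    }
}
```

With `diskCountWeight = 0` the disk-count factor becomes 1 for every group, which reduces to the pure gradient-based ratios of the current implementation.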

Why are the changes needed?

Improves the load-aware allocation algorithm by taking the number of disks per group into account.

Test configuration:

- Number of disk groups: 5
- Disk group gradient: 0.1
- Weight for average flush time: 1
- Weight for average fetch time: 1
- Number of Celeborn workers in the cluster: 9

Current implementation:

currentAllocationRatio = [0.2398, 0.2179, 0.1982, 0.1802, 0.1637]
Job output (5000 partitions): slot allocation per group = [1199, 1090, 991, 901, 819]
Number of workers per group = [2, 2, 2, 2, 1]

Number of slots allocated per worker = [600, 545, 496, 451, 819]. The highest number of slots is allocated to the slowest worker.

Proposed implementation (with celeborn.master.slot.assign.loadAware.diskCountWeight set to 1):

currentAllocationRatio = [0.2612, 0.2373, 0.2159, 0.1963, 0.0891]
Job output (5000 partitions): slots allocated per group = [1306, 1187, 1079, 982, 446]
Number of slots per worker = [653, 594, 540, 491, 446]

The fewest slots are now allocated to the slowest worker.
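The per-worker numbers in both scenarios can be checked by dividing each group's slot count by its worker count and rounding to the nearest integer; a small sketch with hypothetical names:

```java
import java.util.Arrays;

// Reproduces the per-worker arithmetic from the summary: each group's slot
// count divided by the number of workers in that group, rounded to nearest.
public class SlotsPerWorkerCheck {
    static int[] perWorker(int[] groupSlots, int[] workersPerGroup) {
        int[] out = new int[groupSlots.length];
        for (int i = 0; i < groupSlots.length; i++) {
            out[i] = (int) Math.round(groupSlots[i] / (double) workersPerGroup[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] workers = {2, 2, 2, 2, 1};
        // Current implementation: the lone worker in the slowest group
        // receives the most slots of any worker.
        System.out.println(Arrays.toString(
            perWorker(new int[] {1199, 1090, 991, 901, 819}, workers)));
        // prints [600, 545, 496, 451, 819]

        // Proposed (diskCountWeight = 1): the slowest worker now gets the fewest.
        System.out.println(Arrays.toString(
            perWorker(new int[] {1306, 1187, 1079, 982, 446}, workers)));
        // prints [653, 594, 540, 491, 446]
    }
}
```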

Does this PR resolve a correctness bug?

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested in a staging setup.

AmandeepSingh285 · Oct 31 '25

@AmandeepSingh285, thanks for your first contribution. Could you please create a JIRA ticket that refers to https://github.com/apache/celeborn/issues/1053?

SteNicholas · Oct 31 '25

@SteNicholas done - https://issues.apache.org/jira/browse/CELEBORN-2191

AmandeepSingh285 · Nov 03 '25

@AmandeepSingh285 As you can see, the disks are split into groups evenly, so why do you still want to add a disk count weight?

```java
int groupSizeSize = (int) Math.ceil(usableDisks.size() / (double) diskGroupCount);
```

FMX · Nov 04 '25

@FMX all the calculations for the issue are given in the PR summary. Basically, there is a chance of more slots being allocated to the slower disks than to the faster disks. This change also gives service owners more control over the number of slots allocated to each group by adjusting the configured weight.

AmandeepSingh285 · Nov 04 '25

@AmandeepSingh285 Thanks for your enthusiasm for this PR, but its functionality can be replaced by tuning the diskGroupGradient. I added some calculations below to clarify that you just need to tweak your configuration. The algorithm is meant to allocate more workload to faster disk groups, not to worker groups, and your modification changes that original purpose.

For example, we have 9 workers; each worker has only one storage directory.

Gradient 0.1 (total task count 5000):

| | group 1 (slowest) | group 2 | group 3 | group 4 | group 5 (fastest) | total |
|---|---|---|---|---|---|---|
| group weight | 1 | 1.1 | 1.21 | 1.331 | 1.4641 | 6.1051 |
| group allocation ratio | 0.163797481 | 0.180177229 | 0.198194952 | 0.218014447 | 0.239815892 | |
| group allocation result | 818.987404 | 900.8861444 | 990.9747588 | 1090.072235 | 1199.079458 | |

Gradient 0.3 (total task count 5000):

| | group 1 (slowest) | group 2 | group 3 | group 4 | group 5 (fastest) | total |
|---|---|---|---|---|---|---|
| group weight | 1 | 1.3 | 1.69 | 2.197 | 2.8561 | 9.0431 |
| group allocation ratio | 0.110581548 | 0.143756013 | 0.186882817 | 0.242947662 | 0.31583196 | |
| group allocation result | 552.9077418 | 718.7800644 | 934.4140837 | 1214.738309 | 1579.159801 | |
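The arithmetic above can be reproduced as follows: group i (with i = 0 the slowest) gets weight (1 + gradient)^i, the ratios are the normalized weights, and each allocation is ratio × total task count. A minimal sketch with hypothetical names, not Celeborn code:

```java
import java.util.Arrays;

// Reproduces the gradient-only allocation arithmetic from the tables above.
public class GradientRatioSketch {
    // Returns per-group allocations, ordered slowest to fastest:
    // weight_i = (1 + gradient)^i, allocation_i = weight_i / sum * totalTasks.
    static double[] allocations(double gradient, int groups, int totalTasks) {
        double[] w = new double[groups];
        double sum = 0.0;
        for (int i = 0; i < groups; i++) {
            w[i] = Math.pow(1 + gradient, i);
            sum += w[i];
        }
        for (int i = 0; i < groups; i++) {
            w[i] = w[i] / sum * totalTasks;
        }
        return w;
    }

    public static void main(String[] args) {
        // gradient 0.1 -> ~[819.0, 900.9, 991.0, 1090.1, 1199.1]
        System.out.println(Arrays.toString(allocations(0.1, 5, 5000)));
        // gradient 0.3 -> ~[552.9, 718.8, 934.4, 1214.7, 1579.2]
        System.out.println(Arrays.toString(allocations(0.3, 5, 5000)));
    }
}
```

A steeper gradient shifts proportionally more of the 5000 tasks toward the fastest group, which is the tuning FMX suggests instead of adding a disk-count weight.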

FMX · Nov 04 '25