
[CELEBORN-2191] Added factor of number of disks per group when calculating allocation ratio in LoadAwareSlot allocation

Open · AmandeepSingh285 opened this pull request · 5 comments

What changes were proposed in this pull request?

Updated Celeborn's load-aware slot allocation algorithm. The current implementation does not account for groups that contain fewer disks than others, which results in a higher number of slots being allocated per disk in those smaller groups.

```java
int groupSizeSize = (int) Math.ceil(usableDisks.size() / (double) diskGroupCount);
for (int i = 0; i < diskGroupCount; i++) {
  List<DiskInfo> diskList = new ArrayList<>();
  if (startIndex >= usableDisks.size()) {
    continue;
  }
  if (startIndex + groupSizeSize <= diskCount) {
    diskList.addAll(usableDisks.subList(startIndex, startIndex + groupSizeSize));
    startIndex += groupSizeSize;
  } else {
    diskList.addAll(usableDisks.subList(startIndex, usableDisks.size()));
    startIndex = usableDisks.size();
  }
  diskGroups.add(diskList);
}
```
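As a concrete illustration (a minimal sketch of the grouping behavior, not the actual Celeborn code, with hypothetical names): with 9 usable disks and 5 groups, `groupSizeSize = ceil(9 / 5) = 2`, so the loop produces group sizes `[2, 2, 2, 2, 1]` and the slowest (last) group ends up smaller than the others.

```java
import java.util.Arrays;

// Sketch of the grouping arithmetic only: how many disks land in each group
// when disks are taken in fixed-size chunks of ceil(diskCount / groupCount).
public class DiskGroupingSketch {
    // Returns the size of each group produced by the chunking loop above.
    static int[] groupSizes(int diskCount, int diskGroupCount) {
        int groupSize = (int) Math.ceil(diskCount / (double) diskGroupCount);
        int[] sizes = new int[diskGroupCount];
        int startIndex = 0;
        for (int i = 0; i < diskGroupCount; i++) {
            if (startIndex >= diskCount) {
                continue; // no disks left; this group stays empty
            }
            int end = Math.min(startIndex + groupSize, diskCount);
            sizes[i] = end - startIndex;
            startIndex = end;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 9 disks into 5 groups: the last (slowest) group gets only 1 disk.
        System.out.println(Arrays.toString(groupSizes(9, 5))); // [2, 2, 2, 2, 1]
    }
}
```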

Due to the above disk placement strategy, the slowest group can contain fewer disks than the faster groups. As a result, each disk in the slower group may be allocated more slots than each disk in the faster groups.

This PR uses a decaying weighted average that factors in the number of workers per group when calculating the allocationRatio value, so that slots are allocated to disks with both disk speed and group size taken into account.
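The summary does not spell out the exact formula, but the reported ratios are consistent with scaling each group's gradient weight by `diskCountInGroup^diskCountWeight` before normalizing. The sketch below reproduces the reported numbers under that assumption; it uses hypothetical helper names and is not the actual patch.

```java
// One plausible reading of the proposed change, consistent with the numbers
// in the PR summary (an assumption, NOT the exact Celeborn patch).
public class DiskCountWeightSketch {
    // Group 0 is the fastest group. Each group's speed weight
    // (1 + gradient)^(n - 1 - i) is scaled by diskCount^diskCountWeight,
    // then the weights are normalized into allocation ratios.
    static double[] allocationRatios(double gradient, int[] disksPerGroup,
                                     double diskCountWeight) {
        int n = disksPerGroup.length;
        double[] ratios = new double[n];
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double speedWeight = Math.pow(1 + gradient, n - 1 - i);
            ratios[i] = speedWeight * Math.pow(disksPerGroup[i], diskCountWeight);
            sum += ratios[i];
        }
        for (int i = 0; i < n; i++) {
            ratios[i] /= sum;
        }
        return ratios;
    }

    public static void main(String[] args) {
        // 9 workers in 5 groups -> [2, 2, 2, 2, 1]; gradient 0.1, weight 1.
        double[] r = allocationRatios(0.1, new int[] {2, 2, 2, 2, 1}, 1.0);
        // Close to the proposed-implementation ratios reported in the summary:
        // [0.2612, 0.2373, 0.2159, 0.1963, 0.0891]
        for (double v : r) {
            System.out.printf("%.4f%n", v);
        }
    }
}
```

With `diskCountWeight = 0` the disk-count factor becomes 1 for every group, which reduces to the pure gradient-based ratios of the current implementation.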

Why are the changes needed?

Improves the load-aware allocation algorithm by taking the number of disks per group into account.

Test configuration:

- Number of disk groups: 5
- Disk group gradient: 0.1
- Weight for average flush time: 1
- Weight for average fetch time: 1
- Number of Celeborn workers in the cluster: 9

Current implementation:

currentAllocationRatio = [0.2398, 0.2179, 0.1982, 0.1802, 0.1637]
Job output (5000 partitions): slot allocation per group = [1199, 1090, 991, 901, 819]
Number of workers per group = [2, 2, 2, 2, 1]

Number of slots allocated per worker = [600, 545, 496, 451, 819]. The highest number of slots is allocated to the slowest worker.

Proposed implementation (with celeborn.master.slot.assign.loadAware.diskCountWeight set to 1):

currentAllocationRatio = [0.2612, 0.2373, 0.2159, 0.1963, 0.0891]
Job output (5000 partitions): slots allocated per group = [1306, 1187, 1079, 982, 446]
Number of slots per worker = [653, 594, 540, 491, 446]

The fewest slots are now allocated to the slowest worker.
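The per-worker numbers in both scenarios can be checked by dividing each group's slot count by its worker count and rounding to the nearest integer; a small sketch with hypothetical names:

```java
import java.util.Arrays;

// Reproduces the per-worker arithmetic from the summary: each group's slot
// count divided by the number of workers in that group, rounded to nearest.
public class SlotsPerWorkerCheck {
    static int[] perWorker(int[] groupSlots, int[] workersPerGroup) {
        int[] out = new int[groupSlots.length];
        for (int i = 0; i < groupSlots.length; i++) {
            out[i] = (int) Math.round(groupSlots[i] / (double) workersPerGroup[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] workers = {2, 2, 2, 2, 1};
        // Current implementation: the lone worker in the slowest group
        // receives the most slots of any worker.
        System.out.println(Arrays.toString(
            perWorker(new int[] {1199, 1090, 991, 901, 819}, workers)));
        // prints [600, 545, 496, 451, 819]

        // Proposed (diskCountWeight = 1): the slowest worker now gets the fewest.
        System.out.println(Arrays.toString(
            perWorker(new int[] {1306, 1187, 1079, 982, 446}, workers)));
        // prints [653, 594, 540, 491, 446]
    }
}
```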

Does this PR resolve a correctness bug?

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested in a staging setup.

AmandeepSingh285 · Oct 31 '25

@AmandeepSingh285, thanks for your first contribution. Could you please create a JIRA ticket that refers to https://github.com/apache/celeborn/issues/1053?

SteNicholas · Oct 31 '25

@SteNicholas done - https://issues.apache.org/jira/browse/CELEBORN-2191

AmandeepSingh285 · Nov 03 '25

@AmandeepSingh285 As you can see, the disks are split into groups evenly, so why do you still want to add a disk count weight?

```java
int groupSizeSize = (int) Math.ceil(usableDisks.size() / (double) diskGroupCount);
```

FMX · Nov 04 '25

@FMX all the calculations for the issue are given in the PR summary. Basically, there is a chance of more slots being allocated to the slower disks than to the faster disks. This change also gives service owners more control over the number of slots allocated to each group by adjusting the configured weight.

AmandeepSingh285 · Nov 04 '25

@AmandeepSingh285 Thanks for your enthusiasm for this PR, but its functionality can be replaced by tuning the diskGroupGradient. I added some calculations below to clarify that you just need to tweak your configuration. The algorithm is meant to allocate more workload to faster disk groups, not to worker groups, and your modification changes that original purpose.

For example, we have 9 workers; each worker has only one storage directory.

Gradient 0.1 (total task count 5000):

| | group 1 (slowest) | group 2 | group 3 | group 4 | group 5 (fastest) | total |
|---|---|---|---|---|---|---|
| group weight | 1 | 1.1 | 1.21 | 1.331 | 1.4641 | 6.1051 |
| group allocation ratio | 0.163797481 | 0.180177229 | 0.198194952 | 0.218014447 | 0.239815892 | |
| group allocation result | 818.987404 | 900.8861444 | 990.9747588 | 1090.072235 | 1199.079458 | |

Gradient 0.3 (total task count 5000):

| | group 1 (slowest) | group 2 | group 3 | group 4 | group 5 (fastest) | total |
|---|---|---|---|---|---|---|
| group weight | 1 | 1.3 | 1.69 | 2.197 | 2.8561 | 9.0431 |
| group allocation ratio | 0.110581548 | 0.143756013 | 0.186882817 | 0.242947662 | 0.31583196 | |
| group allocation result | 552.9077418 | 718.7800644 | 934.4140837 | 1214.738309 | 1579.159801 | |
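The arithmetic above can be reproduced as follows: group i (with i = 0 the slowest) gets weight (1 + gradient)^i, the ratios are the normalized weights, and each allocation is ratio × total task count. A minimal sketch with hypothetical names, not Celeborn code:

```java
import java.util.Arrays;

// Reproduces the gradient-only allocation arithmetic from the tables above.
public class GradientRatioSketch {
    // Returns per-group allocations, ordered slowest to fastest:
    // weight_i = (1 + gradient)^i, allocation_i = weight_i / sum * totalTasks.
    static double[] allocations(double gradient, int groups, int totalTasks) {
        double[] w = new double[groups];
        double sum = 0.0;
        for (int i = 0; i < groups; i++) {
            w[i] = Math.pow(1 + gradient, i);
            sum += w[i];
        }
        for (int i = 0; i < groups; i++) {
            w[i] = w[i] / sum * totalTasks;
        }
        return w;
    }

    public static void main(String[] args) {
        // gradient 0.1 -> ~[819.0, 900.9, 991.0, 1090.1, 1199.1]
        System.out.println(Arrays.toString(allocations(0.1, 5, 5000)));
        // gradient 0.3 -> ~[552.9, 718.8, 934.4, 1214.7, 1579.2]
        System.out.println(Arrays.toString(allocations(0.3, 5, 5000)));
    }
}
```

A steeper gradient shifts proportionally more of the 5000 tasks toward the fastest group, which is the tuning FMX suggests instead of adding a disk-count weight.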

FMX · Nov 04 '25