[CELEBORN-2191] Added factor of number of disks per group when calculating allocation ratio in LoadAwareSlot allocation
What changes were proposed in this pull request?
Updated the Celeborn load-aware slot allocation algorithm. The current implementation does not account for groups that contain fewer disks, which results in a higher number of slots being allocated per disk in those groups.
```java
int groupSizeSize = (int) Math.ceil(usableDisks.size() / (double) diskGroupCount);
for (int i = 0; i < diskGroupCount; i++) {
  List<DiskInfo> diskList = new ArrayList<>();
  if (startIndex >= usableDisks.size()) {
    continue;
  }
  if (startIndex + groupSizeSize <= diskCount) {
    diskList.addAll(usableDisks.subList(startIndex, startIndex + groupSizeSize));
    startIndex += groupSizeSize;
  } else {
    diskList.addAll(usableDisks.subList(startIndex, usableDisks.size()));
    startIndex = usableDisks.size();
  }
  diskGroups.add(diskList);
}
```
Due to the above disk placement strategy, the slowest group can end up with fewer disks than the faster groups. In that case, each disk in the slower group is allocated more slots than each disk in the faster groups.
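The uneven split can be seen by tracing the ceil-based partitioning above. This is a minimal standalone sketch (illustrative class and method names, not Celeborn code) that reproduces only the group sizes:

```java
import java.util.ArrayList;
import java.util.List;

public class DiskGrouping {
  // Mirrors the quoted partitioning: group size is ceil(diskCount / groupCount),
  // so the last (slowest) group can end up smaller than the others.
  static List<Integer> groupSizes(int diskCount, int diskGroupCount) {
    int groupSize = (int) Math.ceil(diskCount / (double) diskGroupCount);
    List<Integer> sizes = new ArrayList<>();
    int startIndex = 0;
    for (int i = 0; i < diskGroupCount; i++) {
      if (startIndex >= diskCount) {
        continue;
      }
      int end = Math.min(startIndex + groupSize, diskCount);
      sizes.add(end - startIndex);
      startIndex = end;
    }
    return sizes;
  }

  public static void main(String[] args) {
    // 9 single-disk workers split into 5 groups gives [2, 2, 2, 2, 1]:
    // the slowest group holds only one disk.
    System.out.println(groupSizes(9, 5));
  }
}
```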
This PR uses a decaying weighted average that factors in the number of workers per group when calculating the allocationRatio value, so slots are effectively allocated to disks using both the group gradient and the group size.
Why are the changes needed?
Improves allocation in the load-aware algorithm by taking the number of disks per group into account.

Test setup:
- Number of disk groups: 5
- Disk group gradient: 0.1
- Weight for average flush time: 1
- Weight for average fetch time: 1
- Number of Celeborn workers in cluster: 9
Current implementation:
- currentAllocationRatio = [0.2398, 0.2179, 0.1982, 0.1802, 0.1637]
- Job output (5000 partitions): slots allocated per group = [1199, 1090, 991, 901, 819]
- Number of workers per group = [2, 2, 2, 2, 1]
- Number of slots allocated per worker = [600, 545, 496, 451, 819]

The highest number of slots is allocated to the slowest worker.
Proposed implementation (setting `celeborn.master.slot.assign.loadAware.diskCountWeight` to 1):
- currentAllocationRatio = [0.2612, 0.2373, 0.2159, 0.1963, 0.0891]
- Job output (5000 partitions): slots allocated per group = [1306, 1187, 1079, 982, 446]
- Number of slots per worker = [653, 594, 540, 491, 446]

The least number of slots is allocated to the slowest worker.
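The exact change lives in the PR diff; as a hypothetical reconstruction consistent with the ratios reported above, assume each group's base weight `(1 + gradient)^i` is multiplied by `diskCount_i ^ diskCountWeight` and then normalized. The class and method names below are illustrative, not Celeborn APIs:

```java
public class AllocationRatio {
  // Hypothetical sketch: weight for group i (index 0 = slowest) is
  // (1 + gradient)^i scaled by that group's disk count raised to
  // diskCountWeight, then normalized so the ratios sum to 1.
  static double[] ratios(int[] diskCounts, double gradient, double diskCountWeight) {
    double[] weights = new double[diskCounts.length];
    double sum = 0;
    for (int i = 0; i < diskCounts.length; i++) {
      weights[i] = Math.pow(1 + gradient, i) * Math.pow(diskCounts[i], diskCountWeight);
      sum += weights[i];
    }
    for (int i = 0; i < weights.length; i++) {
      weights[i] /= sum;
    }
    return weights;
  }

  public static void main(String[] args) {
    // Disk counts per group, slowest to fastest, from the test cluster: [1, 2, 2, 2, 2].
    double[] r = ratios(new int[] {1, 2, 2, 2, 2}, 0.1, 1.0);
    // Prints ratios that closely match the proposed currentAllocationRatio above
    // (the slowest one-disk group drops to roughly 0.089).
    for (double v : r) {
      System.out.printf("%.4f%n", v);
    }
  }
}
```

With `diskCountWeight = 0` this degenerates to the pure gradient weighting of the current implementation, which is why the knob gives service owners control over how strongly group size affects allocation.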
Does this PR resolve a correctness bug?
Does this PR introduce any user-facing change?
No
How was this patch tested?
Tested in staging setup
@AmandeepSingh285, thanks for your first contribution. Could you please create a JIRA ticket referring to https://github.com/apache/celeborn/issues/1053?
@SteNicholas done - https://issues.apache.org/jira/browse/CELEBORN-2191
@AmandeepSingh285 As you can see, the disks are split into groups evenly, so why do you still want to add a disk count weight?
```java
int groupSizeSize = (int) Math.ceil(usableDisks.size() / (double) diskGroupCount);
```
@FMX all the calculations for the issue are given in the PR summary. Basically, there is a chance of more slots being allocated to the slower disks than to the faster disks. This change also gives service owners more control over the number of slots allocated to each group by adjusting the assigned weight.
@AmandeepSingh285
Thanks for your enthusiasm about this PR, but its functionality can be achieved by tuning the diskGroupGradient. I added some calculations below to clarify that you just need to tweak your configuration. The algorithm is meant to allocate more workload to faster disk groups, not to balance worker groups, while your modification changes the original purpose.
For example, we have 9 workers; each worker has only one storage directory.
Gradient = 0.1, total task count = 5000:

| group (slowest → fastest) | 1 | 2 | 3 | 4 | 5 | sum |
|---|---|---|---|---|---|---|
| group weight (1.1^i) | 1 | 1.1 | 1.21 | 1.331 | 1.4641 | 6.1051 |
| group allocation ratio | 0.163797481 | 0.180177229 | 0.198194952 | 0.218014447 | 0.239815892 | |
| group allocation result | 818.987404 | 900.8861444 | 990.9747588 | 1090.072235 | 1199.079458 | |

Gradient = 0.3, total task count = 5000:

| group (slowest → fastest) | 1 | 2 | 3 | 4 | 5 | sum |
|---|---|---|---|---|---|---|
| group weight (1.3^i) | 1 | 1.3 | 1.69 | 2.197 | 2.8561 | 9.0431 |
| group allocation ratio | 0.110581548 | 0.143756013 | 0.186882817 | 0.242947662 | 0.31583196 | |
| group allocation result | 552.9077418 | 718.7800644 | 934.4140837 | 1214.738309 | 1579.159801 | |
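The figures in these tables follow from normalizing `(1 + gradient)^i` across the groups. A small standalone sketch (illustrative names, not Celeborn code) reproduces the per-group shares:

```java
public class GradientRatios {
  // Allocation ratio for group i (index 0 = slowest): (1 + gradient)^i
  // normalized over all groups, as in the calculations above.
  static double[] gradientRatios(int groups, double gradient) {
    double[] w = new double[groups];
    double sum = 0;
    for (int i = 0; i < groups; i++) {
      w[i] = Math.pow(1 + gradient, i);
      sum += w[i];
    }
    for (int i = 0; i < groups; i++) {
      w[i] /= sum;
    }
    return w;
  }

  public static void main(String[] args) {
    // Slowest group's share of 5000 slots at gradient 0.1: about 819.
    System.out.printf("%.0f%n", gradientRatios(5, 0.1)[0] * 5000);
    // Raising the gradient to 0.3 shrinks it to about 553.
    System.out.printf("%.0f%n", gradientRatios(5, 0.3)[0] * 5000);
  }
}
```

This illustrates the reviewer's point: a larger diskGroupGradient already shifts load away from the slowest group, without introducing a new weight.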