Different SKU support within one VC
What would you like to be added: Different SKU configuration support within one virtual cluster
Why is this needed: Different training/inference tasks may have different GPU/CPU/memory requirements. It would be more flexible to allow different SKUs within one cluster.
Without this feature, how does the current module work: Only one SKU unit is supported per cluster. It works, but it might not be the best usage.
Components that may involve changes: rest-server, hivedscheduler
Can you give some examples of why one SKU per VC cannot fit your demands? In the current design, we only support one SKU per VC. Maybe there is another workaround to solve your issue.
Sometimes we might need both a 1 GPU / 4 CPU / 40 GB MEM SKU and a 1 GPU / 8 CPU / 40 GB MEM SKU in one VC.
Perhaps limited memory is the essence of this feature request.
Suppose I have a machine with 64 CPUs, 8 GPUs, and 128G of memory. Under the current strategy, when running single-GPU jobs, each job can only be allocated 16G of memory (reserved by the system; the actual usable amount is around 15G). As a result, the available host memory is often less than the available GPU memory, which frequently leads to jobs being OOM-killed.
Of course, the machine could be upgraded to 512G of memory, which would reduce many of these OOM-kill events. However, on the one hand, monitoring of the host shows that 128G of memory is completely sufficient, because most jobs have small memory requirements and only a few have large ones. On the other hand, 512G of memory is more expensive than 128G.
Maybe we could design a strategy for memory over-allocation. For example, while the host machine's memory is still sufficient, do not perform an OOM kill; when the host's memory is close to exhaustion, kill the job that consumes the most memory.
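For illustration, here is a minimal sketch of the policy I have in mind. The threshold, function, and field names are made up for this example and are not PAI code; Kubernetes' own node-pressure eviction behaves in a roughly similar way for pods that use more memory than they request.

```js
// Hypothetical sketch of the over-allocation policy described above (not PAI code).
const EVICTION_THRESHOLD_MB = 4 * 1024; // start killing when < 4G of host memory is free

function maybeEvict(hostFreeMemoryMB, jobs) {
  // jobs: [{ name, memoryUsageMB }]
  if (hostFreeMemoryMB > EVICTION_THRESHOLD_MB) {
    return null; // host memory is still sufficient: do not OOM-kill anyone
  }
  // Host memory is nearly exhausted: kill the job using the most memory.
  return jobs.reduce((worst, j) =>
    j.memoryUsageMB > worst.memoryUsageMB ? j : worst
  );
}

// Example: with only 2G free, the 40G job would be the victim.
console.log(
  maybeEvict(2 * 1024, [
    { name: 'job-a', memoryUsageMB: 18 * 1024 },
    { name: 'job-b', memoryUsageMB: 40 * 1024 },
  ])
);
```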
Thanks @JosephKang & @siaimes. In the current design, we fix the SKU size to reduce external fragmentation and to provide a resource guarantee for each VC. Using different SKUs in the same VC, or over-allocating resources, might break our assumptions. We will discuss it first and then get back to you.
Because the new job submission page introduced in v1.7.0 allows user-defined CPU, GPU, and memory, which will cause a lot of fragmentation, I have no desire to upgrade to v1.7.0. I thought you hadn't considered the fragmentation issue, but now I think you may have other considerations?
If we design an over-allocation strategy, can we achieve a balance between fragmentation and resource utilization?
@siaimes For v1.7.0, we only allow users to customize resources when using the default scheduler (which doesn't have the VC concept). If you choose HiveD as the scheduler, users can only select the SKU. The behavior is consistent with the previous version.
Okay, I get it, thank you.
https://github.com/microsoft/pai/blob/5153a4d9a6b8c8a4126122184f65c463ca707c6d/src/rest-server/src/models/v2/job/k8s.js#L488-L491
In order to achieve over-allocation, we only need to change resources.limits.memory to resources.requests.memory. The resources.limits.memory can then be set to a relatively large value, such as max(host memory - 4Gi, requests). In this way, when a container uses more memory than it requests, it will not be OOM-killed; only when the host's memory is almost exhausted will such a pod be OOM-killed.
This should not break the SKU assumption.
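A minimal sketch of what I mean, using hypothetical variables for the SKU memory and host memory (this is not the actual k8s.js code):

```js
// Sketch only (not the actual k8s.js code); skuMemoryMB and hostMemoryMB are
// hypothetical values standing in for what rest-server already knows.
const skuMemoryMB = 16 * 1024;   // memory guaranteed to the job by its SKU
const hostMemoryMB = 128 * 1024; // total memory of the host machine

const resources = {
  requests: {
    // The SKU memory becomes the request, so SKU-based scheduling is unchanged.
    memory: `${skuMemoryMB}Mi`,
  },
  limits: {
    // The limit is raised to roughly the host memory, e.g. max(host - 4Gi, request),
    // so the container can burst above its request without being killed.
    memory: `${Math.max(hostMemoryMB - 4 * 1024, skuMemoryMB)}Mi`,
  },
};

console.log(resources);
```

With requests below limits, the pod falls into Kubernetes' Burstable QoS class, so it only becomes a candidate for OOM kill when the node itself runs out of memory.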