
Implement MPS natively as in Linux

Open · thien-lm opened this issue 1 year ago · 1 comment

How are GPU resources shared by MPS in Linux?

  1. The GPU compute mode is set to EXCLUSIVE_PROCESS, so any process that wants to use the GPU must go through the MPS control daemon and the MPS server (setup sketched below the reference link)
  2. By default, each MPS client process can access up to 100% of the GPU's memory and 100% of its available threads
  3. MPS resources can be limited at the control daemon level, the MPS client level, or the CUDA context level: https://docs.nvidia.com/deploy/mps/#performance

Reference: https://docs.nvidia.com/deploy/mps
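
For context, here is a minimal sketch (mine, not from the plugin) of the usual Linux setup, assuming `nvidia-smi` and `nvidia-cuda-mps-control` are on the PATH and the script runs with sufficient privileges:

```python
# Sketch of enabling MPS on a Linux host: set the GPU to EXCLUSIVE_PROCESS so
# all CUDA work is routed through the MPS server, then start the control daemon.
import subprocess

def enable_mps(gpu_index: int = 0) -> None:
    # Force all CUDA processes on this GPU to attach via MPS.
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-c", "EXCLUSIVE_PROCESS"],
        check=True,
    )
    # Start the MPS control daemon in the background (-d).
    subprocess.run(["nvidia-cuda-mps-control", "-d"], check=True)

if __name__ == "__main__":
    enable_mps(0)
```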

Strategies for provisioning resources in MPS

Reference: https://docs.nvidia.com/deploy/mps/#performance:

  1. A common provisioning strategy is to uniformly partition the available threads equally among the MPS client processes - this is how the NVDP devs implemented MPS
  2. A more optimal strategy is to uniformly partition by half the number of expected clients (i.e., give each client roughly 2 × 100% / n of the active threads)
  3. A near-optimal strategy is to non-uniformly partition the available threads based on the workload of each MPS client (e.g., set the active thread percentage to 30% for client 1 and 70% for client 2 if the ratio of their workloads is 30:70) - this is what I want (see the sketch after this list)
  4. The most optimal strategy is to precisely limit the number of SMs each MPS client may use, given knowledge of each client's execution resource requirements
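
A hedged sketch of strategy 3, assuming each client's workload weight is known up front; the client commands and the 30/70 weights are hypothetical placeholders, not part of the plugin:

```python
# Non-uniform partitioning: launch each MPS client with
# CUDA_MPS_ACTIVE_THREAD_PERCENTAGE proportional to its workload weight.
import os
import subprocess

def launch_clients(clients: dict[str, tuple[list[str], float]]) -> None:
    """clients maps a name to (command, workload_weight)."""
    total = sum(weight for _, weight in clients.values())
    for name, (cmd, weight) in clients.items():
        share = round(100 * weight / total)
        env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(share))
        print(f"{name}: active thread percentage = {share}%")
        subprocess.Popen(cmd, env=env)

if __name__ == "__main__":
    launch_clients({
        "client1": (["./infer_small"], 30.0),   # hypothetical workload
        "client2": (["./train_large"], 70.0),   # hypothetical workload
    })
```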

How does the main branch of the NVIDIA device plugin implement MPS?

  • The NVDP devs just set a hard limit at the control daemon level of 100/n for both memory and threads, where n is the number of replicas (roughly as sketched below)
  • I think this makes MPS quite inconvenient to use
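
A rough sketch of what that daemon-level cap amounts to (my reading of the behaviour, not the plugin's actual Go code; the 4-replica / 80 GiB figures are made-up examples):

```python
# With n replicas, every client is capped at 100/n percent of the threads and
# 1/n of the device memory via commands piped to nvidia-cuda-mps-control.
import subprocess

def set_daemon_level_limits(replicas: int, total_mem_gib: int, device: int = 0) -> None:
    thread_pct = 100 // replicas
    mem_per_replica = total_mem_gib // replicas
    commands = (
        f"set_default_active_thread_percentage {thread_pct}\n"
        f"set_default_device_pinned_mem_limit {device} {mem_per_replica}G\n"
    )
    # nvidia-cuda-mps-control reads control commands from stdin.
    subprocess.run(["nvidia-cuda-mps-control"], input=commands, text=True, check=True)

if __name__ == "__main__":
    set_daemon_level_limits(replicas=4, total_mem_gib=80)  # e.g. 4 replicas on an 80 GiB GPU
```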

My solution

  1. I will remove the 100/n hard limit set at the control daemon level
  2. Instead, I will set the resource limit for each container that uses MPS in Kubernetes through two environment variables: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE and CUDA_MPS_PINNED_DEVICE_MEM_LIMIT (see the sketch after this list)
  3. That way, resource provisioning with MPS in NVDP becomes much more flexible, because each container is given exactly the number of threads and the amount of memory it needs - wouldn't that be nice?
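
For illustration, a minimal sketch of a container spec fragment using those two variables; the container names, image, and the concrete percentages and memory limits are placeholders, and the device-prefixed "0=16G" syntax follows the MPS docs:

```python
# Build a pod-spec fragment where each container carries its own MPS limits
# instead of inheriting a daemon-wide 100/n cap.
import json

def mps_container(name: str, image: str, thread_pct: int, mem_limit: str) -> dict:
    return {
        "name": name,
        "image": image,
        "env": [
            {"name": "CUDA_MPS_ACTIVE_THREAD_PERCENTAGE", "value": str(thread_pct)},
            {"name": "CUDA_MPS_PINNED_DEVICE_MEM_LIMIT", "value": mem_limit},
        ],
        "resources": {"limits": {"nvidia.com/gpu": 1}},
    }

if __name__ == "__main__":
    spec = {"containers": [
        mps_container("client1", "my-cuda-app:latest", 30, "0=16G"),
        mps_container("client2", "my-cuda-app:latest", 70, "0=48G"),
    ]}
    print(json.dumps(spec, indent=2))
```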

thien-lm avatar Jul 08 '24 14:07 thien-lm

This PR is stale because it has been open 90 days with no activity. This PR will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Oct 07 '24 04:10 github-actions[bot]

This pull request was automatically closed due to inactivity.

github-actions[bot] avatar Nov 07 '24 04:11 github-actions[bot]