
Time-slicing with multiple GPUs - asking for ability to block single GPU

Open Alexbay218 opened this issue 2 years ago • 3 comments


1. Issue or feature description

I'd like to be able to configure the scheduler to do the exact opposite of the behavior requested here: https://github.com/NVIDIA/gpu-operator/issues/386

Instead of drawing GPU resources evenly from all GPUs on the node, I'd like a config option to draw from one GPU at a time. This would let some applications get exclusive access to a single GPU when needed, while the rest time-share.

2. Steps to reproduce the issue

  1. Perform a fresh installation of the GPU operator in a cluster where nodes have more than one GPU.
  2. Enable time-slicing and configure it to allow 2 replicas per GPU.
  3. Start a pod that requests 2 units of the GPU extended resource.
  4. The pod should have exclusive access to a single GPU, but instead it has access to two GPUs (as intended by https://github.com/NVIDIA/gpu-operator/issues/386).
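
To make the difference concrete, here is a minimal Python sketch of the two allocation behaviors on a node with 2 GPUs and 2 replicas each. It is only an illustration, not the device plugin's code (which is written in Go). The replica ID format `GPU-0::0` and the function names `build_replicas`, `allocate_distributed`, and `allocate_packed` are assumptions made up for this example.

```python
from collections import Counter


def build_replicas(gpu_ids, replicas_per_gpu):
    """Expand each physical GPU into its time-sliced replica IDs (hypothetical format)."""
    return [f"{gpu}::{i}" for gpu in gpu_ids for i in range(replicas_per_gpu)]


def gpu_of(replica_id):
    """Return the physical GPU that a replica ID belongs to."""
    return replica_id.split("::")[0]


def allocate_distributed(available, size):
    """Spread the request: each pick comes from the GPU used least so far."""
    if size > len(available):
        raise ValueError("not enough replicas available")
    chosen, remaining = [], list(available)
    for _ in range(size):
        used = Counter(gpu_of(r) for r in chosen)
        pick = min(remaining, key=lambda r: used[gpu_of(r)])
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def allocate_packed(available, size):
    """Pack the request: prefer one GPU that can satisfy it on its own."""
    if size > len(available):
        raise ValueError("not enough replicas available")
    by_gpu = {}
    for r in available:
        by_gpu.setdefault(gpu_of(r), []).append(r)
    # Tightest fit: the GPU with the fewest free replicas that still covers the request.
    fitting = [rs for rs in by_gpu.values() if len(rs) >= size]
    if fitting:
        return min(fitting, key=len)[:size]
    # No single GPU is large enough, so drain GPUs one at a time, fullest first.
    chosen = []
    for rs in sorted(by_gpu.values(), key=len, reverse=True):
        chosen.extend(rs[: size - len(chosen)])
        if len(chosen) == size:
            break
    return chosen


replicas = build_replicas(["GPU-0", "GPU-1"], replicas_per_gpu=2)
print(allocate_distributed(replicas, 2))  # one replica from each GPU
print(allocate_packed(replicas, 2))       # both replicas from the same GPU
```

With the packed variant, a pod asking for as many replicas as one GPU exposes ends up alone on that GPU, which is the behavior requested above.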

Alexbay218 avatar Jan 04 '23 17:01 Alexbay218

@klueska does it make sense to introduce knobs (env/args) to control the allocation logic used during GetPreferredAllocation within the device plugin?
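
As a rough sketch of what such a knob could look like, the Python snippet below reads the policy from an environment variable and falls back to the current distributed behavior. The variable name `TIME_SLICING_ALLOCATION_POLICY` and the policy names `distributed` and `packed` are hypothetical; nothing like this exists in the device plugin today as far as this thread shows.

```python
import os

# Hypothetical policy names; "distributed" mirrors the current behavior.
VALID_POLICIES = ("distributed", "packed")

# Hypothetical environment variable, used only for illustration.
POLICY_ENV_VAR = "TIME_SLICING_ALLOCATION_POLICY"


def resolve_policy(env=None, default="distributed"):
    """Return the allocation policy named in the environment, or the default."""
    env = os.environ if env is None else env
    policy = env.get(POLICY_ENV_VAR, default).strip().lower()
    if policy not in VALID_POLICIES:
        raise ValueError(
            f"{POLICY_ENV_VAR}={policy!r} is not one of {VALID_POLICIES}"
        )
    return policy


print(resolve_policy({}))                          # falls back to the default
print(resolve_policy({POLICY_ENV_VAR: "packed"}))  # explicit opt-in
```

Defaulting to the distributed policy would keep existing clusters unchanged unless an operator opts in.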

shivamerla avatar Jan 04 '23 18:01 shivamerla

@shivamerla I believe there are cases where we would want distributed GPU scheduling, but there are also scenarios that call for the opposite. It would be great if this setting could be changed easily, for example through a ConfigMap.

anencore94 avatar Jan 18 '24 09:01 anencore94