Time-slicing with multiple GPUs - asking for ability to block single GPU
1. Issue or feature description
I'm looking for the ability to configure the scheduler to do the exact opposite of the behavior described here: https://github.com/NVIDIA/gpu-operator/issues/386
Instead of allocating GPU resources evenly across all GPUs on the node, I'd like a config option to allocate from one GPU at a time. That would let some applications get exclusive access to a single GPU when they need it, while the remaining workloads continue to time-share the other GPUs.
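To make that concrete, here is a small standalone sketch (not gpu-operator code; the device IDs and grouping are invented for the example) of the selection behavior I'm after: with time-slicing set to 2 replicas per GPU on a 2-GPU node, a request for 2 resources would be satisfied entirely from one physical GPU instead of one replica from each.

```go
// Standalone illustration only: "pack" selects the requested number of
// time-sliced replica IDs from as few physical GPUs as possible, which is
// the opposite of the spread-across-all-GPUs behavior from issue #386.
package main

import (
	"fmt"
	"sort"
)

// pack picks `need` replica IDs, exhausting one GPU's replicas before
// moving on to the next GPU.
func pack(replicasByGPU map[string][]string, need int) []string {
	gpus := make([]string, 0, len(replicasByGPU))
	for gpu := range replicasByGPU {
		gpus = append(gpus, gpu)
	}
	sort.Strings(gpus) // deterministic order for the example

	var picked []string
	for _, gpu := range gpus {
		for _, id := range replicasByGPU[gpu] {
			if len(picked) == need {
				return picked
			}
			picked = append(picked, id)
		}
	}
	return picked
}

func main() {
	// Two physical GPUs, each advertised as 2 time-sliced replicas
	// (the ID naming is made up for this example).
	replicas := map[string][]string{
		"GPU-0": {"GPU-0-replica-0", "GPU-0-replica-1"},
		"GPU-1": {"GPU-1-replica-0", "GPU-1-replica-1"},
	}
	// A request for 2 resources lands entirely on GPU-0:
	fmt.Println(pack(replicas, 2)) // [GPU-0-replica-0 GPU-0-replica-1]
}
```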
2. Steps to reproduce the issue
- Perform a fresh installation of the GPU operator in a cluster whose nodes have more than one GPU.
- Enable time-slicing and configure it to allow 2 replicas per GPU.
- Start a pod that requests 2 of the GPU extended resource.
- The pod should get exclusive access to a single GPU, but instead it gets shared access to two GPUs (as intended by https://github.com/NVIDIA/gpu-operator/issues/386).
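For the third step, this is the kind of pod I'm using, shown here as a Go object for concreteness (an equivalent YAML manifest behaves the same; the pod name, image, and command are just placeholders):

```go
// Repro pod: requests 2 of the nvidia.com/gpu extended resource on a node
// where time-slicing advertises 2 replicas per physical GPU.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func gpuTestPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-test"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:    "cuda-test",
				Image:   "nvidia/cuda:12.2.0-base-ubuntu22.04", // placeholder image
				Command: []string{"nvidia-smi", "-L"},          // lists the GPUs the pod sees
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						// Two time-sliced replicas of the GPU extended resource.
						"nvidia.com/gpu": resource.MustParse("2"),
					},
				},
			}},
		},
	}
}

func main() {
	fmt.Println(gpuTestPod().Spec.Containers[0].Resources.Limits)
}
```

With the current spread behavior, `nvidia-smi -L` inside this pod lists two different physical GPUs; with the packing behavior I'm asking for, it would list only one.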
@klueska does it make sense to introduce knobs (env/args) to control the allocation logic during GetPreferredAllocation within the device plugin?
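Roughly the kind of thing I mean, sketched against the standard kubelet device plugin API (v1beta1); the env var name, the policy values, and the selection logic here are hypothetical, not existing plugin behavior:

```go
// Sketch only: a per-plugin knob (an env var here, but it could equally be a
// CLI arg or a config-file field) that switches GetPreferredAllocation
// between the existing "spread" behavior and a "pack onto one physical GPU"
// behavior.
package example

import (
	"context"
	"os"
	"sort"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

type plugin struct{}

// GetPreferredAllocation tells the kubelet which of the available device IDs
// the plugin would prefer to see allocated for each container request.
func (p *plugin) GetPreferredAllocation(
	ctx context.Context,
	req *pluginapi.PreferredAllocationRequest,
) (*pluginapi.PreferredAllocationResponse, error) {
	// Hypothetical knob: "pack" or "spread" (default).
	policy := os.Getenv("DEVICE_PLUGIN_ALLOCATION_POLICY")

	resp := &pluginapi.PreferredAllocationResponse{}
	for _, creq := range req.ContainerRequests {
		ids := append([]string{}, creq.AvailableDeviceIDs...)
		sort.Strings(ids) // replica IDs of the same physical GPU sort next to each other

		var preferred []string
		if policy == "pack" && int(creq.AllocationSize) <= len(ids) {
			// Take replicas in sorted order so they come from as few
			// physical GPUs as possible.
			preferred = ids[:creq.AllocationSize]
		}
		// For "spread" this sketch simply returns no preference; the real
		// plugin would keep its existing distributed selection here.
		resp.ContainerResponses = append(resp.ContainerResponses,
			&pluginapi.ContainerPreferredAllocationResponse{DeviceIDs: preferred})
	}
	return resp, nil
}
```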
@shivamerla I believe there are cases where we would want distributed GPU scheduling, but there are also scenarios that call for the opposite. It would be great if this setting could be easily changed via a ConfigMap or similar.