Time-slicing with multiple GPUs - asking for two GPUs puts both slots on a single GPU
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
### 1. Quick Debug Checklist
- [X] Are you running on an Ubuntu 18.04 node?
- [X] Are you running Kubernetes v1.13+?
- [X] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [X] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- [X] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
### 2. Issue or feature description
The new time-slicing functionality seemed like a great feature for us to use in our CI system. However, some of our tests need to run on two physical GPUs (simply because we want to test that running on two physical GPUs actually works).
Unfortunately, it seems like Kubernetes only counts the total number of slots, which means that our jobs will occasionally get both slots assigned to the same GPU, after which our test fails (since we start it by checking that we really have two physical GPUs).
While this might be intentional, I figured I should file a note about it, since it likely means that, for now, the time-slicing solution won't work for any CI or other setup that needs to test things on more than one physical GPU.
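For concreteness, here is a minimal sketch of the kind of pod involved (the name, image, and check are illustrative, not our actual CI job): it requests two GPU slots and fails immediately if the container does not actually see two distinct physical GPUs.

```yaml
# Illustrative only: requests two nvidia.com/gpu slots and fails fast if the
# container does not see two distinct physical GPUs. With time-slicing enabled,
# both slots can currently land on the same GPU, so nvidia-smi may list only one.
apiVersion: v1
kind: Pod
metadata:
  name: needs-two-physical-gpus
spec:
  restartPolicy: Never
  containers:
  - name: gpu-count-check
    image: nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04  # illustrative image
    command: ["sh", "-c", "test \"$(nvidia-smi -L | wc -l)\" -ge 2"]
    resources:
      limits:
        nvidia.com/gpu: 2
```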
### 3. Steps to reproduce the issue
- Perform a fresh installation of the GPU operator in a cluster where each node has two NVIDIA GPUs.
- Enable time-slicing and configure it to allow e.g. 4 replicas per GPU (a sketch of such a config follows this list).
- Start a number of jobs, with some only using a single GPU slot.
- In some cases, a job that has been assigned two GPU slots still only sees a single physical GPU.
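The time-slicing part of the setup looks roughly like the sketch below, based on the GPU operator's documented time-slicing configuration. The ConfigMap name, namespace, and the `any` profile key are illustrative, and the ClusterPolicy's `devicePlugin.config` has to be pointed at the ConfigMap for the operator to pick it up.

```yaml
# Sketch of a time-slicing config giving 4 replicas per physical GPU.
# ConfigMap name/namespace and the "any" profile key are illustrative;
# the ClusterPolicy's devicePlugin.config must reference this ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```

With this config, a node with two physical GPUs advertises 8 `nvidia.com/gpu` slots in total, and Kubernetes schedules against that single count, which is exactly where the behaviour described above comes from.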
To be clear, this isn't really a bug, since we're no worse off than without time-slicing, but for now we've had to disable it again :-)
This seems reasonable. If more than 1 shared GPU is requested, we should try to pull the replicas from different full GPUs (if possible) to simulate getting exclusive access to both of them. Will slate this for the next plugin release.
Awesome! Let me know if you need help testing or hacking. From our CI point of view, it's also important that we at least have a setting that ensures the pod doesn't start until Kubernetes can give it (shared) access to two separate GPUs, if that was requested :-)
I was thinking of hacking this myself, but then realised it might not be entirely trivial since k8s just reports the total number of GPU slots. I'm not entirely sure if it's possible to design resources in such a way that k8s won't schedule a pod unless there are (e.g.) two slots available on separate GPUs.
Maybe when `limits: nvidia.com/gpu: >1`, allow an option like `limits: nvidia.com/anti-affinity: true` to force the behaviour. I was very surprised this is not the default; I don't know what kind of use case is covered by a single user asking for 2 time-slices on the same GPU when more GPUs are free.
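To make the suggestion concrete, something like the sketch below. This is purely hypothetical: `nvidia.com/anti-affinity` does not exist today, and since extended-resource limits must be integer quantities, it would probably have to surface as an annotation or a plugin-level option rather than a `limits` entry.

```yaml
# Hypothetical illustration of the proposal above; nvidia.com/anti-affinity
# does not exist. Rendered as an annotation because resource limits cannot
# carry boolean values.
apiVersion: v1
kind: Pod
metadata:
  name: spread-across-gpus             # illustrative name
  annotations:
    nvidia.com/anti-affinity: "true"   # proposed knob, not a real feature
spec:
  containers:
  - name: cuda-test
    image: nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04  # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 2              # two time-sliced replicas requested
```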
Given how time-slicing works, I don't think "two slots on one GPU" even means anything (a process seeing a single GPU can still start many processes on it), which I suspect is the reason it's suggested to disallow it by default.
At least for our CI usage, an affinity setting that merely tries to achieve this won't suffice - we need to be certain that we actually have two separate GPUs when the pod starts :-)
> I was very surprised this is not the default
It's not the "default" per se; we just don't do anything intelligent at the moment to influence the algorithm that runs inside the kubelet to allocate GPUs to a user. It might pick two replicas from the same GPU or it might not; it doesn't care. There are ways to influence this decision, though, and we will leverage this in our update for the next release.
This has now been merged and will be part of the upcoming 0.13.0 release: https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/224
Hi @eriklindahl @elgalu.
We have just released v0.13.0.
This includes a fix to balance replicas more effectively. Please give it a go and let us know if you have any issues.