Kevin Klues

279 comments by Kevin Klues

Hi @dims, thanks for pulling us in here. First, let me clear up some confusion. `nvidia-docker` has never been a fork of the docker source code. What is commonly referred...

We should move this discussion to https://github.com/NVIDIA/nvidia-docker/issues as it is no longer relevant to the original issue.

The NVIDIA wrapper around runc will become obsolete once the next version of containerd comes out, as CDI support landed in containerd about a month ago: https://github.com/container-orchestrated-devices/container-device-interface It allows arbitrary...
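For readers unfamiliar with CDI, here is a rough sketch of the two pieces involved: a fully qualified device name of the form `vendor/class=name`, and a spec document that tells the runtime which edits to apply to the container for each named device. The helper names `parse_qualified_name` and `build_spec`, the `nvidia.com/gpu` kind, the `/dev/nvidia0` path, and the `0.5.0` version string are illustrative assumptions, not taken from the comment above, and this is not the CDI library's own API.

```python
import json


def parse_qualified_name(device: str) -> tuple[str, str, str]:
    """Split a CDI-style name 'vendor/class=name' into (vendor, class, name).

    Hypothetical helper for illustration; real runtimes use the CDI Go packages.
    """
    kind, sep, name = device.partition("=")
    if not sep or not name:
        raise ValueError(f"missing '=<name>' in {device!r}")
    vendor, sep, cls = kind.partition("/")
    if not sep or not vendor or not cls:
        raise ValueError(f"missing '<vendor>/<class>' in {device!r}")
    return vendor, cls, name


def build_spec(vendor: str, cls: str, devices: dict[str, list[str]]) -> dict:
    """Build a minimal CDI-shaped spec mapping device names to device nodes."""
    return {
        "cdiVersion": "0.5.0",  # assumed version string, for illustration only
        "kind": f"{vendor}/{cls}",
        "devices": [
            {
                "name": name,
                "containerEdits": {
                    "deviceNodes": [{"path": path} for path in paths],
                },
            }
            for name, paths in devices.items()
        ],
    }


if __name__ == "__main__":
    vendor, cls, name = parse_qualified_name("nvidia.com/gpu=0")
    spec = build_spec(vendor, cls, {name: ["/dev/nvidia0"]})
    print(json.dumps(spec, indent=2))
```

The point of the sketch is that the runtime only needs the name and the spec file to inject a device, which is why a vendor-specific wrapper around runc is no longer required.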

With the recently accepted KEP linked below, general resource management is moving towards a more dynamic model going forward. One will no longer be limited to providing a simple "count"...

Please see this document on why this is not feasible under the current Kubernetes resource model: [Challenges Supporting Multi-Instance GPUs (MIG) in Kubernetes](https://docs.google.com/document/d/1Dxx5MwG_GiBeKOuMNwv4QbO8OqA7XFdzn7fzzI7AQDg/edit#) Once the following newly accepted Kubernetes Enhancement...

Once we have [Dynamic Resource Allocation](https://github.com/kubernetes/enhancements/pull/3064) all of what you propose will be possible. We do not plan to "hack" this support onto the existing plugin and instead will be...

This could happen if this calculation gives back the wrong value (in this case 79): https://github.com/NVIDIA/mig-parted/blob/main/pkg/types/mig_profile.go#L48 Would need to dig into why this would happen. Unfortunately there is no way...

The heavy-duty workaround is to update to a version of Kubernetes that contains this patch: https://github.com/kubernetes/kubernetes/pull/101771 The lighter-weight workaround would be to make sure that your pod requests a set...

Yes, that is what I was suggesting. So you are seeing this error even with the setting above for CPU/memory? Is this the *only* container in the pod (no init...

OK. Yeah, everything looks good from the perspective of the pod specs, etc. I’m guessing you must be running into the runc bug then: https://github.com/opencontainers/runc/issues/2366#issue-609480075 And the only way to...