
`nvidia.com/gpu.memory` capacity

Open faust64 opened this issue 2 years ago • 1 comments

Hey,

I have a customer using the NVIDIA GPU Operator alongside a custom controller (built with fabric8) that reads the nvidia.com/gpu.memory label added by GPU Feature Discovery, then patches Node objects, adding an nvidia.com/gpu.memory entry to the node's capacity/allocatable resources.

I was surprised to see this is not handled by the NVIDIA operator out of the box.

With this set, our clusters' end users are able to schedule pods without requesting GPUs explicitly - thus, a single GPU may be shared by more than one container.

Any plan to implement something similar? I don't think I can share my customer's code, and I'm not sure Java code would help here anyway. For the record: while the code at https://github.com/NVIDIA/gpu-feature-discovery/blob/main/internal/lm/resource.go#L36-L73 only adds a label to nodes, we could also patch the corresponding Node's status.capacity, adding or updating an entry for a resource named nvidia.com/gpu.memory.
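For illustration only, here is a minimal Go sketch of the approach described above: building a JSON merge-patch body that adds an extended-resource entry to a Node's status.capacity. The resource name matches the GFD label; the quantity ("40960", i.e. MiB) and the helper name `buildCapacityPatch` are made up for this example, and actually sending the patch (shown in a comment) would go through client-go against the Node's status subresource.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildCapacityPatch builds a JSON merge-patch body that adds (or overwrites)
// an extended-resource entry in a Node's status.capacity. In the scenario
// described above, the quantity would mirror the value of the
// nvidia.com/gpu.memory label set by GPU Feature Discovery.
func buildCapacityPatch(resourceName, quantity string) ([]byte, error) {
	patch := map[string]any{
		"status": map[string]any{
			"capacity": map[string]string{
				resourceName: quantity,
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	// "40960" is a placeholder quantity, not a real measurement.
	body, err := buildCapacityPatch("nvidia.com/gpu.memory", "40960")
	if err != nil {
		panic(err)
	}
	// A controller would then send this body with a PATCH to the Node's
	// status subresource, e.g. via client-go (sketch, not verified):
	//   clientset.CoreV1().Nodes().Patch(ctx, nodeName,
	//       types.MergePatchType, body, metav1.PatchOptions{}, "status")
	fmt.Println(string(body))
}
```

Once patched, the scheduler treats nvidia.com/gpu.memory like any other extended resource, so pods can request it in resources.limits without also requesting nvidia.com/gpu.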

Thanks!

faust64 avatar Mar 14 '23 23:03 faust64

/cc @klueska

elezar avatar Mar 15 '23 05:03 elezar

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Aug 26 '24 04:08 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Sep 25 '24 04:09 github-actions[bot]