Access NVIDIA GPUs in K8s in a non-privileged container
Hello - I'm trying to see if it's possible to deploy NVIDIA DCGM on K8s with the securityContext.privileged field set to false for security reasons.
I was able to get this working by setting the container's GPU resource requests and security context as follows:
resources:
  requests:
    nvidia.com/gpu: "1"
  limits:
    nvidia.com/gpu: "1"
securityContext:
  capabilities:
    add:
      - SYS_ADMIN
    drop:
      - ALL
However, this is not ideal for a few reasons:
- We sacrifice an entire GPU just for monitoring, which is an over-allocation as DCGM does not need the full GPU compute capacity.
- This prevents other workloads from using an expensive resource.
- The Kubernetes scheduler will only place the monitoring pod on nodes that still have unallocated GPU capacity.
- The container only seems to have access to one GPU device instead of all of the devices available on the node.
Is there any way to give the container access to the GPU devices without reserving them via an nvidia.com/gpu resource request?
Thanks for any help you can provide.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
Hey @elezar - I see that you're assigned to this. Is this feasible in any way that you know of?
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
Hey @elezar gentle ping :)
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
Isn't this more appropriate for either the DCGM or DCGM Exporter repositories? If this refers to deploying DCGM Exporter, the DaemonSet used to do so is neither privileged nor does it request any GPUs. It uses node labels (node affinity) to schedule the DCGM Exporter pods only on nodes that have NVIDIA GPUs.
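For reference, here is a minimal sketch of what such a DaemonSet could look like. It assumes the NVIDIA Container Toolkit handles containers on the GPU nodes (here requested via an nvidia RuntimeClass) and that GPU nodes carry a label such as nvidia.com/gpu.present=true (e.g. applied by GPU Feature Discovery); the image tag and label name are illustrative, not prescriptive:

# Illustrative, non-privileged DCGM Exporter DaemonSet (not an official manifest).
# Assumes: NVIDIA Container Toolkit on GPU nodes, an "nvidia" RuntimeClass, and
# a GPU node label (nvidia.com/gpu.present=true) from GPU Feature Discovery.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"  # schedule only onto GPU nodes
      runtimeClassName: nvidia          # assumption: nvidia RuntimeClass is installed
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04  # illustrative tag
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: all                # expose every GPU on the node without an nvidia.com/gpu request
          ports:
            - name: metrics
              containerPort: 9400       # DCGM Exporter's Prometheus metrics port
          securityContext:
            privileged: false
            capabilities:
              drop: ["ALL"]
              add: ["SYS_ADMIN"]        # typically only needed for DCGM profiling metrics

Because the pod never requests nvidia.com/gpu, no GPU is removed from the schedulable pool, and the NVIDIA_VISIBLE_DEVICES=all environment variable lets the NVIDIA runtime expose all of the node's GPUs to the container rather than the single device the device plugin would otherwise allocate.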