How does vGPU get licensed when using the gpu-operator?
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04
- Kernel Version: 5.4.0-137-generic
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): docker://20.10.23
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): k8s
- GPU Operator Version: v22.9.2
2. Issue or feature description
Hi, I'm trying to deploy the gpu-operator on my k8s cluster, whose vGPU node comes from vSphere (VMware ESXi 8). I want to use my vCS license (I have a DLS instance), so I'm following this document: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html
However, after the deployment, `nvidia-smi -q` only reports "Unlicensed", both in the workload pods and on the node (on the node, `nvidia-smi` is not even installed).
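For context, the `licensing-config` ConfigMap referenced below was created roughly as described in the linked document. A minimal sketch (the two local files are the NLS client token downloaded from my DLS instance and the `gridd.conf` shown further down):

```sh
# Sketch: package gridd.conf and the NLS client token into the ConfigMap
# that the GPU Operator mounts into the driver container.
kubectl create configmap licensing-config \
    -n gpu-operator \
    --from-file=gridd.conf \
    --from-file=client_configuration_token.tok
```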
3. Steps to reproduce the issue
- I have the 535.54.06 driver on my ESXi host:
[root@exsi:~] nvidia-smi
Fri Dec 8 07:35:02 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.06              Driver Version: 535.54.06    CUDA Version: N/A      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100X                   On  | 00000000:B5:00.0 Off |                    0 |
| N/A   38C    P0              70W / 300W |  80896MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100X                   On  | 00000000:DE:00.0 Off |                    0 |
| N/A   35C    P0              69W / 300W |  40448MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
- I have a DLS instance that serves vCS licenses. I generated the `client_configuration_token.tok` and confirmed that this token works fine with `FeatureType=4` set in `gridd.conf` in another legacy k8s cluster (the `gridd.conf` I use is sketched after the listing below).
- I deployed the gpu-operator following the document https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html (when building the driver container, I used the driver `NVIDIA-Linux-x86_64-535.54.03-grid.run`). The deployment seems fine:
$ k -n gpu-operator get cm,deploy,statefulset,daemonset,pods
NAME                                                         DATA   AGE
configmap/default-gpu-clients                                1      92d
configmap/default-mig-parted-config                          1      92d
configmap/gpu-operator-node-feature-discovery-worker-conf    1      92d
configmap/kube-root-ca.crt                                   1      92d
configmap/licensing-config                                   2      78m

NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gpu-operator                                 1/1     1            1           92d
deployment.apps/gpu-operator-node-feature-discovery-master   1/1     1            1           92d

NAME                                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
daemonset.apps/gpu-feature-discovery                         1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true   92d
daemonset.apps/gpu-operator-node-feature-discovery-worker    4         4         4       4            4           <none>                                             92d
daemonset.apps/nvidia-container-toolkit-daemonset            1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       92d
daemonset.apps/nvidia-dcgm-exporter                          1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true           92d
daemonset.apps/nvidia-device-plugin-daemonset                1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true           92d
daemonset.apps/nvidia-driver-daemonset                       1         1         1       1            1           nvidia.com/gpu.deploy.driver=true                  92d
daemonset.apps/nvidia-mig-manager                            0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             92d
daemonset.apps/nvidia-operator-validator                     1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true      92d

NAME                                                               READY   STATUS        RESTARTS   AGE
pod/gpu-feature-discovery-nvphm                                    1/1     Running       0          71m
pod/gpu-operator-6ddf8d789d-szqmq                                  1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-master-59b4b67f4f-r4fqw    1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-bk7tm               1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-d95vb               1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-j74sr               1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-s65pw               1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-x67jn               1/1     Terminating   0          92d
pod/nvidia-container-toolkit-daemonset-2wc8f                       1/1     Running       0          71m
pod/nvidia-cuda-validator-glwxb                                    0/1     Completed     0          63m
pod/nvidia-dcgm-exporter-z9txp                                     1/1     Running       0          71m
pod/nvidia-device-plugin-daemonset-n8z4r                           1/1     Running       0          71m
pod/nvidia-device-plugin-validator-sd6p7                           0/1     Completed     0          62m
pod/nvidia-driver-daemonset-hmcvq                                  1/1     Running       0          71m
pod/nvidia-operator-validator-g6msn                                1/1     Running       0          71m
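For reference, the `gridd.conf` packed into `licensing-config` is essentially just the feature type. A sketch of my file (the comments on the other values reflect my understanding of the vGPU feature types, not output from my system):

```
# gridd.conf used in the licensing-config ConfigMap (sketch)
# FeatureType=1 -> NVIDIA vGPU (what I later tried with the NVAIE DLS token)
# FeatureType=2 -> NVIDIA RTX Virtual Workstation
# FeatureType=4 -> NVIDIA Virtual Compute Server (vCS)
FeatureType=4
```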
- Then I started a workload pod that just runs `nvidia-smi -q` (the pod spec is sketched after the output below), but in the pod logs I can only see "Unlicensed":
vGPU Software Licensed Product
        Product Name                      : NVIDIA Virtual Compute Server
        License Status                    : Unlicensed (Restricted)
I also SSHed into the GPU node and tried to run `nvidia-smi -q`, but it reported that `nvidia-smi` is not installed.
By the way, this is the output of `nvidia-smi` in the pod:
root@pod:/# nvidia-smi
Fri Dec 8 07:16:35 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100D-40C                 On  | 00000000:02:00.0 Off |                    0 |
| N/A   N/A    P0              N/A / N/A  |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
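The workload pod itself is nothing special; it is roughly equivalent to the sketch below (the pod name and CUDA base image are placeholders, not the exact ones I used):

```sh
# Sketch: request one vGPU and dump the license section of nvidia-smi -q.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-license-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi", "-q"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Once the pod has completed, check the licensing section of its logs.
kubectl logs vgpu-license-test | grep -A 2 "vGPU Software Licensed Product"
```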
- I realized that vGPU 16 may not support the vCS license system anymore, so I re-created the `licensing-config` ConfigMap with a `client_configuration_token.tok` from an NVAIE DLS instance, set `FeatureType=1` in `gridd.conf` (both work in another k8s cluster), and restarted all the gpu-operator pods (the exact steps are sketched below). But I still see the same output as in the previous step, which is pretty weird.
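For reference, the swap to the NVAIE token was done roughly like this (a sketch; `gridd.conf` here is the NVAIE variant with `FeatureType=1`, and the token is the one generated by the NVAIE DLS instance):

```sh
# Replace the licensing ConfigMap and restart the driver pods so that
# nvidia-gridd picks up the new token and feature type.
kubectl delete configmap licensing-config -n gpu-operator

kubectl create configmap licensing-config \
    -n gpu-operator \
    --from-file=gridd.conf \
    --from-file=client_configuration_token.tok

kubectl rollout restart daemonset/nvidia-driver-daemonset -n gpu-operator
```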
4. Information to attach (optional if deemed irrelevant)
- [x] kubernetes pods status: `kubectl get pods -n OPERATOR_NAMESPACE` (provided above)
- [x] kubernetes daemonset status: `kubectl get ds -n OPERATOR_NAMESPACE` (provided above)
- [x] If a pod/ds is in an error state or pending state: `kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME` (N/A, no pod/ds is in an error or pending state)
- [x] If a pod/ds is in an error state or pending state: `kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers` (N/A, no pod/ds is in an error or pending state)
- [x] Output from running `nvidia-smi` from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`
➭ k -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- nvidia-smi
Fri Dec 8 08:03:50 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100D-40C                 On  | 00000000:02:00.0 Off |                    0 |
| N/A   N/A    P0              N/A / N/A  |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

➭ k -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- nvidia-smi -q
vGPU Software Licensed Product
        Product Name                      : NVIDIA Virtual Compute Server
        License Status                    : Unlicensed (Restricted)
- [x] containerd logs: `journalctl -u containerd > containerd.log`
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
[containerd.log](https://github.com/NVIDIA/gpu-operator/files/13611273/containerd.log)
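One additional check that may help is verifying that the files from the `licensing-config` ConfigMap are actually mounted into the driver container (a sketch; the `/drivers` paths are my assumption about where the operator mounts them, adjust if they differ on your install):

```sh
# Confirm gridd.conf and the client configuration token are visible inside
# the driver pod (paths assumed, based on how the licensing ConfigMap is mounted).
kubectl exec -n gpu-operator nvidia-driver-daemonset-hmcvq -c nvidia-driver-ctr -- \
    ls -l /drivers/gridd.conf /drivers/ClientConfigToken/

kubectl exec -n gpu-operator nvidia-driver-daemonset-hmcvq -c nvidia-driver-ctr -- \
    cat /drivers/gridd.conf
```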
@yuzs2 Can you check for any errors from nvidia-gridd in dmesg? `dmesg | grep -i gridd`
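For example, that check can be run from inside the driver container (a sketch, reusing the driver pod name from the listing above):

```sh
# Look for nvidia-gridd licensing errors in the kernel log from the driver pod.
kubectl exec -n gpu-operator nvidia-driver-daemonset-hmcvq -c nvidia-driver-ctr -- \
    sh -c 'dmesg | grep -i gridd'
```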