
How does vGPU get licensed when using gpu-operator?

Open yuzs2 opened this issue 2 years ago • 1 comment

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu20.04
  • Kernel Version: 5.4.0-137-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): docker://20.10.23
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): k8s
  • GPU Operator Version: v22.9.2

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

Hi, I'm trying to deploy the gpu-operator on my k8s cluster, whose vGPU node comes from vSphere (VMware ESXi 8). I want to use my vCS license (I have a DLS instance), so I'm following this document: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html. However, after deploying, I only see "Unlicensed" when running nvidia-smi -q in the workload pods, and on the node itself nvidia-smi is not even installed.
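For reference, the licensing setup I did per that document is roughly the following sketch. The gridd.conf contents are from my setup; the helm value names (driver.repository, driver.version, driver.licensingConfig.*) are as I understood them from the doc, so treat them as assumptions rather than exact flags.

# gridd.conf for a vCS license served by DLS
cat > gridd.conf <<EOF
FeatureType=4
EOF

# client_configuration_token.tok was generated from my DLS instance
kubectl create configmap licensing-config -n gpu-operator \
  --from-file=gridd.conf \
  --from-file=client_configuration_token.tok

# point the operator at the licensing ConfigMap and at my custom-built vGPU driver image
helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator \
  --set driver.repository=<my-private-registry> \
  --set driver.version=535.54.03 \
  --set driver.licensingConfig.configMapName=licensing-config \
  --set driver.licensingConfig.nlsEnabled=true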

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

  1. I have the 535.54.06 driver installed on my ESXi host:
[root@exsi:~] nvidia-smi
Fri Dec  8 07:35:02 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.06              Driver Version: 535.54.06    CUDA Version: N/A      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100X                   On  | 00000000:B5:00.0 Off |                    0 |
| N/A   38C    P0              70W / 300W |  80896MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100X                   On  | 00000000:DE:00.0 Off |                    0 |
| N/A   35C    P0              69W / 300W |  40448MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                     
  2. I have a DLS instance that serves a vCS license. I generated the client_configuration_token.tok and confirmed this token works fine with FeatureType=4 set in gridd.conf in another legacy k8s cluster.
  3. I deployed the gpu-operator following the document https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html (when building the driver container I used the driver NVIDIA-Linux-x86_64-535.54.03-grid.run). The deployment seems fine (a quick verification sketch follows the listing below):
$ k -n gpu-operator get cm,deploy,statefulset,daemonset,pods
NAME                                                        DATA   AGE
configmap/default-gpu-clients                               1      92d
configmap/default-mig-parted-config                         1      92d
configmap/gpu-operator-node-feature-discovery-worker-conf   1      92d
configmap/kube-root-ca.crt                                  1      92d
configmap/licensing-config                                  2      78m

NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gpu-operator                                 1/1     1            1           92d
deployment.apps/gpu-operator-node-feature-discovery-master   1/1     1            1           92d

NAME                                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
daemonset.apps/gpu-feature-discovery                        1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true   92d
daemonset.apps/gpu-operator-node-feature-discovery-worker   4         4         4       4            4           <none>                                             92d
daemonset.apps/nvidia-container-toolkit-daemonset           1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       92d
daemonset.apps/nvidia-dcgm-exporter                         1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true           92d
daemonset.apps/nvidia-device-plugin-daemonset               1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true           92d
daemonset.apps/nvidia-driver-daemonset                      1         1         1       1            1           nvidia.com/gpu.deploy.driver=true                  92d
daemonset.apps/nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             92d
daemonset.apps/nvidia-operator-validator                    1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true      92d

NAME                                                              READY   STATUS        RESTARTS   AGE
pod/gpu-feature-discovery-nvphm                                   1/1     Running       0          71m
pod/gpu-operator-6ddf8d789d-szqmq                                 1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-master-59b4b67f4f-r4fqw   1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-bk7tm              1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-d95vb              1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-j74sr              1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-s65pw              1/1     Running       0          72m
pod/gpu-operator-node-feature-discovery-worker-x67jn              1/1     Terminating   0          92d
pod/nvidia-container-toolkit-daemonset-2wc8f                      1/1     Running       0          71m
pod/nvidia-cuda-validator-glwxb                                   0/1     Completed     0          63m
pod/nvidia-dcgm-exporter-z9txp                                    1/1     Running       0          71m
pod/nvidia-device-plugin-daemonset-n8z4r                          1/1     Running       0          71m
pod/nvidia-device-plugin-validator-sd6p7                          0/1     Completed     0          62m
pod/nvidia-driver-daemonset-hmcvq                                 1/1     Running       0          71m
pod/nvidia-operator-validator-g6msn                               1/1     Running       0          71m
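In case it helps, this is how I can check that the licensing files actually land inside the driver container and that nvidia-gridd is running. The /etc/nvidia paths are my assumption of where the grid driver reads them; they may differ in the operator's driver image.

kubectl -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- cat /etc/nvidia/gridd.conf
kubectl -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- ls /etc/nvidia/ClientConfigToken/
# check the nvidia-gridd daemon is up (ps needs procps in the image)
kubectl -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- sh -c 'ps aux | grep -i gridd | grep -v grep'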
  4. Then I started a workload pod that just runs nvidia-smi -q (a minimal sketch of such a pod follows the output below), but in the pod logs I can only see it is unlicensed:
vGPU Software Licensed Product
        Product Name                      : NVIDIA Virtual Compute Server
        License Status                    : Unlicensed (Restricted)

I also SSHed into the GPU node and tried to run nvidia-smi -q, but it said nvidia-smi is not installed. By the way, this is the output of nvidia-smi in the pod:

root@pod:/# nvidia-smi
Fri Dec  8 07:16:35 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100D-40C                 On  | 00000000:02:00.0 Off |                    0 |
| N/A   N/A    P0              N/A /  N/A |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
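This is roughly the kind of workload pod I used for the check above; the image tag is only an example CUDA base image, not necessarily the one I ran.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-license-check
spec:
  restartPolicy: Never
  containers:
  - name: check
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu20.04
    command: ["nvidia-smi", "-q"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# once the pod completes
kubectl logs vgpu-license-check | grep -A2 "vGPU Software Licensed Product"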
  5. I realized that vGPU 16 may no longer support the vCS license, so I re-created the licensing-config ConfigMap with a client_configuration_token.tok from an NVAIE DLS instance and set FeatureType=1 in gridd.conf (these work in another k8s cluster), then restarted all the gpu-operator pods. But I still see the same output as in the previous step, which is pretty weird. A sketch of how I swapped the token follows.
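For completeness, this is roughly how I swapped the token and restarted things; whether a rollout restart of the driver daemonset is enough for nvidia-gridd to re-read the token is an assumption on my part.

kubectl -n gpu-operator delete configmap licensing-config
kubectl -n gpu-operator create configmap licensing-config \
  --from-file=gridd.conf \
  --from-file=client_configuration_token.tok
kubectl -n gpu-operator rollout restart daemonset/nvidia-driver-daemonset
kubectl -n gpu-operator rollout status daemonset/nvidia-driver-daemonset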

4. Information to attach (optional if deemed irrelevant)

  • [x] kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE (provided above)
  • [x] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE (provided above)
  • [x] If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME: N/A, no pods are in an error or pending state
  • [x] If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers: N/A, no pods are in an error or pending state
  • [x] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
➭ k -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- nvidia-smi
Fri Dec  8 08:03:50 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100D-40C                 On  | 00000000:02:00.0 Off |                    0 |
| N/A   N/A    P0              N/A /  N/A |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
➭ k -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- nvidia-smi -q
    vGPU Software Licensed Product
        Product Name                      : NVIDIA Virtual Compute Server
        License Status                    : Unlicensed (Restricted)
  • [x] containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh
[containerd.log](https://github.com/NVIDIA/gpu-operator/files/13611273/containerd.log)

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

yuzs2 • Dec 08 '23 08:12

@yuzs2 can you check for any errors in dmesg from nvidia-gridd: dmesg | grep -i gridd
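For example, from the driver pod above this could be run as follows (assuming dmesg is usable inside the driver container, which should be the case since it runs privileged):

kubectl -n gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-hmcvq -- sh -c 'dmesg | grep -i gridd'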

shivamerla • Dec 21 '23 01:12