
Cannot distinguish T4 and A100?

Open ggjjlldd opened this issue 2 years ago • 3 comments

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

My machine has 5 NVIDIA GPUs, including a T4 card and A100 cards, as shown below. But the device plugin cannot distinguish the T4 from the A100s: every card is advertised as nvidia.com/gpu. I would like the T4 advertised as nvidia.com/t4 and the A100s as nvidia.com/a100. How can I do this?

Capacity:
  cpu:                            64
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              459403376Ki
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         263855916Ki
  nvidia.com/gpu:                 4
  nvidia.com/hostdev:             0
  nvidia.com/mig-1g.5gb:          7
  nvidia.com/mig-2g.10gb:         0
  nvidia.com/mig-4g.20gb:         0
  pods:                           110
root@k8s-gpuworker01:/var/lib/kubelet/device-plugins# nvidia-smi
Tue Feb  7 14:19:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:1A:00.0 Off |                    0 |
| N/A   32C    P8    14W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:3D:00.0 Off |                   On |
| N/A   32C    P0    66W / 250W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  On   | 00000000:3E:00.0 Off |                    0 |
| N/A   30C    P0    54W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCI...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   28C    P0    40W / 250W |  40229MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PCI...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   28C    P0    61W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  1    7   0   0  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    8   0   1  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    9   0   2  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   11   0   3  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   12   0   4  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   13   0   5  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   14   0   6  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    3   N/A  N/A   3135236      C   ...conda3/envs/dl/bin/python    40227MiB |
+-----------------------------------------------------------------------------+

2. Steps to reproduce the issue

Install the k8s-device-plugin, then run kubectl describe node.
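As a partial workaround available today (this is not the per-model resource name asked for above, and it assumes NVIDIA's gpu-feature-discovery is also deployed on the cluster), nodes can be selected by the nvidia.com/gpu.product label that gpu-feature-discovery applies. A minimal sketch of a pod pinned to A100 nodes; the label value shown is an example and the actual value should be checked with kubectl get nodes -L nvidia.com/gpu.product:

```yaml
# Hypothetical pod spec: schedule only onto nodes whose
# nvidia.com/gpu.product label (set by gpu-feature-discovery,
# if deployed) matches an A100 product string.
apiVersion: v1
kind: Pod
metadata:
  name: a100-only
spec:
  nodeSelector:
    # Example value; the real string depends on the driver-reported
    # product name on your nodes.
    nvidia.com/gpu.product: NVIDIA-A100-PCIE-40GB
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:11.6.2-base-ubuntu20.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

Note that this only distinguishes GPUs at node granularity: on a mixed node like the one above, a nodeSelector cannot separate the T4 from the A100s within the same machine.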

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • [ ] The output of nvidia-smi -a on your host
  • [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
  • [ ] The k8s-device-plugin container logs
  • [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • [ ] Docker version from docker version
  • [ ] Docker command, image and tag used
  • [ ] Kernel version from uname -a
  • [ ] Any relevant kernel output lines from dmesg
  • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • [ ] NVIDIA container library version from nvidia-container-cli -V
  • [ ] NVIDIA container library logs (see troubleshooting)

ggjjlldd avatar Feb 07 '23 08:02 ggjjlldd

We built a feature last summer to do exactly what you describe. It is feature complete, but currently disabled in the plugin awaiting approval from our product team. It is unclear when or if it will ever be approved.

Here is a description of the feature: https://docs.google.com/document/d/1dL67t9IqKC2-xqonMi6DV7W2YNZdkmfX7ibB6Jb-qmk/edit

klueska avatar Feb 07 '23 08:02 klueska

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 28 '24 04:02 github-actions[bot]

I also really need this feature!

mkarami2024 avatar Apr 09 '24 13:04 mkarami2024