k8s-device-plugin
Cannot distinguish T4 and A100?
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Issue or feature description
My machine has 5 NVIDIA GPUs, a mix of T4 and A100 cards, as shown below. But the device plugin cannot distinguish the T4 from the A100: every card is advertised as nvidia.com/gpu. I would like the T4 to be advertised as nvidia.com/t4 and the A100 as nvidia.com/a100. How can I do this?
Capacity:
cpu: 64
devices.kubevirt.io/kvm: 1k
devices.kubevirt.io/tun: 1k
devices.kubevirt.io/vhost-net: 1k
ephemeral-storage: 459403376Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263855916Ki
nvidia.com/gpu: 4
nvidia.com/hostdev: 0
nvidia.com/mig-1g.5gb: 7
nvidia.com/mig-2g.10gb: 0
nvidia.com/mig-4g.20gb: 0
pods: 110
root@k8s-gpuworker01:/var/lib/kubelet/device-plugins# nvidia-smi
Tue Feb 7 14:19:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:1A:00.0 Off | 0 |
| N/A 32C P8 14W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... On | 00000000:3D:00.0 Off | On |
| N/A 32C P0 66W / 250W | 45MiB / 40960MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-PCI... On | 00000000:3E:00.0 Off | 0 |
| N/A 30C P0 54W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCI... On | 00000000:88:00.0 Off | 0 |
| N/A 28C P0 40W / 250W | 40229MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-PCI... On | 00000000:89:00.0 Off | 0 |
| N/A 28C P0 61W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 1 7 0 0 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 8 0 1 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 9 0 2 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 11 0 3 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 12 0 4 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 13 0 5 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 14 0 6 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 3 N/A N/A 3135236 C ...conda3/envs/dl/bin/python 40227MiB |
+-----------------------------------------------------------------------------+
2. Steps to reproduce the issue
Install the k8s-device-plugin, then run `kubectl describe node` on the GPU node.
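The plugin was presumably deployed with a mixed MIG strategy, since the capacity above advertises both whole GPUs (`nvidia.com/gpu`) and MIG slices (`nvidia.com/mig-1g.5gb`). A minimal sketch of the corresponding Helm values, assuming the `nvdp/nvidia-device-plugin` chart from https://nvidia.github.io/k8s-device-plugin (the chart version and other values in the actual cluster may differ):

```yaml
# values.yaml sketch for the nvdp/nvidia-device-plugin Helm chart.
# "mixed" advertises whole GPUs as nvidia.com/gpu and MIG slices as
# nvidia.com/mig-<profile>, which matches the node capacity listed above.
migStrategy: mixed
```

The chart would then be installed with something like `helm upgrade -i nvdp nvdp/nvidia-device-plugin -n kube-system -f values.yaml`.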
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- [ ] The output of `nvidia-smi -a` on your host
- [ ] Your docker configuration file (e.g: `/etc/docker/daemon.json`)
- [ ] The k8s-device-plugin container logs
- [ ] The kubelet logs on the node (e.g: `sudo journalctl -r -u kubelet`)
Additional information that might help better understand your environment and reproduce the bug:
- [ ] Docker version from `docker version`
- [ ] Docker command, image and tag used
- [ ] Kernel version from `uname -a`
- [ ] Any relevant kernel output lines from `dmesg`
- [ ] NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`
- [ ] NVIDIA container library version from `nvidia-container-cli -V`
- [ ] NVIDIA container library logs (see troubleshooting)
We built a feature last summer to do exactly what you describe. It is feature complete, but currently disabled in the plugin awaiting approval from our product team. It is unclear when or if it will ever be approved.
Here is a description of the feature: https://docs.google.com/document/d/1dL67t9IqKC2-xqonMi6DV7W2YNZdkmfX7ibB6Jb-qmk/edit
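In the meantime, a workaround that is often suggested (not an official answer, and only useful at node granularity) is to run NVIDIA's gpu-feature-discovery alongside the device plugin so that each node gets labeled with its GPU model (for example `nvidia.com/gpu.product`), and then pin pods to a model with a nodeSelector. Note that on a mixed node like the one above, which carries both a T4 and A100s, this does not control which card a pod actually receives, since the resource is still the generic `nvidia.com/gpu`. A sketch with an illustrative label value (check `kubectl get nodes --show-labels` for the exact string in your cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: t4-test
spec:
  nodeSelector:
    # Label set by gpu-feature-discovery; the exact value depends on the GPU
    # and GFD version (e.g. "Tesla-T4" or "NVIDIA-A100-PCIE-40GB").
    nvidia.com/gpu.product: Tesla-T4
  containers:
  - name: cuda
    image: nvidia/cuda:11.6.2-base-ubuntu20.04   # any CUDA-capable image works here
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # still the generic resource; the model is chosen via the node label
  restartPolicy: Never
```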
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
I also really need this feature!