Why are GPU resources not shown on the dashboard?
I want to know why GPU resource information is not shown on the dashboard. When I use the kubectl describe nodes CLI I get detailed GPU information, but I don't see any GPU information on the dashboard. Is this planned?
Environment
Dashboard version:
Kubernetes version:
Operating system:
Node.js version:
Go version:
Steps to reproduce
Observed result
Expected result
Comments
Is this planned?
Yes it is.
@wjdfx I assume that the GPU resource would be some extra information on the node details view. I have never tried this setup, so I don't know what it looks like.
@maciaszczykm
Do you mean that it is planned to add this information sometime in the future, or that the plan is not to show this information? If so, why?
Some relevant docs: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
The key (under resources and limits) is alpha.kubernetes.io/nvidia-gpu (also alpha.kubernetes.io/nvidia-gpu-name can be specified with the --node-labels='alpha.kubernetes.io/nvidia-gpu-name=xxx' kubelet option).
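For reference, a pod requesting a GPU under that alpha key might look like the sketch below. All names and the image are placeholders, and note that this alpha key was later superseded by device-plugin resources such as nvidia.com/gpu:

```yaml
# Illustrative only; pod name, container name, and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:10.0-base   # placeholder image
      resources:
        limits:
          alpha.kubernetes.io/nvidia-gpu: 1   # alpha-era GPU resource key
  nodeSelector:
    # Assumes the node was labeled via the --node-labels kubelet option above.
    alpha.kubernetes.io/nvidia-gpu-name: tesla-k80
```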
This is an alpha feature at the moment. It might make sense to wait until it enters beta at least?

This is an alpha feature at the moment. It might make sense to wait until it enters beta at least?
Yes, we should wait for at least beta.
Any follow-up on showing GPU stats in the dashboard?
We have been focused on more important topics lately, such as security and the login mechanism. This feature is rather low on our priority list for now. No ETA.
Any update?
Any update on showing GPU stats on Kubernetes Dashboard?
Is there any update on supporting GPU info in the dashboard?
any update?
This has low priority for us at the moment. If you are willing to contribute then let us know.
It would be great to have at least an indication that a pod has limits/requests set on any device that is compatible with the device plugin framework (so not only GPUs), and how much of that resource is requested. For example, if a node has 4 TPUs and there are 3 pods each consuming one TPU, it should be visible somewhere, ideally right next to CPU/memory. It would really help in debugging scheduling issues, if nothing else. At this point in time, devices exposed by the device plugin framework are treated like third-class citizens. CPU/memory is not enough.
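To illustrate what the dashboard would need to surface: a pod consuming a device-plugin resource declares it as an extended resource in its spec, and the dashboard could sum such requests against the node's allocatable total. The resource name example.com/tpu below is made up for the sketch:

```yaml
# Illustrative: a pod requesting one unit of a device-plugin resource.
# "example.com/tpu" is a hypothetical extended-resource name.
apiVersion: v1
kind: Pod
metadata:
  name: tpu-consumer
spec:
  containers:
    - name: worker
      image: busybox   # placeholder image
      resources:
        requests:
          example.com/tpu: 1
        limits:
          example.com/tpu: 1   # extended resources require requests == limits
```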
I am not a contributor, but I am looking into this issue. I found this on Nvidia's website https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html#gpu-telemetry
Which is configured to export to this grafana dashboard https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/
If we can replicate this effort, we could then set up the metrics-scraper to consume metrics with the same pattern that Nvidia uses to build that Grafana dashboard. We would want to provide information at the cluster level, with node- and namespace-level metrics.
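As a rough sketch of the scraping side, dcgm-exporter exposes its metrics over HTTP (port 9400 by default), so a Prometheus-style job could discover and keep those endpoints. The job name and the endpoints name dcgm-exporter below are assumptions that would need to match the actual deployment:

```yaml
# Sketch of a Prometheus scrape job for dcgm-exporter; names are assumptions.
scrape_configs:
  - job_name: dcgm-exporter
    kubernetes_sd_configs:
      - role: endpoints        # discover pods behind the exporter's Service
    relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        regex: dcgm-exporter   # keep only the exporter's endpoints
        action: keep
```

The metrics-scraper would then consume series such as per-GPU utilization the same way it consumes CPU/memory today.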
@maciaszczykm is there anyone from the contributors working on this that we could help?