dcgm-exporter

Getting "Error from server (NotFound): the server could not find the metric DCGM_FI_DEV_GPU_UTIL for pods"; DCGM_FI_DEV_GPU_UTIL metrics are not available from Prometheus

Open Vijaygawate opened this issue 6 months ago • 2 comments

Ask your question

I have installed the Prometheus stack, the Prometheus adapter, and dcgm-exporter, but when I try to query this metric through the custom metrics API I get the error below:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DCGM_FI_DEV_GPU_UTIL" | jq .

Error from server (NotFound): the server could not find the metric DCGM_FI_DEV_GPU_UTIL for pods
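For anyone reproducing this, one way to narrow it down is to check whether the adapter advertises the metric at all. The discovery document returned by `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"` lists every metric the custom metrics API currently exposes. The sketch below filters such a document for one metric name. The helper name `exposed_metrics` and the sample payload are hypothetical and for illustration only; they are not output from this cluster.

```python
import json


def exposed_metrics(api_resource_list, metric):
    """Return resource names in a custom-metrics discovery document
    whose metric part (the text after the last '/') equals `metric`."""
    doc = json.loads(api_resource_list)
    return [
        r["name"]
        for r in doc.get("resources", [])
        if r["name"].split("/")[-1] == metric
    ]


# Hypothetical sample of what the discovery endpoint might return when the
# adapter has no rule (or no data) for the DCGM metric. Not real cluster output.
sample = json.dumps({
    "kind": "APIResourceList",
    "apiVersion": "v1",
    "groupVersion": "custom.metrics.k8s.io/v1beta1",
    "resources": [
        {"name": "pods/some_other_metric", "namespaced": True,
         "kind": "MetricValueList", "verbs": ["get"]},
    ],
})

print(exposed_metrics(sample, "DCGM_FI_DEV_GPU_UTIL"))  # prints []
```

An empty result would suggest the adapter is not exposing the metric at all, as opposed to the metric being exposed but having no series for pods in the `default` namespace.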

My setup: I have two node groups in EKS. One is a regular EC2 instance group without GPUs, and on it I installed the Prometheus stack and the Prometheus adapter. The other is a GPU node group, on which I installed dcgm-exporter.

Is this split the cause of the error? In other words, do I need to install all components on the GPU nodes for this to work?

Vijaygawate, Aug 23 '24 05:08