dcgm-exporter
dcgm-exporter is not working on ec2 g5.48xlarge nodes
What is the version?
3.3.5-3.4.1-ubi9
What happened?
We are running dcgm-exporter under containerd on g5.48xlarge instances, and it fails to start with the following error:
[root@test-machine ~]# ctr run --env DCGM_EXPORTER_DEBUG=true --cap-add CAP_SYS_ADMIN --rm nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubi9 dcgm-exporter
2024/04/10 21:53:33 maxprocs: Leaving GOMAXPROCS=192: CPU quota undefined
time="2024-04-10T21:53:33Z" level=info msg="Starting dcgm-exporter"
time="2024-04-10T21:53:33Z" level=debug msg="Debug output is enabled"
time="2024-04-10T21:53:33Z" level=debug msg="Command line: /usr/bin/dcgm-exporter"
time="2024-04-10T21:53:33Z" level=debug msg="Loaded configuration" dump="&{CollectorsFile:/etc/dcgm-exporter/default-counters.csv Address::9400 CollectInterval:30000 Kubernetes:false KubernetesGPUIdType:uid CollectDCP:true UseOldNamespace:false UseRemoteHE:false RemoteHEInfo:localhost:5555 GPUDevices:{Flex:true MajorRange:[] MinorRange:[]} SwitchDevices:{Flex:true MajorRange:[] MinorRange:[]} CPUDevices:{Flex:true MajorRange:[] MinorRange:[]} NoHostname:false UseFakeGPUs:false ConfigMapData:none MetricGroups:[] WebSystemdSocket:false WebConfigFile: XIDCountWindowSize:300000 ReplaceBlanksInModelName:false Debug:true ClockEventsCountWindowSize:300000 EnableDCGMLog:false DCGMLogLevel:NONE PodResourcesKubeletSocket:/var/lib/kubelet/pod-resources/kubelet.sock}"
Error: Failed to initialize NVML
time="2024-04-10T21:53:33Z" level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1.1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:269 +0x3d\npanic({0x17dbac0?, 0x28fb390?})\n\t/usr/local/go/src/runtime/panic.go:914 +0x21f\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.initDCGM(0xc00026d1e0)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:509 +0x9b\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.startDCGMExporter(0x47c312?, 0xc00067a960)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:289 +0xb2\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:273 +0x5b\ngithub.com/NVIDIA/dcgm-exporter/pkg/stdout.Capture({0x1cbda38?, 0xc000638550}, 0xc00049fb70)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/stdout/capture.go:77 +0x1f5\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action(0xc000536a00)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:264 +0x67\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.NewApp.func1(0xc00062d080?)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:249 +0x13\ngithub.com/urfave/cli/v2.(*Command).Run(0xc00062d080, 0xc000536a00, {0xc0002a6050, 0x1, 0x1})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:279 +0x9dd\ngithub.com/urfave/cli/v2.(*App).RunContext(0xc00034d200, {0x1cbd920?, 0x29c12a0}, {0xc0002a6050, 0x1, 0x1})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:337 +0x5db\ngithub.com/urfave/cli/v2.(*App).Run(0xc00049ff20?, {0xc0002a6050?, 0x1?, 0x1616700?})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:311 +0x2f\nmain.main()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:35 +0x5f\n"
What did you expect to happen?
dcgm-exporter to start without any issues, as it does on a g5.2xlarge EC2 node, where we see the following log output:
2024/04/11 11:06:15 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
time="2024-04-11T11:06:15Z" level=info msg="Starting dcgm-exporter"
time="2024-04-11T11:06:15Z" level=info msg="DCGM successfully initialized!"
time="2024-04-11T11:06:15Z" level=info msg="Collecting DCP Metrics"
time="2024-04-11T11:06:15Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-04-11T11:06:15Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-11T11:06:15Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-11T11:06:15Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-11T11:06:15Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-11T11:06:15Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-11T11:06:15Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-04-11T11:06:15Z" level=info msg="Starting webserver"
time="2024-04-11T11:06:15Z" level=info msg="Pipeline starting"
time="2024-04-11T11:06:15Z" level=info msg="Listening on" address="[::]:9400"
time="2024-04-11T11:06:15Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
What is the GPU model?
[root@test-machine ~]# nvidia-smi
Wed Apr 10 21:55:27 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:16.0 N/A | N/A |
|ERR! ERR! ERR! N/A / N/A | 0MiB / 23028MiB | N/A Default |
| | | ERR! |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:17.0 Off | 0 |
| 0% 28C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:18.0 Off | 0 |
| 0% 27C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:19.0 Off | 0 |
| 0% 27C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A10G On | 00000000:00:1A.0 Off | 0 |
| 0% 27C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 28C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 27C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 28C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
[root@test-machine ~]#
What is the environment?
We spin up a Kubernetes cluster with the kops utility and run dcgm-exporter as a DaemonSet on GPU EC2 instances. The containerd version is:
[root@test-machine ~]# containerd --version
containerd github.com/containerd/containerd v1.6.21 3dce8eb055cbb6872793272b4f20ed16117344f8
[root@test-machine ~]#
The kubelet version is:
[root@test-machine ~]# kubelet --version
Kubernetes v1.24.13
How did you deploy the dcgm-exporter and what is the configuration?
We deploy dcgm-exporter with its Helm chart via Argo CD.
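For reference, a minimal sketch of an equivalent plain Helm install (repository URL and chart name as published in the dcgm-exporter README; any values overrides are omitted here):

# Add the NVIDIA dcgm-exporter chart repository and install the chart with a generated release name.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter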
How to reproduce the issue?
Try to run dcgm-exporter under containerd on a g5.48xlarge EC2 instance that uses the OSS DLAMI image:
ctr image pull nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubi9
ctr run --env DCGM_EXPORTER_DEBUG=true --cap-add CAP_SYS_ADMIN --rm nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubi9 dcgm-exporter
Anything else we need to know?
No response
@eselyavka, please make sure that containerd uses the NVIDIA runtime: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html.
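A minimal sketch of what that typically involves, assuming a toolkit release whose nvidia-ctk supports the containerd target (see the version discussion below):

# Register the NVIDIA runtime in /etc/containerd/config.toml, optionally make it the default,
# then restart containerd so the change takes effect.
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd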
@nvvfedorov, maybe the container runtime on those DLAMI images is just too old:
[root@test-machine ~]# yum info nvidia-container-toolkit
Loaded plugins: dkms-build-requires, extras_suggestions, kernel-livepatch, langpacks, priorities, update-motd, versionlock
Installed Packages
Name : nvidia-container-toolkit
Arch : x86_64
Version : 1.13.5
Release : 1
Size : 2.3 M
Repo : installed
From repo : libnvidia-container
Summary : NVIDIA Container Toolkit
URL : https://github.com/NVIDIA/nvidia-container-toolkit
License : Apache-2.0
Description : Provides tools and utilities to enable GPU support in containers.
I do not see any option in nvidia-ctk to configure the runtime for containerd:
[root@test-machine ~]# nvidia-ctk runtime configure --help
NAME:
NVIDIA Container Toolkit CLI runtime configure - Add a runtime to the specified container engine
USAGE:
NVIDIA Container Toolkit CLI runtime configure [command options] [arguments...]
OPTIONS:
--dry-run update the runtime configuration as required but don't write changes to disk (default: false)
--runtime value the target runtime engine. One of [crio, docker] (default: "docker")
--config value path to the config file for the target runtime
--nvidia-runtime-name value specify the name of the NVIDIA runtime that will be added (default: "nvidia")
--runtime-path value specify the path to the NVIDIA runtime executable (default: "nvidia-container-runtime")
--set-as-default set the specified runtime as the default runtime (default: false)
--help, -h show help (default: false)
As you can see, --runtime accepts only [crio, docker]; there is no containerd option.
I guess I have to try updating the runtime to the latest version.
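A sketch of that upgrade path, assuming the NVIDIA libnvidia-container yum repository is already configured on the DLAMI:

# Upgrade the toolkit, confirm the new CLI version, then re-run the containerd configuration step.
sudo yum install -y nvidia-container-toolkit
nvidia-ctk --version
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd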
@eselyavka, the DCGM exporter depends on the NVIDIA container runtime; please try to update the runtime configuration.
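For illustration, once the runtime is configured for containerd, /etc/containerd/config.toml should end up with an entry along these lines (a sketch of the commonly documented snippet, not captured from this host):

# default_runtime_name is only set to "nvidia" when the NVIDIA runtime is made the default.
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"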