dcgm-exporter
dcgm-exporter is not working on ec2 g5.48xlarge nodes
What is the version?
3.3.5-3.4.1-ubi9
What happened?
We are running dcgm-exporter under containerd on g5.48xlarge instances, and it fails to start with the following error:
[root@test-machine ~]# ctr run --env DCGM_EXPORTER_DEBUG=true --cap-add CAP_SYS_ADMIN --rm nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubi9 dcgm-exporter
2024/04/10 21:53:33 maxprocs: Leaving GOMAXPROCS=192: CPU quota undefined
time="2024-04-10T21:53:33Z" level=info msg="Starting dcgm-exporter"
time="2024-04-10T21:53:33Z" level=debug msg="Debug output is enabled"
time="2024-04-10T21:53:33Z" level=debug msg="Command line: /usr/bin/dcgm-exporter"
time="2024-04-10T21:53:33Z" level=debug msg="Loaded configuration" dump="&{CollectorsFile:/etc/dcgm-exporter/default-counters.csv Address::9400 CollectInterval:30000 Kubernetes:false KubernetesGPUIdType:uid CollectDCP:true UseOldNamespace:false UseRemoteHE:false RemoteHEInfo:localhost:5555 GPUDevices:{Flex:true MajorRange:[] MinorRange:[]} SwitchDevices:{Flex:true MajorRange:[] MinorRange:[]} CPUDevices:{Flex:true MajorRange:[] MinorRange:[]} NoHostname:false UseFakeGPUs:false ConfigMapData:none MetricGroups:[] WebSystemdSocket:false WebConfigFile: XIDCountWindowSize:300000 ReplaceBlanksInModelName:false Debug:true ClockEventsCountWindowSize:300000 EnableDCGMLog:false DCGMLogLevel:NONE PodResourcesKubeletSocket:/var/lib/kubelet/pod-resources/kubelet.sock}"
Error: Failed to initialize NVML
time="2024-04-10T21:53:33Z" level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1.1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:269 +0x3d\npanic({0x17dbac0?, 0x28fb390?})\n\t/usr/local/go/src/runtime/panic.go:914 +0x21f\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.initDCGM(0xc00026d1e0)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:509 +0x9b\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.startDCGMExporter(0x47c312?, 0xc00067a960)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:289 +0xb2\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:273 +0x5b\ngithub.com/NVIDIA/dcgm-exporter/pkg/stdout.Capture({0x1cbda38?, 0xc000638550}, 0xc00049fb70)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/stdout/capture.go:77 +0x1f5\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action(0xc000536a00)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:264 +0x67\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.NewApp.func1(0xc00062d080?)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:249 +0x13\ngithub.com/urfave/cli/v2.(*Command).Run(0xc00062d080, 0xc000536a00, {0xc0002a6050, 0x1, 0x1})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:279 +0x9dd\ngithub.com/urfave/cli/v2.(*App).RunContext(0xc00034d200, {0x1cbd920?, 0x29c12a0}, {0xc0002a6050, 0x1, 0x1})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:337 +0x5db\ngithub.com/urfave/cli/v2.(*App).Run(0xc00049ff20?, {0xc0002a6050?, 0x1?, 0x1616700?})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:311 +0x2f\nmain.main()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:35 +0x5f\n"
What did you expect to happen?
dcgm-exporter to start without any issues, as it does on a g5.2xlarge EC2 node, where we see the following log output:
2024/04/11 11:06:15 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
time="2024-04-11T11:06:15Z" level=info msg="Starting dcgm-exporter"
time="2024-04-11T11:06:15Z" level=info msg="DCGM successfully initialized!"
time="2024-04-11T11:06:15Z" level=info msg="Collecting DCP Metrics"
time="2024-04-11T11:06:15Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-04-11T11:06:15Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-11T11:06:15Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-11T11:06:15Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-11T11:06:15Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-11T11:06:15Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-11T11:06:15Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-04-11T11:06:15Z" level=info msg="Starting webserver"
time="2024-04-11T11:06:15Z" level=info msg="Pipeline starting"
time="2024-04-11T11:06:15Z" level=info msg="Listening on" address="[::]:9400"
time="2024-04-11T11:06:15Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
What is the GPU model?
[root@test-machine ~]# nvidia-smi
Wed Apr 10 21:55:27 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:16.0 N/A | N/A |
|ERR! ERR! ERR! N/A / N/A | 0MiB / 23028MiB | N/A Default |
| | | ERR! |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:17.0 Off | 0 |
| 0% 28C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:18.0 Off | 0 |
| 0% 27C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:19.0 Off | 0 |
| 0% 27C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A10G On | 00000000:00:1A.0 Off | 0 |
| 0% 27C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 28C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 27C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 28C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
[root@test-machine ~]#
What is the environment?
We spin up a Kubernetes cluster with the kops utility and run dcgm-exporter as a DaemonSet on GPU EC2 instances. The containerd version is:
[root@test-machine ~]# containerd --version
containerd github.com/containerd/containerd v1.6.21 3dce8eb055cbb6872793272b4f20ed16117344f8
[root@test-machine ~]#
The kubelet version is:
[root@test-machine ~]# kubelet --version
Kubernetes v1.24.13
How did you deploy the dcgm-exporter and what is the configuration?
We deploy dcgm-exporter with its Helm chart via Argo CD.
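For reference, a minimal sketch of an equivalent plain Helm install (repository URL and chart name as published in the dcgm-exporter README; any values overrides are omitted here):

# Add the NVIDIA dcgm-exporter chart repository and install the chart with a generated release name.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter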
How to reproduce the issue?
Try to run dcgm-exporter under containerd on a g5.48xlarge EC2 instance that uses the OSS DLAMI image:
ctr image pull nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubi9
ctr run --env DCGM_EXPORTER_DEBUG=true --cap-add CAP_SYS_ADMIN --rm nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubi9 dcgm-exporter
Anything else we need to know?
No response
@eselyavka, please make sure that containerd uses the NVIDIA runtime: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html.
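A minimal sketch of what that typically involves, assuming a toolkit release whose nvidia-ctk supports the containerd target (see the version discussion below):

# Register the NVIDIA runtime in /etc/containerd/config.toml, optionally make it the default,
# then restart containerd so the change takes effect.
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd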
@nvvfedorov, maybe the container runtime on those DLAMI images is just too old:
[root@test-machine ~]# yum info nvidia-container-toolkit
Loaded plugins: dkms-build-requires, extras_suggestions, kernel-livepatch, langpacks, priorities, update-motd, versionlock
Installed Packages
Name : nvidia-container-toolkit
Arch : x86_64
Version : 1.13.5
Release : 1
Size : 2.3 M
Repo : installed
From repo : libnvidia-container
Summary : NVIDIA Container Toolkit
URL : https://github.com/NVIDIA/nvidia-container-toolkit
License : Apache-2.0
Description : Provides tools and utilities to enable GPU support in containers.
I do not see any option in nvidia-ctk to configure the runtime for containerd:
[root@test-machine ~]# nvidia-ctk runtime configure --help
NAME:
NVIDIA Container Toolkit CLI runtime configure - Add a runtime to the specified container engine
USAGE:
NVIDIA Container Toolkit CLI runtime configure [command options] [arguments...]
OPTIONS:
--dry-run update the runtime configuration as required but don't write changes to disk (default: false)
--runtime value the target runtime engine. One of [crio, docker] (default: "docker")
--config value path to the config file for the target runtime
--nvidia-runtime-name value specify the name of the NVIDIA runtime that will be added (default: "nvidia")
--runtime-path value specify the path to the NVIDIA runtime executable (default: "nvidia-container-runtime")
--set-as-default set the specified runtime as the default runtime (default: false)
--help, -h show help (default: false)
As you can see, --runtime accepts only [crio, docker]; there is no containerd option.
I guess I have to try updating the runtime to the latest version.
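A sketch of that upgrade path, assuming the NVIDIA libnvidia-container yum repository is already configured on the DLAMI:

# Upgrade the toolkit, confirm the new CLI version, then re-run the containerd configuration step.
sudo yum install -y nvidia-container-toolkit
nvidia-ctk --version
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd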
@eselyavka, the DCGM exporter depends on the NVIDIA container runtime; please try to update the runtime configuration.
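For illustration, once the runtime is configured for containerd, /etc/containerd/config.toml should end up with an entry along these lines (a sketch of the commonly documented snippet, not captured from this host):

# default_runtime_name is only set to "nvidia" when the NVIDIA runtime is made the default.
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"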