Segmentation fault in dcgm-exporter when running with the default GPU Operator configuration on kind
What is the version?
3.3.8-3.6.0
What happened?
The dcgm-exporter pod crashed with the following log:
$ kubectl logs -n gpu-operator nvidia-dcgm-exporter-lq2b7
Defaulted container "nvidia-dcgm-exporter" out of: nvidia-dcgm-exporter, toolkit-validation (init)
2024/10/29 11:52:47 maxprocs: Leaving GOMAXPROCS=256: CPU quota undefined
time="2024-10-29T11:52:47Z" level=info msg="Starting dcgm-exporter"
time="2024-10-29T11:52:48Z" level=info msg="DCGM successfully initialized!"
time="2024-10-29T11:52:48Z" level=info msg="Collecting DCP Metrics"
time="2024-10-29T11:52:48Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-10-29T11:52:48Z" level=info msg="Initializing system entities of type: GPU"
SIGSEGV: segmentation violation
PC=0x7ff0fc58804d m=3 sigcode=1 addr=0x18
signal arrived during cgo execution
goroutine 1 gp=0xc0000061c0 m=3 mp=0xc0002c9008 [syscall]:
runtime.cgocall(0x16dce70, 0xc0008436f0)
/usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc0008436c8 sp=0xc000843690 pc=0x418ccb
github.com/NVIDIA/go-dcgm/pkg/dcgm._Cfunc_dcgmGetDeviceTopology(0x7fffffff, 0x0, 0xc0003f61c0)
_cgo_gotypes.go:1148 +0x4b fp=0xc0008436f0 sp=0xc0008436c8 pc=0x7e988b
github.com/NVIDIA/go-dcgm/pkg/dcgm.getDeviceTopology(0x0)
/go/pkg/mod/github.com/!n!v!i!d!i!a/[email protected]/pkg/dcgm/topology.go:103 +0x5d fp=0xc000843798 sp=0xc0008436f0 pc=0x7f105d
github.com/NVIDIA/go-dcgm/pkg/dcgm.getDeviceInfo(0x0)
/go/pkg/mod/github.com/!n!v!i!d!i!a/[email protected]/pkg/dcgm/device_info.go:220 +0x22a fp=0xc000843988 sp=0xc000843798 pc=0x7ed54a
github.com/NVIDIA/go-dcgm/pkg/dcgm.GetDeviceInfo(0x0?)
/go/pkg/mod/github.com/!n!v!i!d!i!a/[email protected]/pkg/dcgm/api.go:78 +0x65 fp=0xc000843b60 sp=0xc000843988 pc=0x7e8b25
github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter.InitializeGPUInfo({0x8, {{{0x0, {...}, {...}, 0x0, {...}, {...}, {...}, {...}}, {0x0, ...}, ...}, ...}, ...}, ...)
/go/src/github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter/system_info.go:372 +0x15f fp=0xc000860cb0 sp=0xc000843b60 pc=0x16cfb9f
github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter.InitializeSystemInfo({_, {_, _, _}, {_, _, _}}, {0x1, {0x0, 0x0, ...}, ...}, ...)
/go/src/github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter/system_info.go:445 +0x2bc fp=0xc0008690d0 sp=0xc000860cb0 pc=0x16d079c
github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter.GetSystemInfo(0xc00061b568, 0x1)
/go/src/github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter/gpu_collector.go:72 +0x1e8 fp=0xc00086f510 sp=0xc0008690d0 pc=0x16c3a68
github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter.(*FieldEntityGroupTypeSystemInfo).Load(0xc00038c410, 0x1)
/go/src/github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter/field_entity_group_system_info.go:73 +0x1ba fp=0xc000871838 sp=0xc00086f510 pc=0x16c321a
github.com/NVIDIA/dcgm-exporter/pkg/cmd.getFieldEntityGroupTypeSystemInfo(0xc000552db0, 0xc0007028c0)
/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:424 +0x2e7 fp=0xc000871918 sp=0xc000871838 pc=0x16d9ca7
github.com/NVIDIA/dcgm-exporter/pkg/cmd.startDCGMExporter(0xc0007ae340, 0xc00054d130)
/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:315 +0x15c fp=0xc000871a88 sp=0xc000871918 pc=0x16d8bdc
github.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1()
/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:287 +0x5b fp=0xc000871ad8 sp=0xc000871a88 pc=0x16d88fb
github.com/NVIDIA/dcgm-exporter/pkg/stdout.Capture({0x1dc9e30, 0xc0007ac000}, 0xc000807b90)
/go/src/github.com/NVIDIA/dcgm-exporter/pkg/stdout/capture.go:77 +0x1e6 fp=0xc000871b68 sp=0xc000871ad8 pc=0x16d6366
github.com/NVIDIA/dcgm-exporter/pkg/cmd.action(0xc0007ae340)
/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:278 +0x67 fp=0xc000871bc0 sp=0xc000871b68 pc=0x16d8867
github.com/NVIDIA/dcgm-exporter/pkg/cmd.NewApp.func1(0xc0007ae340?)
/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:263 +0x13 fp=0xc000871bd8 sp=0xc000871bc0 pc=0x16dbf33
github.com/urfave/cli/v2.(*Command).Run(0xc00077fce0, 0xc0007ae340, {0xc000040090, 0x1, 0x1})
/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:279 +0x97d fp=0xc000871e60 sp=0xc000871bd8 pc=0x80f9dd
github.com/urfave/cli/v2.(*App).RunContext(0xc000720400, {0x1dc9d18, 0x2b8ffe0}, {0xc000040090, 0x1, 0x1})
/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:337 +0x58b fp=0xc000871ec0 sp=0xc000871e60 pc=0x80c26b
github.com/urfave/cli/v2.(*App).Run(0xc00002df30?, {0xc000040090?, 0x1?, 0x16dd1e0?})
/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:311 +0x2f fp=0xc000871f00 sp=0xc000871ec0 pc=0x80bc8f
main.main()
/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:35 +0x5f fp=0xc000871f50 sp=0xc000871f00 pc=0x16dc1df
runtime.main()
/usr/local/go/src/runtime/proc.go:271 +0x29d fp=0xc000871fe0 sp=0xc000871f50 pc=0x450edd
runtime.goexit({})
/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000871fe8 sp=0xc000871fe0 pc=0x483f61
...
The following line looks especially suspicious:
github.com/NVIDIA/go-dcgm/pkg/dcgm._Cfunc_dcgmGetDeviceTopology(0x7fffffff, 0x0, 0xc0003f61c0)
The first parameter to this function is meant to be the C-based handle retrieved from calling C.dcgmStartEmbedded() or C.dcgmConnect_v2(). A value of 0x7fffffff here seems wrong.
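For context, as far as I can tell from go-dcgm, that handle comes from whichever connection mode dcgm.Init() is called with: dcgm.Embedded wraps C.dcgmStartEmbedded() and starts the hostengine in-process, while dcgm.Standalone wraps C.dcgmConnect_v2() against an external nv-hostengine. A rough sketch of the two paths (the "localhost:5555" address is only an illustration, not the exporter's actual configuration):

package main

import (
	"log"

	"github.com/NVIDIA/go-dcgm/pkg/dcgm"
)

func main() {
	// Embedded mode: the hostengine runs inside this process and the handle
	// returned by C.dcgmStartEmbedded() backs every later call, including
	// the topology query shown in the trace above.
	cleanup, err := dcgm.Init(dcgm.Embedded)
	if err != nil {
		log.Fatalf("embedded init failed: %v", err)
	}
	defer cleanup()

	// Standalone mode (alternative): connect to a separately running
	// nv-hostengine via C.dcgmConnect_v2(), e.g.:
	//   cleanup, err := dcgm.Init(dcgm.Standalone, "localhost:5555", "0")
}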
What did you expect to happen?
The DCGM exporter runs without crashing
What is the GPU model?
A100
What is the environment?
A kind cluster on a DGX-A100 node with the latest GPU Operator deployed
How did you deploy the dcgm-exporter and what is the configuration?
With the latest GPU Operator
How to reproduce the issue?
No response
Anything else we need to know?
No response
How did you deploy the dcgm-exporter? The dcgm-exporter depends on the DCGM and CUDA libraries.
The handle value does seem wrong.
I have not been able to reproduce the crash. This function has not changed in years and is called for every GPU on startup, so it doesn't seem likely that there is a bug in this part of the code. It's more likely there's something specific to this environment.
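If it helps narrow things down, here is a minimal sketch (adapted from the go-dcgm samples, untested on your setup) that exercises the same startup path as the exporter: an embedded Init followed by GetDeviceInfo for every supported GPU. Running it directly on the node or inside the dcgm-exporter image should show whether the fault follows the environment or the exporter. It assumes embedded mode; if your deployment points the exporter at a standalone nv-hostengine, Init would need dcgm.Standalone and the engine's address instead.

package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-dcgm/pkg/dcgm"
)

func main() {
	// Embedded mode: start the hostengine inside this process.
	cleanup, err := dcgm.Init(dcgm.Embedded)
	if err != nil {
		log.Fatalf("failed to initialize DCGM: %v", err)
	}
	defer cleanup()

	gpus, err := dcgm.GetSupportedDevices()
	if err != nil {
		log.Fatalf("failed to enumerate GPUs: %v", err)
	}

	for _, gpu := range gpus {
		// GetDeviceInfo is the call that faulted in the trace above
		// (it queries the device topology internally).
		info, err := dcgm.GetDeviceInfo(gpu)
		if err != nil {
			log.Fatalf("GetDeviceInfo(%d) failed: %v", gpu, err)
		}
		fmt.Printf("GPU %d: %+v\n", gpu, info)
	}
}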