
Device driver panics randomly with unknown error

Open olemarkus opened this issue 3 years ago • 3 comments

1. Issue or feature description

Somewhere between 1 in 10 and 1 in 2 times, the device plugin crashes with:

2022/10/19 18:51:07 Retreiving plugins.
panic: Unable to load resource managers to manage plugin devices: error building device map: error building device map from config.resources: error building GPU device map: error getting device handle for index '0': Unknown Error

goroutine 1 [running]:
main.(*migStrategyNone).GetPlugins(0xc00011c278)
        /build/cmd/nvidia-device-plugin/mig-strategy.go:57 +0x1a5
main.startPlugins(0xc0000e3c58?, {0xc0001ce460, 0x9, 0xe}, 0x9?)
        /build/cmd/nvidia-device-plugin/main.go:247 +0x4bd
main.start(0x10d7b20?, {0xc0001ce460, 0x9, 0xe})
        /build/cmd/nvidia-device-plugin/main.go:147 +0x355
main.main.func1(0xc0001ce460?)
        /build/cmd/nvidia-device-plugin/main.go:43 +0x32
github.com/urfave/cli/v2.(*App).RunContext(0xc0001da9c0, {0xca9328?, 0xc00003a050}, {0xc000032060, 0x2, 0x2})
        /build/vendor/github.com/urfave/cli/v2/app.go:322 +0x953
github.com/urfave/cli/v2.(*App).Run(...)
        /build/vendor/github.com/urfave/cli/v2/app.go:224
main.main()
        /build/cmd/nvidia-device-plugin/main.go:91 +0x665

When this happens, the plugin errors out a couple of times and then hangs without reporting any GPUs.

2. Steps to reproduce the issue

Start enough AWS EC2 GPU instances in a Kubernetes cluster.

olemarkus avatar Oct 19 '22 20:10 olemarkus

@olemarkus when a node goes into this state, is the GPU visible on the node? This can be confirmed by running nvidia-smi on the node (possibly after chrooting into the driver root if the driver container is being used).

elezar avatar Oct 20 '22 08:10 elezar

@elezar Hey, I am also experiencing this issue in the same environment. Checking on the node itself, the GPU is visible and active.

Mberga14 avatar Oct 20 '22 09:10 Mberga14

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 28 '24 04:02 github-actions[bot]