NVIDIA device plugin panics randomly with "Unknown Error"
1. Issue or feature description
Somewhere between 1 in 10 and 1 in 2 node launches, the device plugin crashes with:
2022/10/19 18:51:07 Retreiving plugins.
panic: Unable to load resource managers to manage plugin devices: error building device map: error building device map from config.resources: error building GPU device map: error getting device handle for index '0': Unknown Error
goroutine 1 [running]:
main.(*migStrategyNone).GetPlugins(0xc00011c278)
/build/cmd/nvidia-device-plugin/mig-strategy.go:57 +0x1a5
main.startPlugins(0xc0000e3c58?, {0xc0001ce460, 0x9, 0xe}, 0x9?)
/build/cmd/nvidia-device-plugin/main.go:247 +0x4bd
main.start(0x10d7b20?, {0xc0001ce460, 0x9, 0xe})
/build/cmd/nvidia-device-plugin/main.go:147 +0x355
main.main.func1(0xc0001ce460?)
/build/cmd/nvidia-device-plugin/main.go:43 +0x32
github.com/urfave/cli/v2.(*App).RunContext(0xc0001da9c0, {0xca9328?, 0xc00003a050}, {0xc000032060, 0x2, 0x2})
/build/vendor/github.com/urfave/cli/v2/app.go:322 +0x953
github.com/urfave/cli/v2.(*App).Run(...)
/build/vendor/github.com/urfave/cli/v2/app.go:224
main.main()
/build/cmd/nvidia-device-plugin/main.go:91 +0x665
When this happens, the plugin errors out a couple of times and then hangs without reporting any GPUs.
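For anyone trying to narrow this down outside the plugin, below is a minimal Go sketch (not the plugin's actual code) that exercises the same kind of NVML lookup, written against the github.com/NVIDIA/go-nvml bindings; "Unknown Error" is the string NVML returns for its generic ERROR_UNKNOWN code. If this small program also fails on an affected node, the problem is likely in the driver/NVML layer rather than in the plugin itself.

```go
package main

import (
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	// Initialize NVML; a broken driver install usually fails here first.
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("failed to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("failed to get device count: %v", nvml.ErrorString(ret))
	}
	log.Printf("NVML reports %d device(s)", count)

	for i := 0; i < count; i++ {
		// The same kind of lookup that surfaces as
		// "error getting device handle for index '0': Unknown Error" in the panic above.
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			log.Fatalf("error getting device handle for index %d: %v", i, nvml.ErrorString(ret))
		}
		uuid, ret := device.GetUUID()
		if ret != nvml.SUCCESS {
			log.Fatalf("error getting UUID for device %d: %v", i, nvml.ErrorString(ret))
		}
		log.Printf("device %d: %s", i, uuid)
	}
}
```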
2. Steps to reproduce the issue
Start enough GPU-backed AWS EC2 instances in a Kubernetes cluster.
@olemarkus when a node goes into this state, is the GPU visible on the node? This can be confirmed by running nvidia-smi on the node (possibly after chrooting into the driver root if the driver container is being used).
@elezar Hey, I am also experiencing this issue in the same environment. Checking on the node itself, the GPU is visible and active.
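For reference, here is a small Go sketch of the check described above so it can be dropped onto affected nodes; the /run/nvidia/driver path is only an assumption based on a typical driver-container setup and may differ in your deployment.

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// Assumed driver-container root filesystem; adjust for your deployment.
	driverRoot := "/run/nvidia/driver"

	var cmd *exec.Cmd
	if _, err := os.Stat(driverRoot); err == nil {
		// Driver container in use: run nvidia-smi inside its root filesystem.
		cmd = exec.Command("chroot", driverRoot, "nvidia-smi")
	} else {
		// Driver installed directly on the host.
		cmd = exec.Command("nvidia-smi")
	}
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```

Run it as root on the node; if nvidia-smi lists the GPU here but the plugin still panics, that matches the behaviour reported above.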