Fix race condition in config-manager when label is unset
Summary
Fixes a race condition in the config-manager that causes it to hang indefinitely when the `nvidia.com/device-plugin.config` label is not set on the node.
Problem
When the node label is not configured, there's a timing-dependent race condition:
- If the Kubernetes informer's `AddFunc` fires before the first `Get()` call, it sets `current = ""` and broadcasts
- When `Get()` is subsequently called, it finds `lastRead == current` (both empty strings) and waits on the condition variable
- No future events wake it up since the label remains unset, causing a permanent hang
This manifests as the init container hanging after printing:
`Waiting for change to 'nvidia.com/device-plugin.config' label`
Solution
Added an `initialized` boolean flag to `SyncableConfig` to track whether `Get()` has been called at least once. The first `Get()` call now returns immediately with the current value, avoiding the deadlock. Subsequent `Get()` calls continue to wait properly when the value hasn't changed.
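A sketch of the fix under the same illustrative assumptions (the names below are not the actual patch): the first `Get()` short-circuits via an `initialized` flag and returns the current value immediately, while subsequent calls keep the original wait-for-change semantics.

```go
package main

import (
	"fmt"
	"sync"
)

// syncableString is an illustrative stand-in for SyncableConfig,
// with the initialized flag added.
type syncableString struct {
	cond        *sync.Cond
	current     string
	lastRead    string
	initialized bool
}

func newSyncableString() *syncableString {
	return &syncableString{cond: sync.NewCond(&sync.Mutex{})}
}

func (s *syncableString) Set(value string) {
	s.cond.L.Lock()
	defer s.cond.L.Unlock()
	s.current = value
	s.cond.Broadcast()
}

func (s *syncableString) Get() string {
	s.cond.L.Lock()
	defer s.cond.L.Unlock()
	if !s.initialized {
		// First call: return whatever is current (possibly ""),
		// instead of waiting for a change that may never come.
		s.initialized = true
		s.lastRead = s.current
		return s.current
	}
	for s.lastRead == s.current {
		s.cond.Wait()
	}
	s.lastRead = s.current
	return s.current
}

func main() {
	s := newSyncableString()
	s.Set("") // label unset when AddFunc fires

	fmt.Printf("first Get: %q\n", s.Get()) // returns "" immediately

	// A later label change still wakes a waiting Get as before.
	go func() { s.Set("config-a") }()
	fmt.Printf("second Get: %q\n", s.Get())
}
```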
Fixes #1540
@jgehrcke do we do something similar in the k8s-dra-driver? If so, how do we handle the initial synchronization there?
Hey @elezar, did you get a chance to have a look?
Gentle ping here @elezar: this is happening quite a lot and requires manual intervention each time.
@uristernik we are reviewing. Thanks for your patience.
Quick update here @elezar @jgehrcke: I am running a forked version with this commit and we are not seeing the issue reproduce. We are running between 200 and 550 GPU nodes at any given time.
@elezar any chance of getting this merged, or fixing it some other way? We are running the forked version, but we don't want to maintain the fork forever 🙏
@elezar ping 🙏
@klueska @ArangoGutierrez @cdesiniotis @RenaudWasTaken can someone please have a look at this?
Can anyone please review this?
/cherry-pick release-0.18
Thank you! @cdesiniotis done!
/ok to test ab05acef5a2773b55d577ffd105db36dc1ea3d28
🤖 Backport PR created for release-0.18: #1577 ✅
Thank you @cdesiniotis. Following the link that @elezar sent (https://github.com/NVIDIA/k8s-device-plugin/pull/1541#discussion_r2576920269), I think this fix is also needed in mig-manager; do you want me to open a pull request for that? And can you please have a look at https://github.com/NVIDIA/k8s-device-plugin/pull/1481? Today I have to patch the Helm chart manually to properly deploy the MPS daemonset.
Please let me know if I can help