Fix race condition in config-manager when label is unset
Summary
Fixes a race condition in the config-manager that causes it to hang indefinitely when the `nvidia.com/device-plugin.config` label is not set on the node.
Problem
When the node label is not configured, there's a timing-dependent race condition:
- If the Kubernetes informer's `AddFunc` fires before the first `Get()` call, it sets `current = ""` and broadcasts
- When `Get()` is subsequently called, it finds `lastRead == current` (both empty strings) and waits on the condition variable
- No future events wake it up since the label remains unset, causing a permanent hang
This manifests as the init container hanging after printing:
`Waiting for change to 'nvidia.com/device-plugin.config' label`
Solution
Added an `initialized` boolean flag to `SyncableConfig` to track whether `Get()` has been called at least once. The first `Get()` call now returns immediately with the current value, avoiding the deadlock. Subsequent `Get()` calls continue to wait properly when the value hasn't changed.
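A sketch of the fix under the same illustrative assumptions (the names below are not the actual patch): the first `Get()` short-circuits via an `initialized` flag and returns the current value immediately, while subsequent calls keep the original wait-for-change semantics.

```go
package main

import (
	"fmt"
	"sync"
)

// syncableString is an illustrative stand-in for SyncableConfig,
// with the initialized flag added.
type syncableString struct {
	cond        *sync.Cond
	current     string
	lastRead    string
	initialized bool
}

func newSyncableString() *syncableString {
	return &syncableString{cond: sync.NewCond(&sync.Mutex{})}
}

func (s *syncableString) Set(value string) {
	s.cond.L.Lock()
	defer s.cond.L.Unlock()
	s.current = value
	s.cond.Broadcast()
}

func (s *syncableString) Get() string {
	s.cond.L.Lock()
	defer s.cond.L.Unlock()
	if !s.initialized {
		// First call: return whatever is current (possibly ""),
		// instead of waiting for a change that may never come.
		s.initialized = true
		s.lastRead = s.current
		return s.current
	}
	for s.lastRead == s.current {
		s.cond.Wait()
	}
	s.lastRead = s.current
	return s.current
}

func main() {
	s := newSyncableString()
	s.Set("") // label unset when AddFunc fires

	fmt.Printf("first Get: %q\n", s.Get()) // returns "" immediately

	// A later label change still wakes a waiting Get as before.
	go func() { s.Set("config-a") }()
	fmt.Printf("second Get: %q\n", s.Get())
}
```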
Fixes #1540
@jgehrcke do we do something similar in the k8s-dra-driver? If so, how do we handle the initial synchronization there?
Hey @elezar, did you get a chance to have a look?
Gentle ping here @elezar: this is happening quite a lot and requires manual intervention each time.
@uristernik we are reviewing. Thanks for your patience.
Quick update here @elezar @jgehrcke: I am running a forked version with this commit and we are not seeing the issue reproduce. We are running between 200 and 550 GPU nodes at any given time.
@elezar any chance of getting this merged, or fixing it some other way? We are running the forked version, but we don't want to maintain the fork forever 🙏
@elezar ping 🙏
@klueska @ArangoGutierrez @cdesiniotis @RenaudWasTaken can someone please have a look at this?
Can anyone please review this?
/cherry-pick release-0.18
Thank you! @cdesiniotis done!
/ok to test ab05acef5a2773b55d577ffd105db36dc1ea3d28
🤖 Backport PR created for release-0.18: #1577 ✅
Thank you @cdesiniotis. Following the link that @elezar sent (https://github.com/NVIDIA/k8s-device-plugin/pull/1541#discussion_r2576920269), I think this fix is also needed in mig-manager; do you want me to open a pull request for that? And can you please have a look at https://github.com/NVIDIA/k8s-device-plugin/pull/1481? Today I have to patch the Helm chart manually to properly deploy the MPS daemonset.
Please let me know if I can help