
Fix race condition in config-manager when label is unset

Open · uristernik opened this pull request 1 month ago · 6 comments

Summary

Fixes a race condition in the config-manager that causes it to hang indefinitely when the nvidia.com/device-plugin.config label is not set on the node.

Problem

When the node label is not configured, there's a timing-dependent race condition:

  1. If the Kubernetes informer's AddFunc fires before the first Get() call, it sets current="" and broadcasts
  2. When Get() is subsequently called, it finds lastRead == current (both empty strings) and waits on the condition variable
  3. No future events wake it up since the label remains unset, causing a permanent hang

This manifests as the init container hanging after printing:

Waiting for change to 'nvidia.com/device-plugin.config' label
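
For illustration, here is a minimal Go sketch of the problematic wait logic. The struct layout, constructor, and informer wiring are assumptions based only on the description above, not a verbatim copy of the config-manager source:

```go
package config

import "sync"

// SyncableConfig mirrors the type described above; the real field
// layout in the config-manager may differ.
type SyncableConfig struct {
	cond     *sync.Cond
	current  string
	lastRead string
}

func NewSyncableConfig() *SyncableConfig {
	return &SyncableConfig{cond: sync.NewCond(&sync.Mutex{})}
}

// Set is called from the informer's AddFunc/UpdateFunc handlers with
// the current value of the nvidia.com/device-plugin.config label.
func (m *SyncableConfig) Set(value string) {
	m.cond.L.Lock()
	defer m.cond.L.Unlock()
	m.current = value // AddFunc on a node without the label stores ""
	m.cond.Broadcast()
}

// Get is meant to block until the value changes since the last read.
func (m *SyncableConfig) Get() string {
	m.cond.L.Lock()
	defer m.cond.L.Unlock()
	// Race: if AddFunc already fired with an unset label, then
	// lastRead == current == "" on the very first call, so Wait()
	// blocks here and nothing ever broadcasts again.
	if m.lastRead == m.current {
		m.cond.Wait()
	}
	m.lastRead = m.current
	return m.lastRead
}
```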

Solution

Added an initialized boolean flag to SyncableConfig to track whether Get() has been called at least once. The first Get() call now returns immediately with the current value, avoiding the deadlock. Subsequent Get() calls continue to wait properly when the value hasn't changed.
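
Sketched against the same illustrative type, the idea is to add an initialized bool field to SyncableConfig and let Get() return immediately the first time (a sketch of the approach, not the exact diff):

```go
// Get returns the current value immediately on the first call and only
// waits on later calls, so an early AddFunc with an unset label can no
// longer park the init container forever.
func (m *SyncableConfig) Get() string {
	m.cond.L.Lock()
	defer m.cond.L.Unlock()
	if m.initialized && m.lastRead == m.current {
		m.cond.Wait()
	}
	m.initialized = true
	m.lastRead = m.current
	return m.lastRead
}
```

On the first call this returns whatever value is currently stored (possibly the empty string for an unset label), so the init container can proceed; later calls still block until the informer broadcasts a change.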

Fixes #1540

uristernik · Nov 30 '25 13:11

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] · Nov 30 '25 13:11

@jgehrcke do we do something similar in the k8s-dra-driver? If so, how do we handle the initial synchronization there?

elezar · Dec 01 '25 12:12

Hey @elezar, did you get a chance to have a look?

uristernik · Dec 02 '25 20:12

Gentle ping here @elezar, this is happening quite a lot and requires manual intervention each time.

uristernik · Dec 03 '25 14:12

@uristernik we are reviewing. Thanks for your patience.

elezar · Dec 05 '25 09:12

Quick update here @elezar @jgehrcke: I am running a forked version with this commit and we are not seeing the issue reproduce. We are running anywhere between 200 and 550 GPU nodes at any given time. (Screenshot: CleanShot 2025-12-14 at 08 37 08@2x)

uristernik · Dec 14 '25 06:12

@elezar any chance of getting this merged, or fixing it some other way? We are running the forked version, but we don't want to manage the fork forever 🙏

uristernik · Dec 21 '25 09:12

@elezar ping 🙏

uristernik · Dec 28 '25 14:12

@klueska @ArangoGutierrez @cdesiniotis @RenaudWasTaken can someone please have a look at this?

uristernik · Dec 31 '25 10:12

Can anyone please review this?

uristernik · Jan 05 '26 13:01

/cherry-pick release-0.18

cdesiniotis · Jan 06 '26 18:01

Thank you! @cdesiniotis done!

uristernik · Jan 07 '26 07:01

/ok to test ab05acef5a2773b55d577ffd105db36dc1ea3d28

cdesiniotis · Jan 07 '26 19:01

🤖 Backport PR created for release-0.18: #1577 ✅

github-actions[bot] · Jan 07 '26 21:01

Thank you @cdesiniotis. Following the link that @elezar sent (https://github.com/NVIDIA/k8s-device-plugin/pull/1541#discussion_r2576920269), I think this fix is also needed in mig-manager; do you want me to open a pull request for that? And can you please have a look at https://github.com/NVIDIA/k8s-device-plugin/pull/1481? Today I have to patch the helm chart manually to properly deploy the MPS daemonset.

Please let me know if I can help

uristernik · Jan 08 '26 06:01