fix: infinite loop in Device Plugin caused by stale SocketWatcher state
Description
This PR fixes a critical bug in the SocketWatcher that caused the device plugin to enter a tight restart loop if the socket file was deleted (e.g., by Kubelet) or if the plugin restarted for any reason.
The Problem:
The SocketWatcher maintains a map socketChans of active watchers. When a watcher goroutine exited (due to socket deletion or context cancellation), it closed its notification channel but failed to remove the entry from the map.
Consequently, when the PluginManager attempted to restart the plugin:
- It called WatchSocket with the same socket path.
- WatchSocket found the existing entry in the map and returned the already-closed channel.
- The plugin's main loop (Run) received from this closed channel immediately and exited.
- The loop repeated indefinitely, causing the plugin to restart every ~100ms (see the sketch below).
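For illustration, here is a minimal sketch of the stale-state pattern described above. The SocketWatcher shown is a simplified stand-in: apart from the socketChans field and the WatchSocket name, everything (mutex, channel types, the watch loop) is assumed rather than taken from the actual code.

```go
package socketwatcher

import (
	"context"
	"sync"
)

// Simplified stand-in for the real SocketWatcher.
type SocketWatcher struct {
	mu          sync.Mutex
	socketChans map[string]chan struct{} // socket path -> notification channel
}

// Buggy behaviour: the watcher goroutine closes its channel when it exits,
// but the map entry is never removed, so a later call for the same path
// gets the already-closed channel back and the caller returns immediately.
func (w *SocketWatcher) WatchSocket(ctx context.Context, socketPath string) <-chan struct{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	if ch, ok := w.socketChans[socketPath]; ok {
		return ch // stale: may already be closed
	}
	ch := make(chan struct{})
	w.socketChans[socketPath] = ch
	go func() {
		<-ctx.Done() // stand-in for "socket deleted or context cancelled"
		close(ch)    // entry in socketChans is left behind
	}()
	return ch
}
```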
Symptoms:
- Logs showing "starting device plugin for resource" and "registering with kubelet" repeating rapidly, with no intermediate errors.
- Pods failing with "UnexpectedAdmissionError: Allocate failed due to no healthy devices present". This occurred because the plugin was constantly churning, leaving no stable window for Kubelet to allocate devices.
The Fix:
Updated SocketWatcher so that the socket entry is deleted from the socketChans map when the watcher goroutine exits. Subsequent calls to WatchSocket therefore create a fresh watcher and channel.
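Roughly, the change looks like the following, reusing the simplified SocketWatcher type from the sketch above; the actual cleanup in this PR may be structured differently.

```go
func (w *SocketWatcher) WatchSocket(ctx context.Context, socketPath string) <-chan struct{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	if ch, ok := w.socketChans[socketPath]; ok {
		return ch
	}
	ch := make(chan struct{})
	w.socketChans[socketPath] = ch
	go func() {
		defer func() {
			// Remove the stale entry before signalling, so the next
			// WatchSocket call creates a fresh watcher and channel.
			w.mu.Lock()
			delete(w.socketChans, socketPath)
			w.mu.Unlock()
			close(ch)
		}()
		<-ctx.Done() // stand-in for the real watch loop
	}()
	return ch
}
```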
Testing:
Added a regression test TestWatchSocketCleanup in socketwatcher_test.go to verify that the map entry is cleaned up and new watchers can be established successfully.
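In outline, the regression test looks something like this, built against the simplified type above; the real test in socketwatcher_test.go differs in its details.

```go
package socketwatcher

import (
	"context"
	"testing"
)

func TestWatchSocketCleanup(t *testing.T) {
	w := &SocketWatcher{socketChans: map[string]chan struct{}{}}

	ctx, cancel := context.WithCancel(context.Background())
	ch := w.WatchSocket(ctx, "/tmp/test.sock")

	// Stop the first watcher and wait for its channel to close.
	cancel()
	<-ch

	// The map entry must be gone once the watcher goroutine has exited.
	w.mu.Lock()
	_, stale := w.socketChans["/tmp/test.sock"]
	w.mu.Unlock()
	if stale {
		t.Fatal("expected socketChans entry to be removed after the watcher exited")
	}

	// A subsequent call must hand back a fresh, still-open channel.
	ctx2, cancel2 := context.WithCancel(context.Background())
	defer cancel2()
	ch2 := w.WatchSocket(ctx2, "/tmp/test.sock")
	select {
	case <-ch2:
		t.Fatal("new channel was already closed")
	default:
	}
}
```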
- [x] uses conventional commit messages
- [ ] includes documentation
- [x] adds unit tests
- [ ] relevant PR labels added