revive kmsg channel if closed

Open daveoy opened this issue 11 months ago • 0 comments

I would like to start a conversation about reviving closed kmsg channels.

https://github.com/kubernetes/node-problem-detector/pull/1004 supplies a recovery mechanism which, when configured in the plugin's config, will revive a closed kmsg channel.

I have observed that the channel to /dev/kmsg can be closed unexpectedly in a couple of scenarios:

  • when a daemonset rollout is being performed on a node under high load, and
  • during regular operation when a node is experiencing extremely high load

#1004 could probably be improved, if anyone has better ideas. What I'm trying to avoid is re-initializing the conditions set by the watcher.
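
To make the discussion concrete, here is a minimal sketch of the general idea: reopen the message channel when it closes, without touching any state held by the caller. This is not the code from #1004 and does not use node-problem-detector's actual types. The names `source`, `watch`, `fakeSource`, and the `maxRevivals` limit are all hypothetical, and the fake source stands in for whatever opens /dev/kmsg.

```go
package main

import "fmt"

// source is a hypothetical factory that opens a fresh message channel.
// In a real watcher this would wrap whatever opens /dev/kmsg.
type source func() (<-chan string, error)

// watch passes every message to handle. When the channel closes, it asks the
// source for a new one, up to maxRevivals times. It never resets state owned
// by the caller, so anything handle has accumulated survives a revival.
func watch(open source, maxRevivals int, handle func(string)) (int, error) {
	ch, err := open()
	if err != nil {
		return 0, err
	}
	revivals := 0
	for {
		// range exits only when ch is closed.
		for msg := range ch {
			handle(msg)
		}
		if revivals >= maxRevivals {
			return revivals, fmt.Errorf("channel closed after %d revivals", revivals)
		}
		if ch, err = open(); err != nil {
			return revivals, err
		}
		revivals++
	}
}

// fakeSource simulates a channel that closes after each batch of messages.
func fakeSource(batches [][]string) source {
	i := 0
	return func() (<-chan string, error) {
		var batch []string
		if i < len(batches) {
			batch = batches[i]
		}
		i++
		ch := make(chan string, len(batch))
		for _, m := range batch {
			ch <- m
		}
		close(ch)
		return ch, nil
	}
}

func main() {
	seen := 0 // caller-owned state, kept across revivals
	n, err := watch(fakeSource([][]string{{"a", "b"}, {"c"}}), 1, func(string) { seen++ })
	fmt.Println(n, seen, err)
}
```

A real implementation would probably want a backoff between reopen attempts so that a node under heavy load is not hammered with retries. The sketch leaves that out to stay short.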

daveoy, Jan 09 '25 19:01