
app health monitoring

Open daveoy opened this issue 11 months ago • 5 comments

Has anyone thought of adding internal app metrics to show whether the problem daemons are having any issues?

Following on from #1003, I have added a few internal log events in various places inside the kmsg watcher so that I can track how often watch loops are starting and watchers are being revived.

Simple things like adding

	k.logCh <- &logtypes.Log{
		Message:   "[npd-internal] Entering watch loop",
		Timestamp: time.Now(),
	}

when we start the watch loop, or

	k.logCh <- &logtypes.Log{
		Message:   "[npd-internal] Reviving kmsg parser",
		Timestamp: time.Now(),
	}

whenever we revive the kmsg parser from inside the watcher. Paired with config like:

{
  "plugin": "kmsg",
  "pluginConfig": {
    "revive": "true"
  },
  "logPath": "/dev/kmsg",
  "lookback": "5m",
  "bufferSize": 1000,
  "source": "kernel-monitor",
  "conditions": [
   ...
   ...
   ...
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "WatchLoopStarted",
      "pattern": "\\[npd-internal\\] Entering watch loop.*"
    },
    {
      "type": "temporary",
      "reason": "ParserRevived",
      "pattern": "\\[npd-internal\\] Reviving.*parser.*"
    },
   ...
   ...
   ...
  ]
}

We get Prometheus metrics (when the exporter is enabled, which is the default) that look like:

# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
   ...
   ...
   ...
problem_counter{reason="ParserRevived"} 1
   ...
   ...
   ...
problem_counter{reason="WatchLoopStarted"} 2
   ...
   ...
   ...

daveoy · Jan 09 '25

Example PR attached.

daveoy · Jan 09 '25

#1009 is another example of how app health monitoring can be improved: it bubbles up logs from the underlying parser so we can determine the cause of a channel closure, or of the potential partial channel reads outlined in the downstream package.

We just add klog logging funcs to an internal logger that satisfies this interface: https://pkg.go.dev/github.com/euank/go-kmsg-parser@v2.0.0+incompatible/kmsgparser#Logger

so we can take advantage of log statements downstream: https://github.com/euank/go-kmsg-parser/blob/5ba4d492e455a77d25dcf0d2c4acc9f2afebef4e/kmsgparser/kmsgparser.go#L130-L143
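For concreteness, a minimal sketch of that adapter, assuming k8s.io/klog/v2 as the backing logger; the package and type names here (kmsglog, klogAdapter) are made up for illustration and are not taken from the NPD codebase or the linked PR:

// Hypothetical adapter, not lifted verbatim from #1009: it satisfies the
// kmsgparser.Logger interface linked above by forwarding the parser's
// internal messages (read errors, partial reads, malformed lines) to klog,
// so they show up in NPD's own logs.
package kmsglog

import (
	"k8s.io/klog/v2"

	"github.com/euank/go-kmsg-parser/kmsgparser"
)

type klogAdapter struct{}

// Compile-time check that the adapter satisfies kmsgparser.Logger.
var _ kmsgparser.Logger = klogAdapter{}

func (klogAdapter) Infof(format string, args ...interface{})    { klog.Infof(format, args...) }
func (klogAdapter) Warningf(format string, args ...interface{}) { klog.Warningf(format, args...) }
func (klogAdapter) Errorf(format string, args ...interface{})   { klog.Errorf(format, args...) }

If I'm reading the package docs right, the Parser interface there also has a SetLogger hook, so the watcher only needs to hand it an instance of this adapter when it creates the parser.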

daveoy · Jan 10 '25

Example log line that appears in the application with the inclusion of #1009:

logger.go:18] error reading /dev/kmsg: read /dev/kmsg: broken pipe

daveoy · Jan 10 '25

So this is to watch the health of NPD itself instead of other applications, right? I am curious how often you see problems with NPD. Do you run NPD as a DaemonSet, BTW? Do you mind putting the ideas in a design doc so we can discuss?

How about health monitoring of the other log watchers and plugins?

wangzhen127 · Mar 11 '25

So this is to watch the health of NPD itself instead of other applications, right?

Exactly, yep. We run NPD as a DaemonSet on 20,000+ nodes.

I can try to put together a design doc, sure!

Between the kmsg and journald watchers that we configure in our environment, I have this health monitoring in place today for both. It is easily added to other monitors.
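To make "easily added" concrete, here is a hypothetical helper (the name, package, and import path are mine, not from the NPD codebase) that just generalizes the kmsg snippets at the top of this issue:

// Hypothetical helper, not part of the NPD codebase: any log watcher that
// owns a log channel could emit "[npd-internal]" markers this way and have
// them matched by rules like WatchLoopStarted / ParserRevived above.
package internalhealth

import (
	"time"

	// Assumed import path for the logtypes alias used in the kmsg snippets.
	logtypes "k8s.io/node-problem-detector/pkg/systemlogmonitor/types"
)

// EmitInternal pushes an internal health marker into a watcher's log channel
// so it can be picked up by an "[npd-internal]" rule and exported as a
// problem_counter metric.
func EmitInternal(logCh chan<- *logtypes.Log, msg string) {
	logCh <- &logtypes.Log{
		Message:   "[npd-internal] " + msg,
		Timestamp: time.Now(),
	}
}

Each monitor would then call EmitInternal(logCh, "Entering watch loop"), EmitInternal(logCh, "Reviving kmsg parser"), and so on at whatever points it wants to report on.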

daveoy · Mar 11 '25

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jun 09 '25

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Jul 09 '25

/remove-lifecycle rotten

cprivitere · Aug 04 '25

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Nov 02 '25