node-problem-detector
app health monitoring
Has anyone thought of adding internal app metrics to show whether problem daemons are having any issues?
Following on from #1003, I have added a few internal log events in various places inside the kmsg watcher so that I can track how often watch loops are starting and watchers are being revived.
Simple things like adding
k.logCh <- &logtypes.Log{
    Message:   "[npd-internal] Entering watch loop",
    Timestamp: time.Now(),
}
when we start the watch loop, or
k.logCh <- &logtypes.Log{
    Message:   "[npd-internal] Reviving kmsg parser",
    Timestamp: time.Now(),
}
whenever we revive the kmsg parser from inside the watcher.
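For context, here is a rough, self-contained sketch of where those emissions could sit. The watcher and Log types and the revive flow below are simplified stand-ins for NPD's internals, not the actual implementation:

package kmsgwatch

import (
    "time"

    "github.com/euank/go-kmsg-parser/kmsgparser"
)

// Log mirrors the shape of the logtypes.Log used in the snippets above.
type Log struct {
    Timestamp time.Time
    Message   string
}

// watcher is a simplified stand-in for NPD's kmsg log watcher.
type watcher struct {
    logCh  chan *Log
    parser kmsgparser.Parser
}

// watchLoop announces itself on entry, forwards kernel messages, and
// announces each time the parser is revived after its channel closes.
func (w *watcher) watchLoop() {
    w.logCh <- &Log{Message: "[npd-internal] Entering watch loop", Timestamp: time.Now()}
    msgs := w.parser.Parse()
    for {
        msg, ok := <-msgs
        if !ok {
            // The parser stopped (e.g. a /dev/kmsg read error); recreate it.
            w.logCh <- &Log{Message: "[npd-internal] Reviving kmsg parser", Timestamp: time.Now()}
            p, err := kmsgparser.NewParser()
            if err != nil {
                return
            }
            w.parser = p
            msgs = w.parser.Parse()
            continue
        }
        w.logCh <- &Log{Message: msg.Message, Timestamp: msg.Timestamp}
    }
}

Paired with config like: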
{
    "plugin": "kmsg",
    "pluginConfig": {
        "revive": "true"
    },
    "logPath": "/dev/kmsg",
    "lookback": "5m",
    "bufferSize": 1000,
    "source": "kernel-monitor",
    "conditions": [
        ...
    ],
    "rules": [
        {
            "type": "temporary",
            "reason": "WatchLoopStarted",
            "pattern": "\\[npd-internal\\] Entering watch loop.*"
        },
        {
            "type": "temporary",
            "reason": "ParserRevived",
            "pattern": "\\[npd-internal\\] Reviving.*parser.*"
        },
        ...
    ]
}
we get Prometheus metrics (when the exporter is enabled, which is the default) that look like:
# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
...
problem_counter{reason="ParserRevived"} 1
...
problem_counter{reason="WatchLoopStarted"} 2
...
Example PR attached.
#1009 is another example of how app health monitoring can be improved: it bubbles up logs from the underlying parser so we can determine the cause of a channel closure, or of potential partial channel reads, as outlined in the downstream package.
We just add klog logging funcs to an internal logger that satisfies https://pkg.go.dev/github.com/euank/go-kmsg-parser@v2.0.0+incompatible/kmsgparser#Logger
so we can take advantage of log statements downstream: https://github.com/euank/go-kmsg-parser/blob/5ba4d492e455a77d25dcf0d2c4acc9f2afebef4e/kmsgparser/kmsgparser.go#L130-L143
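A minimal sketch of what such an adapter can look like (the klogAdapter name is made up; the klog calls are the standard package-level ones):

package kmsgwatch

import (
    "github.com/euank/go-kmsg-parser/kmsgparser"

    "k8s.io/klog/v2"
)

// klogAdapter forwards parser-level messages (e.g. /dev/kmsg read errors)
// to klog so they surface in NPD's own application logs.
type klogAdapter struct{}

func (klogAdapter) Infof(format string, args ...interface{})    { klog.Infof(format, args...) }
func (klogAdapter) Warningf(format string, args ...interface{}) { klog.Warningf(format, args...) }
func (klogAdapter) Errorf(format string, args ...interface{})   { klog.Errorf(format, args...) }

// Compile-time check that the adapter satisfies the downstream interface.
var _ kmsgparser.Logger = klogAdapter{}

The parser exposes a SetLogger method, so hooking this in is a one-liner once the parser is constructed.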
Example log that appears in the application with the inclusion of #1009:
logger.go:18] error reading /dev/kmsg: read /dev/kmsg: broken pipe
So this is to watch the health of NPD itself instead of other applications, right? I am curious how often you see problems with NPD. Do you run NPD as a DaemonSet, BTW? Do you mind putting the ideas in a design doc so we can discuss?
How about health monitoring of other log watchers and plugins?
So this is to watch the health of NPD itself instead of other applications, right?
Exactly, yep. We run NPD as a DaemonSet on 20,000+ nodes.
I can try to put together a design doc, sure!
Between the kmsg and journald watchers that we configure in our environment, I have this health monitoring in place today for both. It is easily added to other monitors.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale