node-problem-detector
Clearing Old Conditions
We have observed that when we change our system log monitor configuration to omit a previously watched condition, the condition persists on the node object.
I have added a bool flag --delete-deprecated-conditions and a string-slice flag --deprecated-condition-types, plus a handler in the k8sexporter that deletes the listed conditions from the node object on exporter initialization.
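For illustration, here is a minimal sketch of how such flags could be wired up, assuming spf13/pflag for flag parsing; the struct and method names are hypothetical, not the actual PR code:

```go
// Hypothetical sketch only; the struct and method names are illustrative.
package options

import "github.com/spf13/pflag"

// deprecatedConditionOptions holds the two proposed flags.
type deprecatedConditionOptions struct {
	DeleteDeprecatedConditions bool
	DeprecatedConditionTypes   []string
}

// addFlags registers the proposed flags on the given FlagSet.
func (o *deprecatedConditionOptions) addFlags(fs *pflag.FlagSet) {
	fs.BoolVar(&o.DeleteDeprecatedConditions, "delete-deprecated-conditions", false,
		"Delete the condition types listed in --deprecated-condition-types from the node on startup.")
	fs.StringSliceVar(&o.DeprecatedConditionTypes, "deprecated-condition-types", nil,
		"Condition types no longer produced by any monitor that should be removed from the node.")
}
```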
Would this community be interested in a PR that supplies this feature?
Here's example output; I'll add a PR shortly.
I0204 17:20:21.902108 7 problem_client.go:127] Deleting deprecated conditions [GPUMMUErrorXid31 JournaldCGroupOOMKilling JournaldGPUApplicationError JournaldGPUECCUncorrectableError JournaldGPUFallenOffBus JournaldGPUFault JournaldGPUGSPTimeoutXid119 JournaldGPUInvalidPushBuffer JournaldGPURowRemapFailure JournaldGPUWantsReset JournaldHardwareErrorCorrected JournaldHardwareErrorFatal JournaldHardwareErrorInfo JournaldHardwareErrorInterruptCPU JournaldHardwareErrorInterruptMemory JournaldHardwareErrorInterruptPCIe JournaldHardwareErrorInterruptUnknown JournaldHardwareErrorRecoverable JournaldKernelDeadlock JournaldKernelFailedToGetEntry JournaldKernelFailedToGetNextEntry JournaldKernelHardlock JournaldKernelOops JournaldKernelWatchLoopStarted JournaldLocalDiskErrors JournaldNFSStorageFault JournaldNVSwitchFailure JournaldNVSXidNonFatal JournaldPCIAER JournaldPersistentStorageFault JournaldReadonlyFilesystem JournaldSystemOOMKilling JournaldTaskHung JournaldUnregisterNetDevice] (if present)...
I0204 17:20:21.911443 7 problem_client.go:140] Deleting deprecated condition JournaldGPUApplicationError
I0204 17:20:21.911461 7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorInterruptUnknown
I0204 17:20:21.911465 7 problem_client.go:140] Deleting deprecated condition JournaldGPUFallenOffBus
I0204 17:20:21.911467 7 problem_client.go:140] Deleting deprecated condition JournaldPersistentStorageFault
I0204 17:20:21.911470 7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorInterruptMemory
I0204 17:20:21.911472 7 problem_client.go:140] Deleting deprecated condition JournaldReadonlyFilesystem
I0204 17:20:21.911474 7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorFatal
I0204 17:20:21.911477 7 problem_client.go:140] Deleting deprecated condition JournaldKernelDeadlock
I0204 17:20:21.911480 7 problem_client.go:140] Deleting deprecated condition JournaldLocalDiskErrors
I0204 17:20:21.911483 7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorInterruptPCIe
I0204 17:20:21.911486 7 problem_client.go:140] Deleting deprecated condition JournaldKernelHardlock
I0204 17:20:21.911490 7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorInterruptCPU
I0204 17:20:21.911492 7 problem_client.go:140] Deleting deprecated condition JournaldGPURowRemapFailure
I0204 17:20:21.911494 7 problem_client.go:140] Deleting deprecated condition JournaldGPUECCUncorrectableError
I0204 17:20:21.911496 7 problem_client.go:140] Deleting deprecated condition JournaldGPUWantsReset
I0204 17:20:21.911498 7 problem_client.go:140] Deleting deprecated condition JournaldGPUFault
I0204 17:20:21.911500 7 problem_client.go:140] Deleting deprecated condition JournaldGPUGSPTimeoutXid119
I0204 17:20:21.911502 7 problem_client.go:140] Deleting deprecated condition JournaldGPUInvalidPushBuffer
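For context, here is a minimal sketch of the kind of cleanup a handler like this could perform on exporter initialization, assuming client-go; the package, function name, and logging are hypothetical rather than the actual PR implementation:

```go
// Hypothetical sketch of the cleanup step, assuming client-go; the package,
// function name, and logging are illustrative, not the actual PR code.
package problemclient

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteDeprecatedConditions removes the listed condition types from the
// node's status and leaves every other condition untouched.
func deleteDeprecatedConditions(ctx context.Context, client kubernetes.Interface, nodeName string, deprecated []string) error {
	drop := make(map[v1.NodeConditionType]bool, len(deprecated))
	for _, t := range deprecated {
		drop[v1.NodeConditionType(t)] = true
	}

	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("getting node %q: %w", nodeName, err)
	}

	kept := node.Status.Conditions[:0]
	for _, cond := range node.Status.Conditions {
		if drop[cond.Type] {
			fmt.Printf("Deleting deprecated condition %s\n", cond.Type)
			continue
		}
		kept = append(kept, cond)
	}
	node.Status.Conditions = kept

	// Node conditions live in the status subresource, so persist via UpdateStatus.
	_, err = client.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
	return err
}
```

Whether to patch or update the status is an implementation detail; the sketch uses UpdateStatus for simplicity.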
Just this morning I removed 120k+ conditions across 20k+ nodes using the linked PR's code. This is in addition to the 4k-6k conditions I removed while testing this 3 weeks back.
NPD treats a node condition as a permanent issue of the node. If there is a remedy system that fixes the issue, that system should also be responsible for cleaning up the conditions. Why would we need NPD to do such cleanup?
> when changing our system log monitor configurations
Is this mostly for dev and test purposes instead of for production use cases?
The solution presented in the attached PR is for a situation where a condition has been removed from NPD's config but still exists, in whatever state the old config last left it, on nodes in your fleet.
Say, for example, you're running NPD as a DaemonSet and you've configured a journald monitor to apply conditions as follows:
```json
{
  "plugin": "journald",
  "pluginConfig": {
    "source": "kubelet"
  },
  "lookback": "5m",
  "logPath": "/var/log/journal",
  "bufferSize": 10,
  "conditions": [
    {
      "type": "RunContainerError",
      "reason": "NoRunContainerError",
      "message": "No RunContainerErrors present"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "RunContainerError",
      "reason": "ContextDeadlineExceeded",
      "pattern": ".*rror syncing pod.*RunContainerError.*context deadline exceeded.*"
    }
  ]
}
```
Now suppose 5 out of 100 nodes get this condition applied.
You then change the config to look for a broader error pattern; maybe it looks like this:
```json
{
  "plugin": "journald",
  "pluginConfig": {
    "source": "kubelet"
  },
  "lookback": "5m",
  "logPath": "/var/log/journal",
  "bufferSize": 10,
  "conditions": [
    {
      "type": "ErrorSyncingPod",
      "reason": "NoErrorSyncingPod",
      "message": "No ErrorSyncingPod present"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "ErrorSyncingPod",
      "reason": "RunContainerError",
      "pattern": ".*rror syncing pod.*"
    }
  ]
}
```
After the NPD pods come back up, all 100 nodes will still have the RunContainerError condition present (either True or False), and will also now have ErrorSyncingPod present.
The issue is that we are no longer watching for the pattern that produced the RunContainerError condition. The 5 nodes that had it set to True are left with a stale True condition, and every node carries a lingering condition that nothing is monitoring anymore.
With the supplied solution, running NPD with --delete-deprecated-conditions --deprecated-condition-types="RunContainerError" will remove that condition from nodes.
At my workplace, this is actively in production today.
I see. So the use case is adding additional APIs so that you can declare which conditions were previously added by NPD but are no longer needed and should be cleaned up. Then, during NPD startup, those get removed.
How often do you need this, by the way? We assumed that the config would generally be stable. Why don't you need those deprecated conditions any more?
- In some cases, log lines change and the condition name may need to follow (this can happen on kernel upgrades, driver upgrades, etc.).
- Sometimes we change permanent conditions to temporary conditions, and we want to remove the permanent ones from the nodes.
- In some cases we have taken a single pattern and split it into multiple, more specific patterns.

In all cases we don't want "unmanaged" conditions lying around on our fleet, so we remove them with these flags. I wrote this feature late last year (this issue and PR are quite old now ;) ) and we have updated the list of deprecated conditions twice by now.
The feature of clearing old conditions is very useful; we have recently encountered a similar problem. Has this PR been merged into the master branch? How should I use it?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle rotten