node-problem-detector

Clearing Old Conditions

Open daveoy opened this issue 10 months ago • 7 comments

we have observed that when changing our system log monitor configurations to omit a previously watched condition, the condition persists on the node object.

I have added a bool flag --delete-deprecated-conditions and a string-slice flag --deprecated-condition-types, plus a handler in the k8sexporter that deletes those conditions from the node object on exporter initialization.
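to make this concrete, here is a rough sketch of how the flags and the cleanup step could be wired with pflag and client-go. this is an illustration only, not the PR's actual code: the package name, option struct, defaults, and the read-modify-write update strategy are all assumptions.

package k8sexporter // hypothetical package name for illustration

import (
	"context"

	"github.com/spf13/pflag"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

// cleanupOptions mirrors the two flags described above. Struct name and
// defaults are assumptions for the sake of the example.
type cleanupOptions struct {
	DeleteDeprecatedConditions bool
	DeprecatedConditionTypes   []string
}

func (o *cleanupOptions) AddFlags(fs *pflag.FlagSet) {
	fs.BoolVar(&o.DeleteDeprecatedConditions, "delete-deprecated-conditions", false,
		"Delete the conditions listed in --deprecated-condition-types from the node on startup.")
	fs.StringSliceVar(&o.DeprecatedConditionTypes, "deprecated-condition-types", nil,
		"Condition types no longer produced by any monitor that should be removed from the node object.")
}

// deleteDeprecatedConditions removes the listed condition types from the
// node's status, leaving every other condition untouched.
func deleteDeprecatedConditions(ctx context.Context, client kubernetes.Interface, nodeName string, o *cleanupOptions) error {
	if !o.DeleteDeprecatedConditions || len(o.DeprecatedConditionTypes) == 0 {
		return nil
	}
	drop := make(map[v1.NodeConditionType]bool, len(o.DeprecatedConditionTypes))
	for _, t := range o.DeprecatedConditionTypes {
		drop[v1.NodeConditionType(t)] = true
	}

	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	kept := make([]v1.NodeCondition, 0, len(node.Status.Conditions))
	for _, cond := range node.Status.Conditions {
		if drop[cond.Type] {
			klog.Infof("Deleting deprecated condition %s", cond.Type)
			continue
		}
		kept = append(kept, cond)
	}
	if len(kept) == len(node.Status.Conditions) {
		return nil // nothing to clean up
	}

	// Write the filtered list back through the status subresource. A
	// strategic-merge patch with "$patch": "delete" entries keyed on the
	// condition type would work as well, without the read-modify-write.
	node.Status.Conditions = kept
	_, err = client.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
	return err
}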

would this community be interested in a PR that supplies this feature?

daveoy avatar Jan 15 '25 17:01 daveoy

here's example output; i'll add a PR shortly

I0204 17:20:21.902108       7 problem_client.go:127] Deleting deprecated conditions [GPUMMUErrorXid31 JournaldCGroupOOMKilling JournaldGPUApplicationError JournaldGPUECCUncorrectableError JournaldGPUFallenOffBus JournaldGPUFault JournaldGPUGSPTimeoutXid119 JournaldGPUInvalidPushBuffer JournaldGPURowRemapFailure JournaldGPUWantsReset JournaldHardwareErrorCorrected JournaldHardwareErrorFatal JournaldHardwareErrorInfo JournaldHardwareErrorInterruptCPU JournaldHardwareErrorInterruptMemory JournaldHardwareErrorInterruptPCIe JournaldHardwareErrorInterruptUnknown JournaldHardwareErrorRecoverable JournaldKernelDeadlock JournaldKernelFailedToGetEntry JournaldKernelFailedToGetNextEntry JournaldKernelHardlock JournaldKernelOops JournaldKernelWatchLoopStarted JournaldLocalDiskErrors JournaldNFSStorageFault JournaldNVSwitchFailure JournaldNVSXidNonFatal JournaldPCIAER JournaldPersistentStorageFault JournaldReadonlyFilesystem JournaldSystemOOMKilling JournaldTaskHung JournaldUnregisterNetDevice] (if present)...
I0204 17:20:21.911443       7 problem_client.go:140] Deleting deprecated condition JournaldGPUApplicationError
I0204 17:20:21.911461       7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorInterruptUnknown
I0204 17:20:21.911465       7 problem_client.go:140] Deleting deprecated condition JournaldGPUFallenOffBus
I0204 17:20:21.911467       7 problem_client.go:140] Deleting deprecated condition JournaldPersistentStorageFault
I0204 17:20:21.911470       7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorInterruptMemory
I0204 17:20:21.911472       7 problem_client.go:140] Deleting deprecated condition JournaldReadonlyFilesystem
I0204 17:20:21.911474       7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorFatal
I0204 17:20:21.911477       7 problem_client.go:140] Deleting deprecated condition JournaldKernelDeadlock
I0204 17:20:21.911480       7 problem_client.go:140] Deleting deprecated condition JournaldLocalDiskErrors
I0204 17:20:21.911483       7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorInterruptPCIe
I0204 17:20:21.911486       7 problem_client.go:140] Deleting deprecated condition JournaldKernelHardlock
I0204 17:20:21.911490       7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorInterruptCPU
I0204 17:20:21.911492       7 problem_client.go:140] Deleting deprecated condition JournaldGPURowRemapFailure
I0204 17:20:21.911494       7 problem_client.go:140] Deleting deprecated condition JournaldGPUECCUncorrectableError
I0204 17:20:21.911496       7 problem_client.go:140] Deleting deprecated condition JournaldGPUWantsReset
I0204 17:20:21.911498       7 problem_client.go:140] Deleting deprecated condition JournaldGPUFault
I0204 17:20:21.911500       7 problem_client.go:140] Deleting deprecated condition JournaldGPUGSPTimeoutXid119
I0204 17:20:21.911502       7 problem_client.go:140] Deleting deprecated condition JournaldGPUInvalidPushBuffer

daveoy avatar Feb 04 '25 19:02 daveoy

removed 120k+ conditions across 20k+ nodes using the linked PR's code just this morning. this is in addition to the 4k-6k conditions i removed while testing this 3 weeks back

daveoy avatar Feb 04 '25 19:02 daveoy

NPD treats node conditions as permanent issues of the node. If there is any remedy system that fixes the issue, that system should also be responsible for cleaning up the conditions. Why would we need NPD to do such cleanup?

"when changing our system log monitor configurations"

Is this mostly for dev and test purposes rather than for production use cases?

wangzhen127 avatar Mar 11 '25 06:03 wangzhen127

the solution presented in the attached PR is for a situation where a condition has been removed from NPD's config but still exists on nodes in your fleet, in whatever state the old config last applied it.

say, for example, you're running NPD as a daemonset and you've configured a journald monitor to apply conditions as follows:

{
  "plugin": "journald",
  "pluginConfig": {
        "source": "kubelet"
  },
  "lookback": "5m",
  "logPath": "/var/log/journal",
  "bufferSize": 10,
  "conditions": [
    {
      "type": "RunContainerError",
      "reason": "NoRunContainerError",
      "message": "No RunContainerErrors present"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "RunContainerError",
      "reason": "ContextDeadlineExceeded",
      "pattern": ".*rror syncing pod.*RunContainerError.*context deadline exceeded.*"
    }
  ]
}

and 5 out of 100 nodes get this condition applied.

you then go and change the config to look for a broader error pattern; maybe it looks like this:

{
  "plugin": "journald",
  "pluginConfig": {
        "source": "kubelet"
  },
  "lookback": "5m",
  "logPath": "/var/log/journal",
  "bufferSize": 10,
  "conditions": [
    {
      "type": "ErrorSyncingPod",
      "reason": "NoErrorSyncingPod",
      "message": "No ErrorSyncingPod present"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "ErrorSyncingPod",
      "reason": "RunContainerError",
      "pattern": ".*rror syncing pod.*"
    }
  ]
}

after the NPD pods come back up, all 100 nodes will still have the condition RunContainerError present (either True or False), and will also now have ErrorSyncingPod present.

the issue is that we are no longer watching for the pattern that produced the RunContainerError condition. that leaves the 5 nodes with a True status carrying it indefinitely, and every node with a lingering condition that nothing is monitoring.
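as a rough illustration of the scale of the problem, here is a standalone client-go sketch that counts the nodes still carrying the retired condition type. the kubeconfig path is a placeholder; adjust it for your environment.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; replace with your own.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Count nodes that still carry the condition type no monitor watches anymore.
	lingering := 0
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == "RunContainerError" {
				lingering++
				break
			}
		}
	}
	fmt.Printf("%d of %d nodes still have the RunContainerError condition\n", lingering, len(nodes.Items))
}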

with the supplied solution, running NPD with --delete-deprecated-conditions --deprecated-condition-types="RunContainerError" will remove that condition from the nodes.

at my workplace, this is actively in production today.

daveoy avatar Mar 11 '25 14:03 daveoy

I see. So the use case is adding additional APIs to the config, so that you can declare which conditions were previously added by NPD but are no longer needed and should be cleaned up. Then, during NPD startup, those get removed.

How often do you need this btw? We had the assumption that the config would be stable in general. Why don't you need those deprecated conditions any more?

wangzhen127 avatar Mar 11 '25 15:03 wangzhen127

in some cases, log lines change and the wording of the condition name needs to follow (this can happen on kernel upgrades, driver upgrades, etc.).

sometimes we change permanent conditions to temporary conditions, and we want to remove the permanent ones from the nodes.

in some cases we have taken a single pattern and split it into multiple, more specific patterns.

in all cases we don't want "unmanaged" conditions lying around on our fleet, so we remove them with these flags.

i wrote this feature late last year (this issue and PR are quite old now ;) ) and we have updated the list of deprecated conditions twice by now.

daveoy avatar Mar 11 '25 18:03 daveoy

The feature of clearing old conditions is very useful. We have recently encountered a similar problem. Has this PR been merged into the master branch? How should I use it?

617406160 avatar Mar 25 '25 10:03 617406160

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 25 '25 01:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jul 25 '25 01:07 k8s-triage-robot

/remove-lifecycle rotten

cprivitere avatar Aug 04 '25 21:08 cprivitere

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 02 '25 22:11 k8s-triage-robot

/remove-lifecycle rotten

ajaysundark avatar Nov 24 '25 05:11 ajaysundark