Gauge for FilesystemIsReadOnly not downgraded to 0 after fixing the problem

Open sharonosbourne opened this issue 5 years ago • 19 comments

The problem occurred when the filesystem went into read-only mode. That was fixed, but the metrics still showed the counter and gauge set to 1. I ran a test and injected the FilesystemIsReadOnly message into /dev/kmsg multiple times (the rule is defined in https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json):

1 log_monitor.go:160] New status generated: &{Source:kernel-monitor Events:[{Severity:info Timestamp:2020-10-08 06:44:16.09315274 +0000 UTC m=+1331754.148888064 Reason:FilesystemIsReadOnly Message:Node condition ReadonlyFilesystem is now: True, reason: FilesystemIsReadOnly}] Conditions:[{Type:KernelDeadlock Status:False Transition:2020-09-22 20:48:21.98500453 +0000 UTC m=+0.040739839 Reason:KernelHasNoDeadlock Message:kernel has no deadlock} {Type:ReadonlyFilesystem Status:True Transition:2020-10-08 06:44:16.09315274 +0000 UTC m=+1331754.148888064 Reason:FilesystemIsReadOnly Message:Remounting filesystem read-only}]}

The metrics were still shown as 1 and never went back down to 0. Even after the issue with the read-only filesystem was fixed, the metric stayed at 1:

problem_counter{reason="FilesystemIsReadOnly"} 1
problem_gauge{reason="FilesystemIsReadOnly",type="ReadonlyFilesystem"} 1

As a workaround the pod was deleted, and after that the metrics were reset to 0. What is the reason for this behaviour? The "permanent" type? Is deleting the pod the only solution?

kernel-monitor.json

	{
		"type": "permanent",
		"condition": "ReadonlyFilesystem",
		"reason": "FilesystemIsReadOnly",
		"pattern": "Remounting filesystem read-only"
	}
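
For reference, the same kernel-monitor.json also declares a default state for the condition in its conditions section, roughly as in the excerpt below (paraphrased from the linked config; the exact reason and message strings may differ between versions). A "permanent" rule only flips the condition to True when the pattern matches, and no rule in the stock config ever sets ReadonlyFilesystem back to False, which would explain why the gauge only returns to 0 after the pod is deleted and node-problem-detector restarts with these defaults.

	"conditions": [
		{
			"type": "ReadonlyFilesystem",
			"reason": "FilesystemIsNotReadOnly",
			"message": "Filesystem is not read-only"
		}
	]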

sharonosbourne avatar Oct 08 '20 07:10 sharonosbourne

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Jan 06 '21 07:01 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar Feb 05 '21 07:02 fejta-bot

We are also facing a similar issue, but with many more occurrences, since we make extensive use of PVCs backed by GCP disks. With a lot of mount/unmount operations, the kernel logs many read-only disk events (not on the node's root disk), and as a result node-problem-detector marks the node as not ready. We may also need a more precise pattern in kernel-monitor.json so that only root filesystem events are caught.
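
Purely as an illustration of that last point (not tested; it assumes an ext4 root filesystem on a hypothetical device such as sda1, and the exact kernel message format varies by filesystem and kernel version), a more device-specific rule could anchor the pattern, which is a regular expression, to the root device:

	{
		"type": "permanent",
		"condition": "ReadonlyFilesystem",
		"reason": "FilesystemIsReadOnly",
		"pattern": "EXT4-fs \\(sda1\\): Remounting filesystem read-only"
	}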

adesaegher avatar Feb 17 '21 10:02 adesaegher

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community. /close

fejta-bot avatar Mar 19 '21 11:03 fejta-bot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 19 '21 11:03 k8s-ci-robot

This looks like a long-standing bug that is still happening. Any suggestions here?

/remove-lifecycle rotten

TaichiHo avatar Feb 15 '24 23:02 TaichiHo

/reopen

wangzhen127 avatar Feb 26 '24 18:02 wangzhen127

@wangzhen127: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 26 '24 18:02 k8s-ci-robot

Are we deploying the NPD as a Linux daemon or a privileged container?

bsdnet avatar Feb 28 '24 00:02 bsdnet

On GKE, it is deployed as a Linux daemon.

wangzhen127 avatar Mar 01 '24 01:03 wangzhen127

@sharonosbourne do you remember if your issue was due to a read-only filesystem on a non-boot disk?

wangzhen127 avatar Mar 01 '24 01:03 wangzhen127

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 30 '24 01:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jun 29 '24 01:06 k8s-triage-robot

/remove-lifecycle rotten

aslafy-z avatar Jul 02 '24 15:07 aslafy-z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 30 '24 15:09 k8s-triage-robot

/remove-lifecycle stale

aslafy-z avatar Sep 30 '24 15:09 aslafy-z