csi-driver-nfs

csi-nfs-controller pod fails

Open bitchecker opened this issue 1 year ago • 5 comments

What happened: I'm running k8s clusters on AWS EKS with spot instances for the node groups. I see that, randomly and not on all clusters, the pod that manages the CSI NFS controller goes into CrashLoopBackOff and reports these logs:

csi-snapshotter E1029 09:35:37.115611       1 leaderelection.go:340] Failed to update lock optimitically: Operation cannot be fulfilled on leases.coordination.k8s.io "external-snapshotter-leader-nfs-csi-k8s-io": the object has been modified; please apply your changes to the latest version and try again, falling back to slow path

If I delete the pod, everything starts again without any issue:

nfs Compiler: gc
nfs Driver Name: nfs.csi.k8s.io
nfs Driver Version: v4.9.0
nfs Git Commit: ""
nfs Go Version: go1.22.3
nfs Platform: linux/amd64

It seems that every time (or almost every time) an EC2 instance is retired and swapped for another one, csi-nfs-controller holds on to some lock that can only be released with a brutal pod delete.
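
For what it's worth, the lock in question looks like the leader-election Lease named in the log above. A quick way to see which pod currently holds it (only a sketch; kube-system is the chart's default install namespace, adjust if the driver lives elsewhere):

  # leases.coordination.k8s.io object used by the csi-snapshotter sidecar for leader election
  kubectl get lease external-snapshotter-leader-nfs-csi-k8s-io -n kube-system -o yaml
  # spec.holderIdentity shows which csi-nfs-controller pod owns the lease right now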

What you expected to happen: No CrashLoopBackOff status on the controller pod.

How to reproduce it: Deploy a cluster with spot instances, install csi-driver-nfs, and watch whether and when it happens.

Anything else we need to know?:
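
A less brutal workaround than deleting the whole pod (untested here, just a sketch under the same namespace assumption as above) might be to delete the stale Lease so that a surviving replica can re-acquire leadership; the sidecar recreates the Lease object on its own:

  kubectl delete lease external-snapshotter-leader-nfs-csi-k8s-io -n kube-system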

Environment:

  • CSI Driver version: 4.9.0
  • Kubernetes version (use kubectl version): 1.31
  • OS (e.g. from /etc/os-release): AWS Bottlerocket
  • Install tools: Terraform + Helm
  • Others:

bitchecker avatar Oct 29 '24 09:10 bitchecker

Also having this issue.

jeneb7297 avatar Jan 07 '25 13:01 jeneb7297

Did anyone find a solution for this? I am also facing this issue.

niranjandarshann avatar Feb 19 '25 07:02 niranjandarshann

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 20 '25 08:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jun 19 '25 09:06 k8s-triage-robot

I can understand the "issue" about "lacks enough active contributors", but closing the issue is not a way to address problems!

The issue is still present and should be resolved, because the controller is not working as expected.
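
In the meantime, a possible mitigation (only a sketch; I have not checked whether or how the Helm chart exposes these settings) is to relax the leader-election timing on the csi-snapshotter sidecar, so that an abruptly retired spot node does not leave the lease stuck for as long:

  # hypothetical args for the csi-snapshotter container in the csi-nfs-controller
  # Deployment; the flag names exist in the external-snapshotter sidecar, but the
  # values and the kube-system namespace are just example assumptions
  - name: csi-snapshotter
    args:
      - "--v=2"
      - "--leader-election"
      - "--leader-election-namespace=kube-system"
      - "--leader-election-lease-duration=60s"
      - "--leader-election-renew-deadline=50s"
      - "--leader-election-retry-period=15s"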

bitchecker avatar Jun 19 '25 10:06 bitchecker

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jul 19 '25 11:07 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 19 '25 11:07 k8s-ci-robot