cloud-provider-openstack

[cinder-csi-plugin] Orphaned Mount if OpenStack Volume Not Found

Open seanschneeweiss opened this issue 2 years ago • 8 comments

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

We ran into a condition where the ephemeral CSI volume no longer exists on the OpenStack side. As a result, the mount remains on the node and is never removed.

For example, /dev/sds is still mounted even though the device /dev/sds no longer exists.
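To spot this condition on a node, one option is to compare the mount table against the devices that actually exist. The following is only a minimal sketch, not part of cinder-csi-plugin: the helper names `orphanedMounts` and `statExists` are made up for illustration, it assumes a Linux node with `/proc/mounts`, and it treats any `os.Stat` error as "device missing", so it may report false positives.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// orphanedMounts scans text in /proc/mounts format and returns the mount
// points whose source device (a path under /dev/) is reported as missing.
// deviceExists is injected so the logic can be exercised without real devices.
// Hypothetical helper: it does not exist in cloud-provider-openstack.
func orphanedMounts(mounts string, deviceExists func(string) bool) []string {
	var orphaned []string
	sc := bufio.NewScanner(strings.NewReader(mounts))
	for sc.Scan() {
		// Each line is: <source> <mount point> <fstype> <options> <dump> <pass>.
		// Spaces inside paths are escaped as \040, so splitting on whitespace is safe.
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		device, mountPoint := fields[0], fields[1]
		if !strings.HasPrefix(device, "/dev/") {
			continue // skip tmpfs, proc, overlay and other non-device sources
		}
		if !deviceExists(device) {
			orphaned = append(orphaned, mountPoint)
		}
	}
	return orphaned
}

// statExists reports whether path can be stat'ed; any error counts as missing.
func statExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	data, err := os.ReadFile("/proc/mounts")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc/mounts:", err)
		return
	}
	for _, mp := range orphanedMounts(string(data), statExists) {
		fmt.Println("orphaned mount:", mp)
	}
}
```

Run on the affected node, this would list every mount point whose source device node is gone, which in the case described here should include the kubelet volume directory backed by /dev/sds.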

See the following output of journalctl -u kubelet:

Sep 26 05:48:39 cluster-md-84b6b9494b-wqgkp kubelet[4344]: E0926 05:48:39.937675    4344 nestedpendingoperations.go:335]
Operation for "{volumeName:kubernetes.io/csi/cinder.csi.openstack.org^2c3c41f9-d080-4046-b488-cdf1205a8058 podName:afafdc4f-51ab-442e-a861-e925ad41d068 nodeName:}" failed. 
No retries permitted until 2022-09-26 05:50:41.937645825 +0000 UTC m=+10879386.722097986 (durationBeforeRetry 2m2s). 
Error: UnmountVolume.TearDown failed for volume "my-volume" (UniqueName: "kubernetes.io/csi/cinder.csi.openstack.org^2c3c41f9-d080-4046-b488-cdf1205a8058") 
pod "afafdc4f-51ab-442e-a861-e925ad41d068" (UID: "afafdc4f-51ab-442e-a861-e925ad41d068") : kubernetes.io/csi: Unmounter.TearDownAt failed: 
rpc error: code = NotFound desc = Volume not found ephemeral-2c3c41f9-d080-4046-b488-cdf1205a8058

It seems that

err = ns.Mount.UnmountPath(targetPath)

is never executed if kubelet / csi-cinder cannot find the volume on OpenStack.

This might be a bug somewhere around: https://github.com/kubernetes/cloud-provider-openstack/blob/42f4ede114638091b5f6ab851a0873c479eeea32/pkg/csi/cinder/nodeserver.go#L294-L302

What you expected to happen:

If the OpenStack volume and the device no longer exist, csi-cinder should still unmount the volume.

How to reproduce it: I was not able to reproduce this issue. This might be a race condition where the volume was detached without Kubernetes noticing the detachment (maybe by some manual fix on the OpenStack side).

Anything else we need to know?:

Even though this is not reproducible at the moment, I would like to ask the general question: should we unmount volumes that can no longer be found in OpenStack (either by ID or by name)?

Environment:

  • Kubernetes: v1.23.10
  • csi-cinder-plugin: v1.23.0

Sean Schneeweiss [email protected], Mercedes-Benz Tech Innovation GmbH, Provider Information

seanschneeweiss, Nov 24 '22 21:11

There are at least two things I think we could address:

code = NotFound desc = Volume not found ephemeral-2c3c41f9-d080-4046-b488-cdf1205a8058

This error is misleading; we should improve the logging and error reporting (remove the ephemeral prefix).

err = ns.Mount.UnmountPath(targetPath)

This is the challenge: if a volume is not found, then from CPO's point of view there is nothing we can do for that volume. We certainly can do the unmount, but with the current logic you would still get an error that the volume is not found.

So I guess we can do the unmount even when the volume is not there, and keep the current logic for the other cases (e.g. when getting the volume fails with some other error).

jichenjc, Nov 25 '22 00:11

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot, Feb 23 '23 01:02

/lifecycle rotten

k8s-triage-robot, Mar 25 '23 01:03

/remove-lifecycle rotten

seanschneeweiss, Apr 18 '23 21:04

/lifecycle stale

k8s-triage-robot, Jul 17 '23 21:07

/remove-lifecycle stale

mdbooth, Jul 18 '23 16:07

/lifecycle stale

k8s-triage-robot, Jan 24 '24 19:01

/lifecycle rotten

k8s-triage-robot, Feb 23 '24 19:02

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot, Mar 24 '24 19:03

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot, Mar 24 '24 19:03