cloud-provider-openstack
[cinder-csi-plugin] Orphaned Mount if OpenStack Volume Not Found
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
We ran into a condition where an ephemeral CSI volume no longer exists on the OpenStack side. This leads to a situation where the mount remains on the node and is never removed.
E.g. /dev/sds is still mounted, but the device /dev/sds no longer exists.
See the following journalctl -u kubelet output:
Sep 26 05:48:39 cluster-md-84b6b9494b-wqgkp kubelet[4344]: E0926 05:48:39.937675 4344 nestedpendingoperations.go:335]
Operation for "{volumeName:kubernetes.io/csi/cinder.csi.openstack.org^2c3c41f9-d080-4046-b488-cdf1205a8058 podName:afafdc4f-51ab-442e-a861-e925ad41d068 nodeName:}" failed.
No retries permitted until 2022-09-26 05:50:41.937645825 +0000 UTC m=+10879386.722097986 (durationBeforeRetry 2m2s).
Error: UnmountVolume.TearDown failed for volume "my-volume" (UniqueName: "kubernetes.io/csi/cinder.csi.openstack.org^2c3c41f9-d080-4046-b488-cdf1205a8058")
pod "afafdc4f-51ab-442e-a861-e925ad41d068" (UID: "afafdc4f-51ab-442e-a861-e925ad41d068") : kubernetes.io/csi: Unmounter.TearDownAt failed:
rpc error: code = NotFound desc = Volume not found ephemeral-2c3c41f9-d080-4046-b488-cdf1205a8058
It seems that
err = ns.Mount.UnmountPath(targetPath)
is never executed if kubelet / csi-cinder cannot find the volume in OpenStack.
This might be a bug somewhere around: https://github.com/kubernetes/cloud-provider-openstack/blob/42f4ede114638091b5f6ab851a0873c479eeea32/pkg/csi/cinder/nodeserver.go#L294-L302
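For context, a minimal, hypothetical sketch of the NodeUnpublishVolume flow this describes follows (this is not the actual cloud-provider-openstack code; the interface and type names are assumptions, and only ns.Mount.UnmountPath(targetPath) and the NotFound error text come from the report):

```go
package cinder

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Hypothetical stand-ins for the real cinder-csi-plugin dependencies;
// only the call shape matters for illustrating the control flow.
type openstackClient interface {
	GetVolume(volumeID string) (interface{}, error)
	GetVolumesByName(name string) ([]interface{}, error)
}

type mounter interface {
	UnmountPath(targetPath string) error
}

type nodeServer struct {
	Cloud openstackClient
	Mount mounter
}

// NodeUnpublishVolume, roughly as described in the report: the volume is
// looked up in OpenStack before the target path is unmounted, so a missing
// volume short-circuits the call and the mount is left behind on the node.
func (ns *nodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error) {
	volumeID := req.GetVolumeId()
	targetPath := req.GetTargetPath()

	if _, err := ns.Cloud.GetVolume(volumeID); err != nil {
		// Lookup by ID failed; an ephemeral volume is then looked up by its
		// "ephemeral-<id>" name. If that also finds nothing ...
		vols, err := ns.Cloud.GetVolumesByName("ephemeral-" + volumeID)
		if err != nil || len(vols) == 0 {
			// ... the RPC returns here with the NotFound error seen in the
			// kubelet log above, and the unmount below is never reached.
			return nil, status.Errorf(codes.NotFound, "Volume not found ephemeral-%s", volumeID)
		}
	}

	// Never reached when the volume no longer exists in OpenStack.
	if err := ns.Mount.UnmountPath(targetPath); err != nil {
		return nil, status.Errorf(codes.Internal, "Unmount of targetpath %s failed: %v", targetPath, err)
	}
	return &csi.NodeUnpublishVolumeResponse{}, nil
}
```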
What you expected to happen:
If the OpenStack volume and the device no longer exist, csi-cinder should still unmount the volume.
How to reproduce it: I was not able to reproduce this issue. This might be a race condition where the volume was detached without Kubernetes noticing the detachment (perhaps due to a manual fix on the OpenStack side).
Anything else we need to know?:
Even though this is not reproducible at the moment, I would like to ask the general question: should we unmount volumes that can no longer be found in OpenStack (either by ID or by name)?
Environment:
- Kubernetes: v1.23.10
- csi-cinder-plugin: v1.23.0
Sean Schneeweiss [email protected], Mercedes-Benz Tech Innovation GmbH, Provider Information
There are at least two things I think we might address:
code = NotFound desc = Volume not found ephemeral-2c3c41f9-d080-4046-b488-cdf1205a8058
This error is misleading; we should improve the logging and error reporting (remove the ephemeral prefix).
err = ns.Mount.UnmountPath(targetPath)
This one is a challenge: if a volume is not found, there is nothing CPO can do for that volume. We can certainly do the unmount, but with the current logic you would still get an error saying the volume is not found. So I guess we can do the unmount even though the volume is not there, and for other cases (e.g. GetVolume returning a different error) we keep the current logic.
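A minimal sketch of that proposal, continuing the hypothetical types from the snippet above (and assuming "errors" is added to its imports): a not-found volume falls through to the unmount, while any other lookup error keeps the current behaviour. The errVolumeNotFound sentinel is a placeholder for however CPO would actually classify the OpenStack API error.

```go
// errVolumeNotFound stands in for however CPO classifies a "volume not
// found" response from the OpenStack API (e.g. a gophercloud 404).
var errVolumeNotFound = errors.New("volume not found")

func (ns *nodeServer) NodeUnpublishVolumeProposed(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error) {
	volumeID := req.GetVolumeId()
	targetPath := req.GetTargetPath()

	if _, err := ns.Cloud.GetVolume(volumeID); err != nil {
		if !errors.Is(err, errVolumeNotFound) {
			// Any other lookup failure keeps the current behaviour.
			return nil, status.Errorf(codes.Internal, "GetVolume %s failed: %v", volumeID, err)
		}
		// The volume is gone on the OpenStack side: the device cannot come
		// back, so fall through and remove the mount instead of returning
		// NotFound and leaving it orphaned.
	}

	if err := ns.Mount.UnmountPath(targetPath); err != nil {
		return nil, status.Errorf(codes.Internal, "Unmount of targetpath %s failed: %v", targetPath, err)
	}
	return &csi.NodeUnpublishVolumeResponse{}, nil
}
```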
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.