aws-fsx-csi-driver
Better handling of Lustre volumes that go to an unrecoverable Failed status
Is your feature request related to a problem? / Why is this needed
From time to time I get Lustre volumes in the "Failed" state, usually with an error message that I assume indicates a lack of capacity on the AWS side (as I am not reaching my quota limits).
The issue shows up in the PVC: it stays in the Pending state forever, while the volume it represents is in a state that is not recoverable. The same file system appears as Failed in the AWS console.
The PVC keeps looping, waiting for the volume to be created, while the FSx volume has simply failed, and that will never change.
/feature
Describe the solution you'd like in detail
My proposed solution does not fix the root cause, but at least lets you manage the problem yourself. I would like the PVC to go to a different state that makes it clear it is in an unrecoverable situation. That way you could handle the "Failed" situation from within your own software just by checking the status of the PVC, without using the AWS API.
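In the meantime, a workaround on the consuming software's side is to treat a PVC that has been Pending for too long as failed. A minimal sketch, assuming you feed it the parsed JSON of `kubectl get pvc <name> -o json`; the 15-minute default is an arbitrary assumption of mine, not anything the driver defines:

```python
from datetime import datetime, timedelta, timezone

def is_stuck_pending(pvc: dict, max_wait: timedelta = timedelta(minutes=15)) -> bool:
    """Return True if the PVC has been Pending longer than max_wait.

    `pvc` is the parsed JSON of `kubectl get pvc <name> -o json`.
    Tune max_wait to how long FSx for Lustre creation normally takes
    in your account.
    """
    if pvc.get("status", {}).get("phase") != "Pending":
        return False
    created = datetime.fromisoformat(
        pvc["metadata"]["creationTimestamp"].replace("Z", "+00:00")
    )
    return datetime.now(timezone.utc) - created > max_wait
```

This only approximates the "unrecoverable" signal the issue asks for; a long-Pending PVC could still, in principle, become Bound.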
Describe alternatives you've considered
- Ideally, when the Lustre volume fails to be created in this manner, I would like the driver to retry creating the volume, though I don't know whether the AWS SDK allows for this.
- Another alternative would be for the driver to create a new volume when creation fails, but I can imagine this leading to issues in cases where the new volume also goes to "Failed", possibly for different reasons.
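For reference, the unrecoverable state is visible in the FSx API's DescribeFileSystems response (the `Lifecycle` and `FailureDetails` fields), so the driver, or an external controller in the meantime, could detect it. A rough sketch over an already-parsed response; treating only `FAILED` as unrecoverable is my assumption here:

```python
def failed_file_systems(describe_response: dict) -> list:
    """Return (FileSystemId, failure message) pairs for file systems
    the FSx API reports as FAILED. `describe_response` is a parsed
    DescribeFileSystems result, e.g. from boto3's
    fsx.describe_file_systems() or `aws fsx describe-file-systems`."""
    failed = []
    for fs in describe_response.get("FileSystems", []):
        if fs.get("Lifecycle") == "FAILED":
            message = fs.get("FailureDetails", {}).get("Message", "")
            failed.append((fs["FileSystemId"], message))
    return failed
```

A controller using this could then surface the failure on the PVC (for example as an event) instead of retrying forever.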
Additional context
Another effect of the current situation is that when the PVC is removed, it leaves the Lustre volume behind in AWS, so you need to clean it up manually.
Also note that if you remove the Failed volume, the driver will create a new one and become healthy again (provided the new one does not also go to the Failed state).
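Until the driver cleans these up itself, leftover file systems can be found by comparing the FSx side against the cluster's PersistentVolumes. A sketch of the pure comparison; it assumes (as I understand this driver, worth verifying before deleting anything) that a PV's `spec.csi.volumeHandle` is the FSx file system ID:

```python
def orphaned_file_systems(fsx_ids, pv_volume_handles):
    """File system IDs present in FSx but not referenced by any
    PersistentVolume: candidates for manual cleanup. Double-check
    each one in the AWS console before deleting it."""
    return sorted(set(fsx_ids) - set(pv_volume_handles))
```

The inputs could come from `aws fsx describe-file-systems` and `kubectl get pv -o json`, respectively.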
Edit: add the Additional context
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle stale
/remove-lifecycle stale
@jacobwolfaws I haven't seen this happening for a long while, though I am not sure whether that is due to changes on the AWS side or I just haven't run into capacity issues. Feel free to close it, as it seems no one else is reporting it either.
@kanor1306 thanks for updating this! Going to leave this thread open, because I do agree we need to improve our capacity-shortage messaging.
/lifecycle frozen