aws-fsx-csi-driver

Better handling of Lustre volume that go to unrecoverable Failed status

Open · kanor1306 opened this issue 2 years ago · 10 comments

Is your feature request related to a problem?/Why is this needed
From time to time I get Lustre volumes in a "Failed" state, usually with the message shown in the screenshot below, which I assume is due to a lack of capacity on the AWS side (as I am not reaching my quota limits).

[Screenshot 2022-01-28 at 11:04:26: error message on the failed file system]

The issue comes with the PVC: it stays in the Pending state forever, while the volume it represents is in a state that is not recoverable. See:

[Screenshot 2022-01-28 at 10:31:49: the PVC stuck in Pending]

And the file system in the AWS console:

[Screenshot 2022-01-28 at 11:04:18: the file system shown as Failed in the AWS console]

The PVC keeps looping, waiting for the volume to be created, while the FSx volume has simply failed, and that will not change.
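For context, the only reliable way I have found to confirm from code that the underlying file system is unrecoverable is to go through the AWS API directly. A minimal, untested sketch with aws-sdk-go (the region and file system ID are placeholders; the real ID would come from the PV's volume handle):

```go
// Sketch: check via the AWS API whether the FSx file system behind a PV has
// gone to FAILED. Region and file system ID below are placeholders.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/fsx"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("eu-west-1")))
	client := fsx.New(sess)

	out, err := client.DescribeFileSystems(&fsx.DescribeFileSystemsInput{
		FileSystemIds: []*string{aws.String("fs-0123456789abcdef0")}, // placeholder ID
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, fs := range out.FileSystems {
		if aws.StringValue(fs.Lifecycle) == fsx.FileSystemLifecycleFailed {
			msg := ""
			if fs.FailureDetails != nil {
				msg = aws.StringValue(fs.FailureDetails.Message)
			}
			fmt.Printf("%s is FAILED and will not recover: %s\n",
				aws.StringValue(fs.FileSystemId), msg)
		}
	}
}
```

This is exactly the kind of call I would like to avoid having to make from my own software.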

/feature

Describe the solution you'd like in detail
My proposed solution does not fix the issue directly, but at least lets you manage the problem yourself. I would like the PVC to go to a different state that makes it clear it is in an unrecoverable situation. That way you could handle the "Failed" situation from within your own software just by checking the status of the PVC, without going through the AWS API.
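The closest approximation I have today is to watch the Events attached to the PVC for the external-provisioner's ProvisioningFailed warnings, which is fragile because it comes down to string-matching on event messages. A rough, untested client-go sketch (namespace and claim name are placeholders):

```go
// Sketch: list the Events attached to a stuck PVC and surface repeated
// ProvisioningFailed warnings. Namespace and claim name are placeholders.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ns, claim := "default", "my-lustre-pvc" // placeholders

	events, err := cs.CoreV1().Events(ns).List(context.TODO(), metav1.ListOptions{
		FieldSelector: "involvedObject.kind=PersistentVolumeClaim,involvedObject.name=" + claim,
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, ev := range events.Items {
		if ev.Reason == "ProvisioningFailed" {
			fmt.Printf("%s (x%d): %s\n", ev.Reason, ev.Count, ev.Message)
		}
	}
}
```

A dedicated terminal state or condition on the PVC would make this kind of check trivial and robust.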

Describe alternatives you've considered

  • Ideally, when the Lustre volume fails to be created in this manner, I would like the driver to retry the creation of the volume, though I don't know whether the AWS SDK allows for this.
  • Another alternative would be for the driver to create a new volume when the creation fails, but I can imagine this leading to issues in cases where the volume goes to "Failed" for different reasons.

Additional context
Another effect of the current situation is that when the PVC is removed, it leaves the Lustre volume behind in AWS, so you need to clean it up manually.

Also note that if you remove the Failed volume, the driver will create a new one and everything becomes healthy again (provided the new one does not also go to the Failed state).
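For anyone hitting the same thing, the manual workaround that follows from the above is to delete the failed file system through the AWS API so the driver provisions a fresh one. An untested sketch (the file system ID is a placeholder, and this only applies here because the file system failed at creation and never held data):

```go
// Sketch: delete a FAILED FSx for Lustre file system so the CSI driver
// retries provisioning. The file system ID is a placeholder.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/fsx"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("eu-west-1")))
	client := fsx.New(sess)

	_, err := client.DeleteFileSystem(&fsx.DeleteFileSystemInput{
		FileSystemId: aws.String("fs-0123456789abcdef0"), // placeholder ID
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("delete requested; provisioning should be retried with a new file system")
}
```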

Edit: added the Additional context section.

kanor1306 · Jan 28 '22 11:01

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · May 17 '22 17:05

/remove-lifecycle stale

kanor1306 · May 20 '22 11:05

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Aug 18 '22 11:08

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Sep 17 '22 11:09

/remove-lifecycle rotten

kanor1306 · Sep 20 '22 15:09

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Dec 19 '22 15:12

/remove-lifecycle stale

jacobwolfaws · Dec 22 '22 14:12

@jacobwolfaws I haven't seen this happening for a long while, though I'm not sure whether that is due to changes on the AWS side or because I simply haven't run into capacity issues again. Feel free to close it, since it seems no one else is reporting it either.

kanor1306 · Dec 22 '22 14:12

@kanor1306 thanks for updating this! I'm going to leave this thread open, because I do agree we need to improve our capacity-shortage messaging.

jacobwolfaws · Dec 22 '22 14:12

/lifecycle frozen

jacobwolfaws · Jan 12 '23 21:01