
[cinder-csi-plugin] Add finalizer to VolumeSnapshot when creating a PVC from the VolumeSnapshot

Open jeffyjf opened this issue 2 years ago • 21 comments

What this PR does / why we need it:

This PR implements a PVC watcher. When the watcher receives a PVC ADDED event for a PVC whose dataSource is of kind VolumeSnapshot, a finalizer is added to that VolumeSnapshot. The finalizer is removed when the watcher receives the PVC's DELETED event. This protects the VolumeSnapshot from being deleted while any PersistentVolume still depends on it.

Which issue this PR fixes (if applicable): fixes #2294

Special notes for reviewers:

Asynchronously adding the finalizer from a watcher seems unsafe and might result in some potential issues. After PR [1] is merged and released, we need to refactor this and implement a synchronous way to add the finalizer.

[1] https://github.com/kubernetes-csi/external-provisioner/pull/1070

Release note:

Add a command-line flag `--enable-vs-deletion-protection`. If set, a finalizer is added to a VolumeSnapshot whenever a PVC is created from it, ensuring the VolumeSnapshot cannot be deleted before the PVC is deleted.

jeffyjf avatar Sep 26 '23 11:09 jeffyjf

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign jichenjc for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Sep 26 '23 11:09 k8s-ci-robot

Hi @jeffyjf. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 26 '23 11:09 k8s-ci-robot

Looks very WIP right now so probably not worth actually running tests, but

/ok-to-test

Note that tests won't run automatically while the PR is marked WIP, but you can run them manually on demand when you're ready with /test all.

mdbooth avatar Sep 27 '23 13:09 mdbooth

Hi @mdbooth, thanks very much for your comments, but I have a small question.

However, I don't think we can safely add the finalizer asynchronously. A finalizer would typically be added synchronously before resource creation. It's also worth noting that the resource in this case would be the PV rather than the PVC.

I am a newbie. In my opinion, this cannot be implemented with just a watcher; we would need to add a webhook to do this, right? If that's not right, do you have any other suggestions?

jeffyjf avatar Oct 02 '23 10:10 jeffyjf

As mentioned in the issue, I do like the concept of adding a finalizer to the snapshot for each volume which requires it. However, I can't think of another instance where finalizers are used this way, so it might be worth some effort to discover if there are any problems with potentially very big lists of finalizers. Imagine, for example, a CI system which uses a 'golden image' extremely heavily and has 1,000 volumes using it. Would that be ok?

However, I don't think we can safely add the finalizer asynchronously. A finalizer would typically be added synchronously before resource creation. It's also worth noting that the resource in this case would be the PV rather than the PVC.

Could we add code which adds and persists the snapshot finalizer immediately before creating the PV?

We would also need to add code to remove the finalizer in the deletion flow of the PV strictly before removing the PV finalizer.

In the past few days, I tried a few methods to add the finalizer synchronously before creating the PV:

  1. I tried implementing this in the CreateVolume function, but I found I can't get enough context information in a graceful way. For example, I can't obtain the Kubernetes VolumeSnapshot's name and namespace, which means I can't add a finalizer to it.
  2. I tried adding a webhook to intercept the PV's creation, which can add the finalizer before the PV is created. But this would increase the difficulty of deployment and maintenance. Moreover, if a cluster has multiple webhooks for PVs and another webhook rejects the PV's creation after this one, I don't know how to resolve that situation. So I don't like this approach.

I think the first method is more graceful and appropriate. But it requires the external-provisioner sidecar to provide more of the necessary information. So I think we should submit a KEP to the CSI spec so that we can get enough information about the VolumeSnapshot in the CreateVolume function to resolve this issue appropriately.

jeffyjf avatar Oct 09 '23 02:10 jeffyjf

As mentioned in the issue, I do like the concept of adding a finalizer to the snapshot for each volume which requires it. However, I can't think of another instance where finalizers are used this way, so it might be worth some effort to discover if there are any problems with potentially very big lists of finalizers. Imagine, for example, a CI system which uses a 'golden image' extremely heavily and has 1,000 volumes using it. Would that be ok?

However, I don't think we can safely add the finalizer asynchronously. A finalizer would typically be added synchronously before resource creation. It's also worth noting that the resource in this case would be the PV rather than the PVC.

Could we add code which adds and persists the snapshot finalizer immediately before creating the PV? We would also need to add code to remove the finalizer in the deletion flow of the PV strictly before removing the PV finalizer.

In the past few days, I tried a few methods to add the finalizer synchronously before creating the PV:

  1. I tried implementing this in the CreateVolume function, but I found I can't get enough context information in a graceful way. For example, I can't obtain the Kubernetes VolumeSnapshot's name and namespace, which means I can't add a finalizer to it.
  2. I tried adding a webhook to intercept the PV's creation, which can add the finalizer before the PV is created. But this would increase the difficulty of deployment and maintenance. Moreover, if a cluster has multiple webhooks for PVs and another webhook rejects the PV's creation after this one, I don't know how to resolve that situation. So I don't like this approach.

I think the first method is more graceful and appropriate. But it requires the external-provisioner sidecar to provide more of the necessary information. So I think we should submit a KEP to the CSI spec so that we can get enough information about the VolumeSnapshot in the CreateVolume function to resolve this issue appropriately.

I've already submitted a PR to external-provisioner so that we can add the finalizer to the VolumeSnapshot synchronously in the future. For now, let's complete this imperfect solution to fix issue #2294 and stop the resource leak in production environments as soon as possible.

jeffyjf avatar Nov 02 '23 12:11 jeffyjf

Hi @mdbooth, thanks very much for your comments, but I have a small question.

However, I don't think we can safely add the finalizer asynchronously. A finalizer would typically be added synchronously before resource creation. It's also worth noting that the resource in this case would be the PV rather than the PVC.

I am a newbie. In my opinion, this cannot be implemented with just a watcher; we would need to add a webhook to do this, right? If that's not right, do you have any other suggestions?

Hey, what is the problem with adding a finalizer asynchronously? We did that all the time in Kuryr, and it worked totally fine as long as we made sure to do it by patching the resource properly.

dulek avatar Nov 08 '23 16:11 dulek

/hold cancel

jeffyjf avatar Nov 09 '23 11:11 jeffyjf

Hi CPO Co-Leads @dulek @jichenjc @kayrus @zetaab

This PR is ready for review. Please take a look at it in your spare time. I'm eager to get your advice.

jeffyjf avatar Nov 13 '23 07:11 jeffyjf

Asynchronously adding the finalizer from a watcher seems unsafe and might result in some potential issues. After PR [1] is merged and released, we need to refactor this and implement a synchronous way to add the finalizer.

Thanks for the PR. I thought this PR relied on https://github.com/kubernetes-csi/external-provisioner/pull/1070, but reading it again, it seems you are proposing a long-term solution in external-provisioner, and after that's done we will do another round of fixes to the current one? So this PR is a temporary stopgap?

jichenjc avatar Nov 14 '23 07:11 jichenjc

Asynchronously adding the finalizer from a watcher seems unsafe and might result in some potential issues. After PR [1] is merged and released, we need to refactor this and implement a synchronous way to add the finalizer.

Thanks for the PR. I thought this PR relied on kubernetes-csi/external-provisioner#1070, but reading it again, it seems you are proposing a long-term solution in external-provisioner, and after that's done we will do another round of fixes to the current one? So this PR is a temporary stopgap?

Yep, I am not sure when kubernetes-csi/external-provisioner#1070 can be merged and released. This PR can be used temporarily to prevent the resource leak.

jeffyjf avatar Nov 14 '23 08:11 jeffyjf

Hi @mdbooth, thanks very much for your comments, but I have a small question.

However, I don't think we can safely add the finalizer asynchronously. A finalizer would typically be added synchronously before resource creation. It's also worth noting that the resource in this case would be the PV rather than the PVC.

I am a newbie. In my opinion, this cannot be implemented with just a watcher; we would need to add a webhook to do this, right? If that's not right, do you have any other suggestions?

Hey, what is the problem with adding a finalizer asynchronously? We did that all the time in Kuryr, and it worked totally fine as long as we made sure to do it by patching the resource properly.

There may just be some potential problems; I have no concrete example. Maybe in an extremely busy environment, the resource could be deleted before the finalizer is successfully added.

jeffyjf avatar Nov 14 '23 08:11 jeffyjf

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 12 '23 04:12 k8s-ci-robot

@jeffyjf can you rebase on latest master?

kayrus avatar Dec 12 '23 09:12 kayrus

@jeffyjf can you rebase on latest master?

Of course, I will rebase it later on.

jeffyjf avatar Dec 13 '23 00:12 jeffyjf

@jeffyjf: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
openstack-cloud-keystone-authentication-authorization-test ee04af8ef62ebcb7c9f9f30321cfab09d22a91b6 link true /test openstack-cloud-keystone-authentication-authorization-test
openstack-cloud-controller-manager-e2e-test ee04af8ef62ebcb7c9f9f30321cfab09d22a91b6 link true /test openstack-cloud-controller-manager-e2e-test
openstack-cloud-csi-manila-e2e-test ee04af8ef62ebcb7c9f9f30321cfab09d22a91b6 link true /test openstack-cloud-csi-manila-e2e-test
openstack-cloud-csi-manila-sanity-test ee04af8ef62ebcb7c9f9f30321cfab09d22a91b6 link true /test openstack-cloud-csi-manila-sanity-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Dec 13 '23 01:12 k8s-ci-robot

@jeffyjf can you fix unit tests?

kayrus avatar Dec 13 '23 08:12 kayrus

@jeffyjf can you fix unit tests?

Of course, I'll fix it asap.

jeffyjf avatar Dec 13 '23 08:12 jeffyjf

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 03 '24 18:01 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 10 '24 06:04 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar May 10 '24 07:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Jun 09 '24 08:06 k8s-triage-robot

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 09 '24 08:06 k8s-ci-robot