azuredisk-csi-driver icon indicating copy to clipboard operation
azuredisk-csi-driver copied to clipboard

[V2] bug: WaitForAttach call should also be able to detach replicas for primary

Open sunpa93 opened this issue 2 years ago • 8 comments

What type of PR is this? /kind bug

What this PR does / why we need it:

  • Currently, only CrdProvisioner.PublishVolume is capable of detaching replicas to make resources available for primary attachment.
  • However, due to async attach behavior and possible races between CrdProvisioner and the CRD controllers, CrdProvisioner.WaitForAttach also needs to be able to detach replicas for primary.

Which issue(s) this PR fixes:

Fixes #

Requirements:

Special notes for your reviewer:

Release note:

none

sunpa93 avatar Sep 28 '22 23:09 sunpa93

/assign @edreed /assign @hccheng72

sunpa93 avatar Sep 29 '22 16:09 sunpa93

@sunpa93: GitHub didn't allow me to assign the following users: hccheng72.

Note that only kubernetes-sigs members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/assign @edreed /assign @hccheng72

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 29 '22 16:09 k8s-ci-robot

/assign @nearora-msft

sunpa93 avatar Sep 29 '22 16:09 sunpa93

Hold pending cut of the next release.

/hold

edreed avatar Sep 29 '22 16:09 edreed

/hold cancel

edreed avatar Sep 30 '22 06:09 edreed

/hold This PR will have some conflicts in usage of WaitForAttach with changes introduced in https://github.com/kubernetes-sigs/azuredisk-csi-driver/pull/1531

sunpa93 avatar Sep 30 '22 18:09 sunpa93

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sunpa93 Once this PR has been reviewed and has the lgtm label, please ask for approval from edreed by writing /assign @edreed in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Oct 18 '22 17:10 k8s-ci-robot

/retest

sunpa93 avatar Oct 20 '22 16:10 sunpa93

@sunpa93: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-azuredisk-csi-driver-verify-mainv2 14df7f54ab0398aec11133146c01d1956a344b83 link true /test pull-azuredisk-csi-driver-verify-mainv2
pull-azuredisk-csi-driver-e2e-capz-mainv2 14df7f54ab0398aec11133146c01d1956a344b83 link true /test pull-azuredisk-csi-driver-e2e-capz-mainv2
pull-azuredisk-csi-driver-e2e-capz-windows-2019-mainv2 14df7f54ab0398aec11133146c01d1956a344b83 link true /test pull-azuredisk-csi-driver-e2e-capz-windows-2019-mainv2

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Oct 26 '22 20:10 k8s-ci-robot

/retest

edreed avatar Nov 29 '22 17:11 edreed

@sunpa93 Is this change ready to go?

edreed avatar Nov 30 '22 03:11 edreed

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 02 '23 03:03 k8s-triage-robot

Can we move creating attachmentObj right after attachmentObj isn't found? So if the creation fails, the function can return early and save getReplicasToDetach process.

Same idea for: https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/354043cc23ab44baa242a9af406933f058ebd0db/pkg/provisioner/crdprovisioner.go#L587 https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/354043cc23ab44baa242a9af406933f058ebd0db/pkg/provisioner/crdprovisioner.go#L623 can we check them before the preemption process?

hccheng72 avatar Mar 10 '23 18:03 hccheng72

Can we move creating attachmentObj right after attachmentObj isn't found? So if the creation fails, the function can return early and save getReplicasToDetach process.

Current AzVolumeAttachment CRI creation workflow has two possible scenarios: 1) when there are replicas that need to be detached and 2) when there isn't. When no replica needs to be detached, the workflow is rather simple, we create an AzVolumeAttachment CRI with VolumeAttachmentRequestAnnotation, and wait for the disk lun value to be assigned. However, when there are replicas that need to be detached (i.e. maxShares saturated, or max disk attached saturated), we need to issue a Create call for AzVolumeAttachment CRI without a VolumeAttachRequestAnnotation to prevent race between the attach operation of the new primary and the detach operation of existing replicas. This is why we, at the moment, need to check for replicasToDetach prior to creating the CRI.

As for:

Same idea for:

https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/354043cc23ab44baa242a9af406933f058ebd0db/pkg/provisioner/crdprovisioner.go#L587

Yes, I think moving this to before calling for getReplicasToDetach makes sense and be beneficial.

For:

https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/354043cc23ab44baa242a9af406933f058ebd0db/pkg/provisioner/crdprovisioner.go#L623

The error handling we need may exactly be the detachment of the replicas, so it would make sense for us to call getReplicasToDetach prior to checking the error.

sunpa93 avatar Mar 10 '23 19:03 sunpa93

/retest

sunpa93 avatar Mar 17 '23 17:03 sunpa93

@sunpa93: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-azuredisk-csi-driver-e2e-capz-windows-2019-mainv2 944b31cccffa459b5a6707783a14f26ddc40850a link true /test pull-azuredisk-csi-driver-e2e-capz-windows-2019-mainv2

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Mar 17 '23 17:03 k8s-ci-robot

/hold cancel

edreed avatar Mar 17 '23 18:03 edreed

/approve /lgtm

edreed avatar Mar 17 '23 18:03 edreed

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edreed, sunpa93

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Mar 17 '23 18:03 k8s-ci-robot

The test failure is unrelated to this change.

edreed avatar Mar 17 '23 18:03 edreed