azuredisk-csi-driver
azuredisk-csi-driver copied to clipboard
[V2] bug: WaitForAttach call should also be able to detach replicas for primary
What type of PR is this? /kind bug
What this PR does / why we need it:
- Currently, only CrdProvisioner.PublishVolume is capable of detaching replicas to make resources available for primary attachment.
- However, due to async attach behavior and possible races between CrdProvisioner and the CRD controllers, CrdProvisioner.WaitForAttach also needs to be able to detach replicas for primary.
Which issue(s) this PR fixes:
Fixes #
Requirements:
- [ ] uses conventional commit messages
- [ ] includes documentation
- [ ] adds unit tests
- [ ] tested upgrade from previous version
Special notes for your reviewer:
Release note:
none
/assign @edreed /assign @hccheng72
@sunpa93: GitHub didn't allow me to assign the following users: hccheng72.
Note that only kubernetes-sigs members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
In response to this:
/assign @edreed /assign @hccheng72
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @nearora-msft
Hold pending cut of the next release.
/hold
/hold cancel
/hold This PR will have some conflicts in usage of WaitForAttach
with changes introduced in https://github.com/kubernetes-sigs/azuredisk-csi-driver/pull/1531
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: sunpa93
Once this PR has been reviewed and has the lgtm label, please ask for approval from edreed by writing /assign @edreed
in a comment. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve
in a comment
Approvers can cancel approval by writing /approve cancel
in a comment
/retest
@sunpa93: The following tests failed, say /retest
to rerun all failed tests or /retest-required
to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
pull-azuredisk-csi-driver-verify-mainv2 | 14df7f54ab0398aec11133146c01d1956a344b83 | link | true | /test pull-azuredisk-csi-driver-verify-mainv2 |
pull-azuredisk-csi-driver-e2e-capz-mainv2 | 14df7f54ab0398aec11133146c01d1956a344b83 | link | true | /test pull-azuredisk-csi-driver-e2e-capz-mainv2 |
pull-azuredisk-csi-driver-e2e-capz-windows-2019-mainv2 | 14df7f54ab0398aec11133146c01d1956a344b83 | link | true | /test pull-azuredisk-csi-driver-e2e-capz-windows-2019-mainv2 |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
@sunpa93 Is this change ready to go?
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity,
lifecycle/stale
is applied - After 30d of inactivity since
lifecycle/stale
was applied,lifecycle/rotten
is applied - After 30d of inactivity since
lifecycle/rotten
was applied, the PR is closed
You can:
- Mark this PR as fresh with
/remove-lifecycle stale
- Close this PR with
/close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Can we move creating attachmentObj right after attachmentObj isn't found? So if the creation fails, the function can return early and save getReplicasToDetach
process.
Same idea for: https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/354043cc23ab44baa242a9af406933f058ebd0db/pkg/provisioner/crdprovisioner.go#L587 https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/354043cc23ab44baa242a9af406933f058ebd0db/pkg/provisioner/crdprovisioner.go#L623 can we check them before the preemption process?
Can we move creating attachmentObj right after attachmentObj isn't found? So if the creation fails, the function can return early and save
getReplicasToDetach
process.
Current AzVolumeAttachment
CRI creation workflow has two possible scenarios: 1) when there are replicas that need to be detached and 2) when there isn't. When no replica needs to be detached, the workflow is rather simple, we create an AzVolumeAttachment
CRI with VolumeAttachmentRequestAnnotation
, and wait for the disk lun value to be assigned. However, when there are replicas that need to be detached (i.e. maxShares saturated, or max disk attached saturated), we need to issue a Create call for AzVolumeAttachment
CRI without a VolumeAttachRequestAnnotation
to prevent race between the attach operation of the new primary and the detach operation of existing replicas. This is why we, at the moment, need to check for replicasToDetach
prior to creating the CRI.
As for:
Same idea for:
https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/354043cc23ab44baa242a9af406933f058ebd0db/pkg/provisioner/crdprovisioner.go#L587
Yes, I think moving this to before calling for getReplicasToDetach
makes sense and be beneficial.
For:
https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/354043cc23ab44baa242a9af406933f058ebd0db/pkg/provisioner/crdprovisioner.go#L623
The error handling we need may exactly be the detachment of the replicas, so it would make sense for us to call getReplicasToDetach
prior to checking the error.
/retest
@sunpa93: The following test failed, say /retest
to rerun all failed tests or /retest-required
to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
pull-azuredisk-csi-driver-e2e-capz-windows-2019-mainv2 | 944b31cccffa459b5a6707783a14f26ddc40850a | link | true | /test pull-azuredisk-csi-driver-e2e-capz-windows-2019-mainv2 |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/hold cancel
/approve /lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: edreed, sunpa93
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [edreed]
Approvers can indicate their approval by writing /approve
in a comment
Approvers can cancel approval by writing /approve cancel
in a comment
The test failure is unrelated to this change.