zfs-localpv
zfs-localpv copied to clipboard
fix(controller): Fix race of csi controller calls on the same volume
Why is this PR required? What issue does it fix?:
When a CreateVolume requests takes longer then 10 seconds the request runs into a client-side timeout and gets retied by kubelet. While processing the new request, the old request is still being canceled on the server-side and the created ZFSVolume CR gets deleted.
| 1 | 2 |
|---|---|
| CreateVolume call | |
| ZFSVolume CR created | |
| waiting for volume creation | |
| client timeout + context cancel | |
| CreateVolume call | |
| ZFSVolume CR already exists | |
| ZFSVolume CR deleted | |
| Return ERR | Return OK without creating a ZFSVolume |
I0916 15:01:49.227721 1 grpc.go:72] GRPC call: /csi.v1.Controller/CreateVolume requests {"accessibility_requirements":{"preferred":[{"segments":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"i04-1-b-node-i041b-0-0","kubernetes.io/os":"linux","m3.services/cluster":"i04-1-b","m3.services/worker-group":"i041b-0","node.systems.mittwald.cloud/placement-group":"i041b-0","openebs.io/nodeid":"i04-1-b-node-i041b-0-0","openebs.io/nodename":"i04-1-b-node-i041b-0-0"}}],"requisite":[{"segments":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"i04-1-b-node-i041b-0-0","kubernetes.io/os":"linux","m3.services/cluster":"i04-1-b","m3.services/worker-group":"i041b-0","node.systems.mittwald.cloud/placement-group":"i041b-0","openebs.io/nodeid":"i04-1-b-node-i041b-0-0","openebs.io/nodename":"i04-1-b-node-i041b-0-0"}}]},"capacity_range":{"required_bytes":5368709120},"name":"pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457","parameters":{"compression":"off","csi.storage.k8s.io/pv/name":"pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457","csi.storage.k8s.io/pvc/name":"test-pvc0","csi.storage.k8s.io/pvc/namespace":"default","dedup":"off","fstype":"zfs","poolname":"kluster-pool/i04-1-b-128k","recordsize":"128k","shared":"yes","thinprovision":"yes"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"zfs"}},"access_mode":{"mode":1}}]}
I0916 15:01:58.853180 1 controller.go:289] zfs: trying volume creation kluster-pool/i04-1-b-128k/pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457 on node [i04-1-b-node-i041b-0-0]
I0916 15:01:59.104207 1 volume.go:139] zfs: waiting for volume kluster-pool/i04-1-b-128k/pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457 to be created on nodeid i04-1-b-node-i041b-0-0
I0916 15:02:00.105273 1 volume.go:170] zfs: volume kluster-pool/i04-1-b-128k/pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457 provisioning failed on nodeid i04-1-b-node-i041b-0-0 err: zfs: context deadline reached
I0916 15:02:00.235113 1 grpc.go:72] GRPC call: /csi.v1.Controller/CreateVolume requests {"accessibility_requirements":{"preferred":[{"segments":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"i04-1-b-node-i041b-0-0","kubernetes.io/os":"linux","m3.services/cluster":"i04-1-b","m3.services/worker-group":"i041b-0","node.systems.mittwald.cloud/placement-group":"i041b-0","openebs.io/nodeid":"i04-1-b-node-i041b-0-0","openebs.io/nodename":"i04-1-b-node-i041b-0-0"}}],"requisite":[{"segments":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"i04-1-b-node-i041b-0-0","kubernetes.io/os":"linux","m3.services/cluster":"i04-1-b","m3.services/worker-group":"i041b-0","node.systems.mittwald.cloud/placement-group":"i041b-0","openebs.io/nodeid":"i04-1-b-node-i041b-0-0","openebs.io/nodename":"i04-1-b-node-i041b-0-0"}}]},"capacity_range":{"required_bytes":5368709120},"name":"pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457","parameters":{"compression":"off","csi.storage.k8s.io/pv/name":"pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457","csi.storage.k8s.io/pvc/name":"test-pvc0","csi.storage.k8s.io/pvc/namespace":"default","dedup":"off","fstype":"zfs","poolname":"kluster-pool/i04-1-b-128k","recordsize":"128k","shared":"yes","thinprovision":"yes"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"zfs"}},"access_mode":{"mode":1}}]}
I0916 15:02:00.449153 1 volume.go:221] zfs: deleted the volume pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457
I0916 15:02:00.449299 1 controller.go:466] created the volume kluster-pool/i04-1-b-128k/pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457 on node i04-1-b-node-i041b-0-0
I0916 15:02:00.553778 1 grpc.go:81] GRPC response: {"volume":{"accessible_topology":[{"segments":{"openebs.io/nodeid":"i04-1-b-node-i041b-0-0"}}],"capacity_bytes":5368709120,"volume_context":{"openebs.io/cas-type":"localpv-zfs","openebs.io/poolname":"kluster-pool/i04-1-b-128k"},"volume_id":"pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457"}}
What this PR does?: Adds a per volume mutex to csi calls to prevent simultaneous requests
Does this PR require any upgrade changes?: No
If the changes in this PR are manually verified, list down the scenarios covered::
Any additional information for your reviewer? : Mention if this PR is part of any design or a continuation of previous PRs
Checklist:
- [ ] Fixes #
- [ ] PR Title follows the convention of
<type>(<scope>): <subject> - [ ] Has the change log section been updated?
- [ ] Commit has unit tests
- [ ] Commit has integration tests
- [ ] (Optional) Are upgrade changes included in this PR? If not, mention the issue/PR to track:
- [ ] (Optional) If documentation changes are required, which issue on https://github.com/openebs/website is used to track them:
:warning: Please install the to ensure uploads and comments are reliably processed by Codecov.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 95.99%. Comparing base (
8c402d3) to head (27ec79f). Report is 6 commits behind head on develop.
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@ Coverage Diff @@
## develop #588 +/- ##
========================================
Coverage 95.99% 95.99%
========================================
Files 1 1
Lines 574 574
========================================
Hits 551 551
Misses 19 19
Partials 4 4
| Flag | Coverage Δ | |
|---|---|---|
| bddtests | 95.99% <ø> (ø) |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Hmm... this is upsetting. I suppose all of the tests on our Go repos need to use the -race flag on CI pipeline.
I've created this ticket: https://github.com/openebs/openebs/issues/3784
Hmm... this is upsetting. I suppose all of the tests on our Go repos need to use the
-raceflag on CI pipeline. I've created this ticket: openebs/openebs#3784
This isn't a go data race, -race can't detect this issue.
To reproduce this issue I created 50 PVCs with WaitForFirstConsumer and then started a single pod with all 50 volumes to create a high load on both kubelet and openebs.
@Lucaber Please rebase your branch. @tiagolobocastro Are we good to take this in now?
@Lucaber Please rebase your branch. @tiagolobocastro Are we good to take this in now?
Yeah I had approved already
@Lucaber Mark the conversations as resolved if you don't think there is anything more to do before we can merge this.
Merging this once the CI passes. Thanks for the contribution @Lucaber