zfs-localpv icon indicating copy to clipboard operation
zfs-localpv copied to clipboard

fix(controller): Fix race of csi controller calls on the same volume

Open Lucaber opened this issue 1 year ago • 5 comments
trafficstars

Why is this PR required? What issue does it fix?:

When a CreateVolume requests takes longer then 10 seconds the request runs into a client-side timeout and gets retied by kubelet. While processing the new request, the old request is still being canceled on the server-side and the created ZFSVolume CR gets deleted.

1 2
CreateVolume call
ZFSVolume CR created
waiting for volume creation
client timeout + context cancel
CreateVolume call
ZFSVolume CR already exists
ZFSVolume CR deleted
Return ERR Return OK without creating a ZFSVolume
I0916 15:01:49.227721       1 grpc.go:72] GRPC call: /csi.v1.Controller/CreateVolume requests {"accessibility_requirements":{"preferred":[{"segments":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"i04-1-b-node-i041b-0-0","kubernetes.io/os":"linux","m3.services/cluster":"i04-1-b","m3.services/worker-group":"i041b-0","node.systems.mittwald.cloud/placement-group":"i041b-0","openebs.io/nodeid":"i04-1-b-node-i041b-0-0","openebs.io/nodename":"i04-1-b-node-i041b-0-0"}}],"requisite":[{"segments":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"i04-1-b-node-i041b-0-0","kubernetes.io/os":"linux","m3.services/cluster":"i04-1-b","m3.services/worker-group":"i041b-0","node.systems.mittwald.cloud/placement-group":"i041b-0","openebs.io/nodeid":"i04-1-b-node-i041b-0-0","openebs.io/nodename":"i04-1-b-node-i041b-0-0"}}]},"capacity_range":{"required_bytes":5368709120},"name":"pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457","parameters":{"compression":"off","csi.storage.k8s.io/pv/name":"pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457","csi.storage.k8s.io/pvc/name":"test-pvc0","csi.storage.k8s.io/pvc/namespace":"default","dedup":"off","fstype":"zfs","poolname":"kluster-pool/i04-1-b-128k","recordsize":"128k","shared":"yes","thinprovision":"yes"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"zfs"}},"access_mode":{"mode":1}}]}
I0916 15:01:58.853180       1 controller.go:289] zfs: trying volume creation kluster-pool/i04-1-b-128k/pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457 on node [i04-1-b-node-i041b-0-0]
I0916 15:01:59.104207       1 volume.go:139] zfs: waiting for volume kluster-pool/i04-1-b-128k/pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457 to be created on nodeid i04-1-b-node-i041b-0-0
I0916 15:02:00.105273       1 volume.go:170] zfs: volume kluster-pool/i04-1-b-128k/pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457 provisioning failed on nodeid i04-1-b-node-i041b-0-0 err: zfs: context deadline reached
I0916 15:02:00.235113       1 grpc.go:72] GRPC call: /csi.v1.Controller/CreateVolume requests {"accessibility_requirements":{"preferred":[{"segments":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"i04-1-b-node-i041b-0-0","kubernetes.io/os":"linux","m3.services/cluster":"i04-1-b","m3.services/worker-group":"i041b-0","node.systems.mittwald.cloud/placement-group":"i041b-0","openebs.io/nodeid":"i04-1-b-node-i041b-0-0","openebs.io/nodename":"i04-1-b-node-i041b-0-0"}}],"requisite":[{"segments":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"i04-1-b-node-i041b-0-0","kubernetes.io/os":"linux","m3.services/cluster":"i04-1-b","m3.services/worker-group":"i041b-0","node.systems.mittwald.cloud/placement-group":"i041b-0","openebs.io/nodeid":"i04-1-b-node-i041b-0-0","openebs.io/nodename":"i04-1-b-node-i041b-0-0"}}]},"capacity_range":{"required_bytes":5368709120},"name":"pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457","parameters":{"compression":"off","csi.storage.k8s.io/pv/name":"pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457","csi.storage.k8s.io/pvc/name":"test-pvc0","csi.storage.k8s.io/pvc/namespace":"default","dedup":"off","fstype":"zfs","poolname":"kluster-pool/i04-1-b-128k","recordsize":"128k","shared":"yes","thinprovision":"yes"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"zfs"}},"access_mode":{"mode":1}}]}
I0916 15:02:00.449153       1 volume.go:221] zfs: deleted the volume pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457
I0916 15:02:00.449299       1 controller.go:466] created the volume kluster-pool/i04-1-b-128k/pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457 on node i04-1-b-node-i041b-0-0
I0916 15:02:00.553778       1 grpc.go:81] GRPC response: {"volume":{"accessible_topology":[{"segments":{"openebs.io/nodeid":"i04-1-b-node-i041b-0-0"}}],"capacity_bytes":5368709120,"volume_context":{"openebs.io/cas-type":"localpv-zfs","openebs.io/poolname":"kluster-pool/i04-1-b-128k"},"volume_id":"pvc-eb11c446-f26c-4333-b3da-4ecb9c5b2457"}}

What this PR does?: Adds a per volume mutex to csi calls to prevent simultaneous requests

Does this PR require any upgrade changes?: No

If the changes in this PR are manually verified, list down the scenarios covered::

Any additional information for your reviewer? : Mention if this PR is part of any design or a continuation of previous PRs

Checklist:

  • [ ] Fixes #
  • [ ] PR Title follows the convention of <type>(<scope>): <subject>
  • [ ] Has the change log section been updated?
  • [ ] Commit has unit tests
  • [ ] Commit has integration tests
  • [ ] (Optional) Are upgrade changes included in this PR? If not, mention the issue/PR to track:
  • [ ] (Optional) If documentation changes are required, which issue on https://github.com/openebs/website is used to track them:

Lucaber avatar Sep 17 '24 15:09 Lucaber

:warning: Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 95.99%. Comparing base (8c402d3) to head (27ec79f). Report is 6 commits behind head on develop.

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #588   +/-   ##
========================================
  Coverage    95.99%   95.99%           
========================================
  Files            1        1           
  Lines          574      574           
========================================
  Hits           551      551           
  Misses          19       19           
  Partials         4        4           
Flag Coverage Δ
bddtests 95.99% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Sep 21 '24 18:09 codecov-commenter

Hmm... this is upsetting. I suppose all of the tests on our Go repos need to use the -race flag on CI pipeline. I've created this ticket: https://github.com/openebs/openebs/issues/3784

niladrih avatar Sep 25 '24 10:09 niladrih

Hmm... this is upsetting. I suppose all of the tests on our Go repos need to use the -race flag on CI pipeline. I've created this ticket: openebs/openebs#3784

This isn't a go data race, -race can't detect this issue. To reproduce this issue I created 50 PVCs with WaitForFirstConsumer and then started a single pod with all 50 volumes to create a high load on both kubelet and openebs.

Lucaber avatar Sep 25 '24 11:09 Lucaber

@Lucaber Please rebase your branch. @tiagolobocastro Are we good to take this in now?

Abhinandan-Purkait avatar Oct 21 '24 06:10 Abhinandan-Purkait

@Lucaber Please rebase your branch. @tiagolobocastro Are we good to take this in now?

Yeah I had approved already

tiagolobocastro avatar Oct 21 '24 09:10 tiagolobocastro

@Lucaber Mark the conversations as resolved if you don't think there is anything more to do before we can merge this.

Abhinandan-Purkait avatar Oct 23 '24 04:10 Abhinandan-Purkait

Merging this once the CI passes. Thanks for the contribution @Lucaber

Abhinandan-Purkait avatar Nov 04 '24 09:11 Abhinandan-Purkait