
Node plugin is failing with NodeUnpublishVolume exception constantly


Describe the bug
Node plugin is failing with a NodeUnpublishVolume exception constantly.

To Reproduce
Steps to reproduce the behavior:

  1. Install kadalu 1.1.0 on K8s v1.27.2

Expected behavior
CSI driver to support PUBLISH_UNPUBLISH_VOLUME (?)

Actual behavior

$ kubectl logs kadalu-csi-provisioner-0
. . .
I0606 07:35:36.908846 1 common.go:111] Probing CSI driver for readiness
W0606 07:35:36.910957 1 metrics.go:142] metrics endpoint will not be started because metrics-address was not specified.
I0606 07:35:36.913309 1 csi-provisioner.go:210] CSI driver does not support PUBLISH_UNPUBLISH_VOLUME, not watching VolumeAttachments
. . .

$ kubectl logs kadalu-csi-nodeplugin-d9f7z -c kadalu-nodeplugin
. . .
[2023-06-06 07:56:59,718] ERROR [_server - 454:_call_behavior] - Exception calling application: [1]
Traceback (most recent call last):
  File "/kadalu/lib/python3.10/site-packages/grpc/_server.py", line 444, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/kadalu/nodeserver.py", line 145, in NodeUnpublishVolume
    unmount_volume(request.target_path)
  File "/kadalu/volumeutils.py", line 901, in unmount_volume
    device, _, _ = execute(*cmd)
  File "/kadalu/kadalulib.py", line 187, in execute
    raise CommandException(proc.returncode, out.strip(), err.strip())
kadalulib.CommandException: [1]
. . .
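For context on what the "[1]" means: it is the exit code of the external command that failed inside NodeUnpublishVolume, carried by kadalulib.CommandException and stringified by the gRPC layer. A minimal sketch of that pattern (a simplification for illustration, not the actual kadalu code; the exact command being run is not visible in the truncated log):

```python
import subprocess


class CommandException(Exception):
    """Carries the exit code and output of a failed external command."""
    def __init__(self, returncode, out, err):
        super().__init__(f"[{returncode}] {err}")
        self.returncode = returncode
        self.out = out
        self.err = err


def execute(*cmd):
    """Run a command; raise CommandException on a non-zero exit code."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    out, err = proc.communicate()
    if proc.returncode != 0:
        # A non-zero exit (here: 1) becomes the "[1]" in the nodeplugin log,
        # because the gRPC server logs only str(exception).
        raise CommandException(proc.returncode, out.strip(), err.strip())
    return out.strip(), err.strip(), proc.returncode
```

In other words, some mount-related command run during NodeUnpublishVolume keeps exiting with status 1 for the same target path; the exception message alone does not say which command or why, which is what makes the flooded log hard to act on.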

Environment:

  • Kadalu Version: 1.1.0
  • K8S_DIST: kubernetes v1.27.2

RadoslawBuller avatar Jun 06 '23 07:06 RadoslawBuller

  • seems similar to #948

CSI driver to support PUBLISH_UNPUBLISH_VOLUME

  • no, we don't support it and that's expected (see the sketch below for why the provisioner logs that line)

  • could you please share the pool CR that you used?
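For background, the provisioner message is informational rather than an error: the external-provisioner sidecar only watches VolumeAttachments when the driver advertises the PUBLISH_UNPUBLISH_VOLUME controller capability, and a FUSE-mounted Gluster volume has no separate attach/detach step to implement. A minimal sketch, assuming the CSI spec's generated Python bindings (csi_pb2 / csi_pb2_grpc; module paths are illustrative and this is not kadalu's actual controller code):

```python
import csi_pb2       # generated from the CSI spec's csi.proto
import csi_pb2_grpc  # module paths may differ in a real driver


class ControllerServer(csi_pb2_grpc.ControllerServicer):
    def ControllerGetCapabilities(self, request, context):
        # Only volume create/delete is advertised. PUBLISH_UNPUBLISH_VOLUME is
        # intentionally absent, so the external-provisioner logs
        # "not watching VolumeAttachments" and never calls
        # ControllerPublishVolume / ControllerUnpublishVolume.
        return csi_pb2.ControllerGetCapabilitiesResponse(
            capabilities=[
                csi_pb2.ControllerServiceCapability(
                    rpc=csi_pb2.ControllerServiceCapability.RPC(
                        type=csi_pb2.ControllerServiceCapability.RPC.CREATE_DELETE_VOLUME
                    )
                )
            ]
        )
```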

leelavg avatar Jun 07 '23 01:06 leelavg

I just checked the logs and they are flooded with this error all the time. I used:

$ kubectl kadalu storage-add replica3 --type=Replica3 --device node1:/dev/sda --device=node2:/dev/sda --device node3:/dev/sda

Also, 7 PVCs (out of 9 in my small k8s home cluster) are stuck needing a heal:

$ kubectl-kadalu healinfo
Giving heal summary of volume replica3:
Brick server-replica3-0-0.replica3:/bricks/replica3/data/brick
Status: Connected
Total Number of entries: 7
Number of entries in heal pending: 7
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick server-replica3-1-0.replica3:/bricks/replica3/data/brick
Status: Connected
Total Number of entries: 7
Number of entries in heal pending: 7
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick server-replica3-2-0.replica3:/bricks/replica3/data/brick
Status: Connected
Total Number of entries: 7
Number of entries in heal pending: 7
Number of entries in split-brain: 0
Number of entries possibly healing: 0

List of files needing a heal on replica3:
Brick server-replica3-0-0.replica3:/bricks/replica3/data/brick
/subvol/55/72/pvc-1401651f-6ca3-4dc4-8594-eddf581dd79f
/subvol/96/4c/pvc-05f7121d-39b9-42a8-8754-d95dd11c427c
/subvol/ad/52/pvc-907c2880-f424-4e0b-9b1a-d1c56dd8add5
/subvol/b0/c1/pvc-716143fa-5283-43dc-a5c7-85a359905911
/subvol/e9/2e/pvc-2e987831-f392-419c-a73b-5e9b659b64f9
/subvol/ed/96/pvc-7d3a7237-4452-4990-91c1-bb8eabf892bd
/subvol/82/39/pvc-b07e6d57-141c-4e16-b32a-bd4ea67d80e9
Status: Connected
Number of entries: 7

Brick server-replica3-1-0.replica3:/bricks/replica3/data/brick
/subvol/55/72/pvc-1401651f-6ca3-4dc4-8594-eddf581dd79f
/subvol/96/4c/pvc-05f7121d-39b9-42a8-8754-d95dd11c427c
/subvol/ad/52/pvc-907c2880-f424-4e0b-9b1a-d1c56dd8add5
/subvol/b0/c1/pvc-716143fa-5283-43dc-a5c7-85a359905911
/subvol/e9/2e/pvc-2e987831-f392-419c-a73b-5e9b659b64f9
/subvol/ed/96/pvc-7d3a7237-4452-4990-91c1-bb8eabf892bd
/subvol/82/39/pvc-b07e6d57-141c-4e16-b32a-bd4ea67d80e9
Status: Connected
Number of entries: 7

Brick server-replica3-2-0.replica3:/bricks/replica3/data/brick
/subvol/55/72/pvc-1401651f-6ca3-4dc4-8594-eddf581dd79f
/subvol/96/4c/pvc-05f7121d-39b9-42a8-8754-d95dd11c427c
/subvol/ad/52/pvc-907c2880-f424-4e0b-9b1a-d1c56dd8add5
/subvol/b0/c1/pvc-716143fa-5283-43dc-a5c7-85a359905911
/subvol/e9/2e/pvc-2e987831-f392-419c-a73b-5e9b659b64f9
/subvol/ed/96/pvc-7d3a7237-4452-4990-91c1-bb8eabf892bd
/subvol/82/39/pvc-b07e6d57-141c-4e16-b32a-bd4ea67d80e9
Status: Connected
Number of entries: 7


List of files in splitbrain on replica3:
Brick server-replica3-0-0.replica3:/bricks/replica3/data/brick
Status: Connected
Number of entries in split-brain: 0

Brick server-replica3-1-0.replica3:/bricks/replica3/data/brick
Status: Connected
Number of entries in split-brain: 0

Brick server-replica3-2-0.replica3:/bricks/replica3/data/brick
Status: Connected
Number of entries in split-brain: 0

RadoslawBuller avatar Jun 12 '23 18:06 RadoslawBuller

Will track the nodeplugin error in #948.

leelavg avatar Jun 22 '23 03:06 leelavg

Also, 7 PVCs (out of 9 in my small k8s home cluster) are stuck needing a heal:

  • as discovered in #952, the heal info surfaced to the user is not accurate in some cases
  • we are awaiting a fix at the Gluster layer

leelavg avatar Sep 15 '23 04:09 leelavg