gcp-compute-persistent-disk-csi-driver

DisableDevice not working as expected

Open saikat-royc opened this issue 2 years ago • 12 comments

The target path /sys/block/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd/device/state that the DisableDevice function attempts to write to does not exist.

I printed the path in my local setup PR:

I0216 06:18:40.987335       1 device-utils_linux.go:30] DisableDevice called with device path /dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd device name google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd, target filepath /sys/block/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd/device/state
E0216 06:18:40.987444       1 node.go:379] Failed to disabled device /dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd for volume projects/saikatroyc-test/zones/us-central1-c/disks/pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd. Device may not be detached cleanly (error is ignored and unstaging is continuing): open /sys/block/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd/device/state: no such file or directory

Logging into the node, the actual target path is /sys/block/sdb/device/state (for a SCSI device, for example).

saikatroyc@gke-test-pd-default-pool-bd595302-9p4h ~ $ sudo /lib/udev/scsi_id -g -d /dev/sdb
0Google  PersistentDisk  pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd
saikatroyc@gke-test-pd-default-pool-bd595302-9p4h ~ $ cat /sys/block/sdb/device/state
running

What is of interest is how the /dev/sd* device is derived in deviceFsPath here:

$ ls -l /dev/disk/by-id/* | grep pvc-843
lrwxrwxrwx 1 root root  9 Feb 16 06:20 /dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd -> ../../sdb
lrwxrwxrwx 1 root root  9 Feb 16 06:20 /dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd -> ../../sdb
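
For illustration, here is a minimal sketch (not the driver's actual code) of resolving the by-id symlink to the kernel device name before building the sysfs path; the helper name sysfsStatePath is made up for this example:

```go
// Sketch: resolve /dev/disk/by-id/google-<pv> to its kernel name (e.g. "sdb")
// and derive the sysfs state path from that, rather than from the symlink name.
package main

import (
	"fmt"
	"path/filepath"
)

// sysfsStatePath is a hypothetical helper, not part of the driver.
func sysfsStatePath(byIDPath string) (string, error) {
	// EvalSymlinks follows ../../sdb to the real node, e.g. /dev/sdb.
	realPath, err := filepath.EvalSymlinks(byIDPath)
	if err != nil {
		return "", err
	}
	devName := filepath.Base(realPath) // "sdb"
	return filepath.Join("/sys/block", devName, "device", "state"), nil
}

func main() {
	p, err := sysfsStatePath("/dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd")
	fmt.Println(p, err) // expected on the node above: /sys/block/sdb/device/state <nil>
}
```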

saikat-royc avatar Feb 16 '23 06:02 saikat-royc

Ah, good catch.

mattcary avatar Feb 17 '23 00:02 mattcary

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 18 '23 00:05 k8s-triage-robot

/remove-lifecycle stale

Oops, dropped this. I have a quick PR ready.

mattcary avatar May 18 '23 00:05 mattcary

We are seeing the same issue (we are on 1.9.5 atm):

2023-06-27 00:15:46 | E0627 00:15:46.846784       1 node.go:379] Failed to disabled device /dev/disk/by-id/google-pv--860c5184-2736-4ea3-9f38-d819cde18a01 for volume projects/myproject/zones/asia-south1-a/disks/pv--860c5184-2736-4ea3-9f38-d819cde18a01. Device may not be detached cleanly (error is ignored and unstaging is continuing): open /sys/block/google-pv--860c5184-2736-4ea3-9f38-d819cde18a01/device/state: no such file or directory

It seems we also don't have these kinds of directories (/sys/block/google...) on our nodes. What I don't understand, however, is that I only see this issue occasionally. Shouldn't it always result in the volume failing to detach from GCP, since these directories are never present? (And by the way, is there already a release containing https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/pull/1235?)

dguendisch avatar Jun 27 '23 14:06 dguendisch

That's expected, we have not cherry-picked #1225 and have not cut a new release since it was merged.

The error message is benign. We had thought that disabling the device would prevent some race conditions, but on further investigation they don't come up. So the fact that the disabling isn't succeeding shouldn't cause any problems.

But I agree it's annoying --- k8s errors have enough noise that we shouldn't be adding to them :-)

I'll cherry-pick this back to 1.9 and it will get into releases in the next week or two (we're in the middle of pushing out patch releases and I don't know if the cherry-picks will land in this cycle or the next one; there are some CVEs being fixed by updating golang versions / image bases that I don't want to delay).

mattcary avatar Jun 27 '23 16:06 mattcary

The error message is benign.

Ah, so you mean it might not be the cause of what we observe?

Ahem, I haven't actually described what we observe :) We occasionally see volumes failing to attach to node X (where a pod using that volume is scheduled), because the volume is actually still attached to node Y, so GCP rightfully refuses to attach it to X. However, there is no volumeattachment resource for this particular PV that points to Y (only the "new" volumeattachment pointing to X), so the csi-attacher has no reason to detach it from Y. The aforementioned error message was the only error I could spot around the unmounting and detaching of this PV from Y.

dguendisch avatar Jun 27 '23 20:06 dguendisch

Yeah, I think what you're observing is unrelated to this error message. Sorry for the noise & confusion!

What is in node Y's volumesAttached at the time you see this error? (That's in the node status; you can see it with kubectl describe node.)

If there's no volumeattachment on Y, then the attacher should try to reconcile the attachment away, but only if it thinks that the disk is attached to the node. So there is a dance happening between the pv controller and the attacher.

Another thing to try if this situation comes up again: kill the csi-attacher container and let it restart. We've seen some cases with the provisioner where it looks like its informer cache gets stale --- although that seems unlikely and we don't have a consistent reproduction. But if a new csi-attacher instance starts working correctly, a stale cache becomes more plausible.

mattcary avatar Jun 27 '23 21:06 mattcary

Happened again today:

If there's no volumeattachment on Y, then the attacher should try to reconcile the attachment away, but only if it things that the disk is attached to the node. So there is a dance happening between the pv controller and the attacher.

Again, no volumeattachment on Y (though the volume is still attached to Y according to gcloud), and only one volumeattachment resource pointing to X (which of course fails due to the volume being in use). Node Y has no mention of the volume in its .status.volumesAttached/.status.volumesInUse. Node X has the volume listed in .status.volumesInUse (and not in .status.volumesAttached, which seems correct as it's not attached to X). So: no volumeattachment on Y (not even one with a deletionTimestamp) + no reference to the volume in node Y's .status.volumesAttached/.status.volumesInUse => no controller will ever detach this volume from Y.

So I guess the important question is how the "old" volumeattachment on Y could have ever been successfully deleted/finalized?

Another thing to try is that if this situation comes up again, kill the csi-attacher container and let it restart.

Just tried and I guess as expected it didn't change anything as there is no hint that this volume should be detached from Y.

I'll try to dig up more logs; I'd be happy if you had some ideas on how the old volumeattachment on Y could have vanished. Are there maybe some timeouts after which the detachment is finalized even though the detach operation never successfully finished?

dguendisch avatar Jun 28 '23 13:06 dguendisch

I believe the flow is: delete the volumeattachment, csi-attacher notices a node.volumesAttached volume w/o a volume attachment, and then detaches it.

So I think the weird part is how the volumesAttached got cleared without the volume actually being detached?
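
A rough sketch (assuming client-go; "node-y" is a placeholder, and this is not the attacher's actual reconciliation code) of cross-checking a node's status.volumesAttached against the VolumeAttachment objects for that node:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Node status side: what the kubelet reports as attached.
	node, err := cs.CoreV1().Nodes().Get(ctx, "node-y", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, av := range node.Status.VolumesAttached {
		fmt.Println("node reports attached:", av.Name, "at", av.DevicePath)
	}

	// VolumeAttachment side: what the attach/detach machinery has recorded.
	vas, err := cs.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, va := range vas.Items {
		if va.Spec.NodeName != node.Name {
			continue
		}
		pv := "<inline>"
		if va.Spec.Source.PersistentVolumeName != nil {
			pv = *va.Spec.Source.PersistentVolumeName
		}
		fmt.Println("volumeattachment:", va.Name, "pv:", pv, "attached:", va.Status.Attached)
	}
	// A volume listed in volumesAttached with no VolumeAttachment for this node
	// is what the attacher would try to reconcile away (matching the two lists
	// requires parsing the unique volume name, which is omitted here).
}
```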

mattcary avatar Jun 28 '23 16:06 mattcary

Sorry for the initial confusion; of course you were right that the error msg I showed was not related (after all, it came from NodeUnstageVolume, which is about mounting/unmounting, while my actual issue is about attaching/detaching, which would translate to ControllerPublish/UnpublishVolume 😅). I finally figured out the real cause and filed a dedicated issue: https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/issues/1290 Thanks for your suggestions!

dguendisch avatar Jul 06 '23 20:07 dguendisch

DisableDevice is still not working as expected. I'm seeing the following error in the logs:

Failed to disabled device /dev/disk/by-id/google-pvc-7d1b9d39-050f-4c80-9a63-961a703cff2f (aka /dev/sdb) for volume projects/psch-gke-dev/zones/us-central1-c/disks/pvc-7d1b9d39-050f-4c80-9a63-961a703cff2f. Device may not be detached cleanly (ignored, unstaging continues): %!w(*fs.PathError=&{write /sys/block/sdb/device/state 22})
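
As an aside, the %!w(...) in that message indicates the error was formatted with %w through a Sprintf-style call; only fmt.Errorf understands %w, so plain formatting should use %v. A minimal illustration (the PathError here is constructed by hand; errno 22 is EINVAL, i.e. the kernel rejected the write):

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

func main() {
	// Hand-built stand-in for the error seen in the log above.
	err := &fs.PathError{Op: "write", Path: "/sys/block/sdb/device/state", Err: errors.New("invalid argument")}

	// %w outside fmt.Errorf is rendered as %!w(...), the pattern seen in the log above.
	fmt.Println(fmt.Sprintf("Device may not be detached cleanly: %w", err))

	// %v is the verb to use when only formatting (not wrapping) the error.
	fmt.Println(fmt.Sprintf("Device may not be detached cleanly: %v", err))
}
```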

pwschuurman avatar Apr 04 '24 01:04 pwschuurman

It seems that once a device has been disabled, subsequent NodeUnstage/NodeStage calls don't work. This may require the device to be re-enabled.

Calling scsi_id on a device that has been disabled results in the following error:

scsi_id: cannot open /dev/sdh: No such device or address

Re-enabling the device by writing "running" to its sysfs state file works. However, there is a problem: we use scsi_id to read the serial, and the serial is what tells us whether this is the device we're looking for. But we can't identify the device to re-enable without calling scsi_id, which fails on a disabled device. This leads to three possible options (the re-enable step is sketched below the list):

  1. Disable the device through some other means
  2. Lookup the serial some alternate way that does not require the device to be enabled.
  3. Drop this logic, and rely on the filesystem check: https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/pull/1658

(3) is the only viable approach if (1) and (2) aren't possible.
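
For reference, a minimal sketch of the re-enable step described above: writing "running" back to the device's sysfs state file. The device name is assumed to be already known by other means, since scsi_id cannot query an offline device, and enableDevice is a made-up helper, not driver code:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// enableDevice writes "running" to /sys/block/<dev>/device/state,
// the inverse of the state write DisableDevice performs.
func enableDevice(devName string) error {
	statePath := filepath.Join("/sys/block", devName, "device", "state")
	return os.WriteFile(statePath, []byte("running\n"), 0644)
}

func main() {
	// "sdb" is a placeholder; in practice the name has to come from
	// somewhere other than scsi_id while the device is disabled.
	if err := enableDevice("sdb"); err != nil {
		fmt.Fprintln(os.Stderr, "re-enable failed:", err)
	}
}
```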

pwschuurman avatar Apr 16 '24 02:04 pwschuurman

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 15 '24 03:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Aug 14 '24 03:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Sep 13 '24 03:09 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:


/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Sep 13 '24 03:09 k8s-ci-robot