
CNS volumes disappear and all in cluster operations fail

Open · LinAnt opened this issue 2 years ago • 20 comments

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: In 2 separate Kubernetes clusters we have observed failures when pods get moved and, as a result, their PVs have to be attached to a new node. The CNS volume object seems to just disappear, while the FCD is still intact on the vSAN. Nothing has triggered a deletion as far as I can see...
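
To illustrate the mismatch, here is a minimal govmomi sketch that asks CNS directly about the volume ID from the failing attach (untested; the vCenter URL and credentials are placeholders):

package main

import (
	"context"
	"fmt"
	"net/url"

	"github.com/vmware/govmomi"
	"github.com/vmware/govmomi/cns"
	cnstypes "github.com/vmware/govmomi/cns/types"
)

func main() {
	ctx := context.Background()

	// Placeholder vCenter endpoint and credentials.
	u, _ := url.Parse("https://administrator%40vsphere.local:secret@vcenter.example.com/sdk")
	c, err := govmomi.NewClient(ctx, u, true)
	if err != nil {
		panic(err)
	}

	cnsClient, err := cns.NewClient(ctx, c.Client)
	if err != nil {
		panic(err)
	}

	// The volume ID reported in the FailedAttachVolume event.
	filter := cnstypes.CnsQueryFilter{
		VolumeIds: []cnstypes.CnsVolumeId{{Id: "eb05ab8b-dcd0-4217-9c34-3ec8bda666a9"}},
	}
	res, err := cnsClient.QueryVolume(ctx, filter)
	if err != nil {
		panic(err)
	}
	// Zero results here, while `govc disk.ls` still lists the FCD, is
	// exactly the "not found in QueryVolume" state from the driver logs.
	fmt.Printf("CNS knows about %d volume(s) with this ID\n", len(res.Volumes))
}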

Events from the kubernetes side:

kubectl get event -n workload
LAST SEEN   TYPE      REASON               OBJECT                       MESSAGE
26s         Normal    Scheduled            pod/redis-754fbf4bd-26lbf    Successfully assigned workload/redis-754fbf4bd-26lbf to worker3
11s         Warning   FailedAttachVolume   pod/redis-754fbf4bd-26lbf    AttachVolume.Attach failed for volume "pvc-202662bf-3ce7-40a0-96af-d22d58198dce" : rpc error: code = Internal desc = failed to attach disk: "eb05ab8b-dcd0-4217-9c34-3ec8bda666a9" with node: "worker3" err ServerFaultCode: Received SOAP response fault from [<cs p:00007fa4dc0a6290, TCP:localhost:443>]: retrieveVStorageObject

Logs from csi-controller:

2021-09-17 12:27:36	
I0917 09:27:36.946521       1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
2021-09-17 12:27:36	
I0917 09:27:36.340379       1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
2021-09-17 12:27:35	
I0917 09:27:35.969819       1 csi_handler.go:226] Error processing "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba": failed to detach: rpc error: code = Internal desc = volumeID "eb05ab8b-dcd0-4217-9c34-3ec8bda666a9" not found in QueryVolume
2021-09-17 12:27:35	
I0917 09:27:35.969770       1 csi_handler.go:612] Saved detach error to "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba"
2021-09-17 12:27:35	
I0917 09:27:35.967871       1 controller.go:158] Ignoring VolumeAttachment "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba" change
2021-09-17 12:27:35	
I0917 09:27:35.952349       1 csi_handler.go:601] Saving detach error to "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba"
2021-09-17 12:27:35	
{"level":"error","time":"2021-09-17T09:27:35.951799411Z","caller":"vanilla/controller.go:883","msg":"volumeID \"eb05ab8b-dcd0-4217-9c34-3ec8bda666a9\" not found in QueryVolume","TraceId":"e3830e8e-5954-433b-a777-e5623668c7b3","stacktrace":"sigs.k8s.io/vsphere-csi-driver/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:883\nsigs.k8s.io/vsphere-csi-driver/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:937\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler.func1\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5200\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerUnpublishVolume\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:141\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:88\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer.func1\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:178\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:218\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:177\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi.(*StoragePlugin).injectContext\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware.go:231\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:106\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5202\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:722"}
2021-09-17 12:27:35	
{"level":"info","time":"2021-09-17T09:27:35.904885335Z","caller":"vanilla/controller.go:857","msg":"ControllerUnpublishVolume: called with args {VolumeId:eb05ab8b-dcd0-4217-9c34-3ec8bda666a9 NodeId:worker3 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"e3830e8e-5954-433b-a777-e5623668c7b3"}
2021-09-17 12:27:35	
I0917 09:27:35.903737       1 csi_handler.go:715] Found NodeID worker3 in CSINode worker3

And some entries from vsanvcmgmtd:

2021-09-15T22:18:29.157Z verbose vsanvcmgmtd[10186] [vSAN@6876 sub=PyBackedMO opId=sps-Main-13723-334-8a1a]  Enter vasa.NotificationManager.getAlarm, Pending: 1 (5269718e-b4b7-0511-b82b-b8f707de46d5)
2021-09-15T22:18:29.157Z verbose vsanvcmgmtd[10186] [vSAN@6876 sub=PyBackedMO opId=sps-Main-13723-334-8a1a]  Exit  vasa.NotificationManager.getAlarm (0 ms)
2021-09-15T22:18:38.683Z info vsanvcmgmtd[07066] [vSAN@6876 sub=AccessChecker] User CLISTER.LOCAL\[email protected] was authenticated with soap session id. 52db629f-f84a-15b2-6401-b41b25af9ec7 (52a55bd9-9ac9-d4bd-d9f0-d6145bb4f7a5)
2021-09-15T22:18:38.704Z verbose vsanvcmgmtd[07065] [vSAN@6876 sub=PyBackedMO opId=07dc8a1b]  Enter vim.cns.VolumeManager.queryAll, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:38.706Z verbose vsanvcmgmtd[07065] [vSAN@6876 sub=PyBackedMO opId=07dc8a1b]  Exit  vim.cns.VolumeManager.queryAll (1 ms)
2021-09-15T22:18:38.731Z verbose vsanvcmgmtd[07104] [vSAN@6876 sub=PyBackedMO opId=07dc8a1c]  Enter vim.cns.VolumeManager.query, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:38.885Z verbose vsanvcmgmtd[07104] [vSAN@6876 sub=PyBackedMO opId=07dc8a1c]  Exit  vim.cns.VolumeManager.query (154 ms)
2021-09-15T22:18:38.923Z verbose vsanvcmgmtd[10186] [vSAN@6876 sub=PyBackedMO opId=07dc8a1d]  Enter vim.cns.VolumeManager.create, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:38.928Z info vsanvcmgmtd[10186] [vSAN@6876 sub=VolumeManager opId=07dc8a1d] CNS: create volume task created: task-314869
2021-09-15T22:18:38.928Z info vsanvcmgmtd[10186] [vSAN@6876 sub=VolumeManager opId=07dc8a1d] CNS: Creating volume task scheduled.
2021-09-15T22:18:38.928Z verbose vsanvcmgmtd[10186] [vSAN@6876 sub=PyBackedMO opId=07dc8a1d]  Exit  vim.cns.VolumeManager.create (4 ms)
2021-09-15T22:18:38.933Z info vsanvcmgmtd[25351] [vSAN@6876 sub=VolumeManager opId=07dc8a1d] CNS: Creating volume task started
2021-09-15T22:18:39.029Z error vsanvcmgmtd[25351] [vSAN@6876 sub=VolumeManager opId=07dc8a1d] CNS: backingDiskId not found: eb05ab8b-dcd0-4217-9c34-3ec8bda666a9, N3Vim5Fault8NotFound9ExceptionE(Fault cause: vim.fault.NotFound
--> )
--> [context]zKq7AVECAAAAAFnREgEVdnNhbnZjbWdtdGQAAFy7KmxpYnZtYWNvcmUuc28AAAw5GwB+uxgBNG38bGlidmltLXR5cGVzLnNvAIG7mg8BgVrlDwECtWEObGlidm1vbWkuc28AAspdEQJXXxEDHuoCbGliUHlDcHBWbW9taS5zbwACHJISAvGKEgTozgNsaWJ2c2xtLXR5cGVzLnNvAAXF3AdfY25zLnNvAAVN1gUFixQGABYkIwCSJiMA5RIrBtRzAGxpYnB0aHJlYWQuc28uMAAHzY4ObGliYy5zby42AA==[/context]
2021-09-15T22:18:39.031Z info vsanvcmgmtd[25351] [vSAN@6876 sub=VolumeManager opId=07dc8a1d] Create volume completed: task-314869
2021-09-15T22:18:39.054Z verbose vsanvcmgmtd[12908] [vSAN@6876 sub=PyBackedMO opId=07dc8a1e]  Enter vim.cns.VolumeManager.create, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:39.057Z info vsanvcmgmtd[12908] [vSAN@6876 sub=VolumeManager opId=07dc8a1e] CNS: create volume task created: task-314870
2021-09-15T22:18:39.057Z info vsanvcmgmtd[12908] [vSAN@6876 sub=VolumeManager opId=07dc8a1e] CNS: Creating volume task scheduled.
2021-09-15T22:18:39.057Z verbose vsanvcmgmtd[12908] [vSAN@6876 sub=PyBackedMO opId=07dc8a1e]  Exit  vim.cns.VolumeManager.create (3 ms)
2021-09-15T22:18:39.062Z info vsanvcmgmtd[43408] [vSAN@6876 sub=VolumeManager opId=07dc8a1e] CNS: Creating volume task started
2021-09-15T22:18:39.158Z verbose vsanvcmgmtd[07049] [vSAN@6876 sub=PyBackedMO opId=sps-Main-13723-334-8a1f]  Enter vasa.NotificationManager.getAlarm, Pending: 1 (5269718e-b4b7-0511-b82b-b8f707de46d5)
2021-09-15T22:18:39.158Z verbose vsanvcmgmtd[07049] [vSAN@6876 sub=PyBackedMO opId=sps-Main-13723-334-8a1f]  

What you expected to happen:

CNS volume objects should not just disappear?

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • csi-vsphere version:
  • vsphere-cloud-controller-manager version: v1.20.0
  • Kubernetes version: 1.20.10
  • vSphere version: 6.7u3 (6.7.0.48000)
  • OS (e.g. from /etc/os-release): ubuntu 20.04
  • Kernel (e.g. uname -a): 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:

LinAnt avatar Sep 17 '21 10:09 LinAnt

I'm observing a very similar behaviour here: a PVC has just disappeared from vSphere and can't be attached now.

MaxRink avatar Oct 28 '21 15:10 MaxRink

Hello, me too, I have the same behavior (CSI version 2.2.2).

When I list volumes using govc volume.ls, the volume is not present. But with the command govc disk.ls, it is.

Sometimes the volume shows up and then disappears again, which is very strange.

aevenot avatar Dec 13 '21 13:12 aevenot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 13 '22 14:03 k8s-triage-robot

I also observe a similar issue. Some details from the log:

  1. volume is attached to node1 successfully
  2. volume is detached from node1 successfully (the strange thing is that detach is called twice consecutively, and the second time it logs that the volume is not found)
  3. volume attach to node2 is attempted, but it fails with a "not found" error
{"level":"info","time":"2022-04-10T17:35:41.790422844Z","caller":"vanilla/controller.go:951","msg":"ControllerPublishVolume: called with args {VolumeId:e5dc0f1b-399b-4839-955e-18a174e4e1ea NodeId:08050d6e39f9-qual-322-0afbae8d VolumeCapability:mount:<fs_type:\"ext4\" > access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[storage.kubernetes.io/csiProvisionerIdentity:1649603959450-8081-csi.vsphere.vmware.com type:vSphere CNS Block Volume] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"b7849fe8-655d-41af-83f8-d5704cc297a0"}
{"level":"info","time":"2022-04-10T17:35:42.211793124Z","caller":"volume/manager.go:632","msg":"AttachVolume: volumeID: \"e5dc0f1b-399b-4839-955e-18a174e4e1ea\", vm: \"VirtualMachine:vm-1013713 [VirtualCenterHost: atl-qual-vc02.anthos, UUID: 4204f473-2987-d995-719b-eb65cfc7b00b, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: atl-qual-vc02.anthos]]\", opId: \"0cfc5ab7\"","TraceId":"b7849fe8-655d-41af-83f8-d5704cc297a0"}
{"level":"info","time":"2022-04-10T17:35:42.211860571Z","caller":"volume/manager.go:667","msg":"AttachVolume: Volume attached successfully. volumeID: \"e5dc0f1b-399b-4839-955e-18a174e4e1ea\", opId: \"0cfc5ab7\", vm: \"VirtualMachine:vm-1013713 [VirtualCenterHost: atl-qual-vc02.anthos, UUID: 4204f473-2987-d995-719b-eb65cfc7b00b, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: atl-qual-vc02.anthos]]\", diskUUID: \"6000C296-2e28-3df5-d780-af6831724cc6\"","TraceId":"b7849fe8-655d-41af-83f8-d5704cc297a0"}
{"level":"info","time":"2022-04-10T17:35:42.211891383Z","caller":"vanilla/controller.go:1045","msg":"ControllerPublishVolume successful with publish context: map[diskUUID:6000c2962e283df5d780af6831724cc6 type:vSphere CNS Block Volume]","TraceId":"b7849fe8-655d-41af-83f8-d5704cc297a0"}
{"level":"info","time":"2022-04-10T17:36:31.667566148Z","caller":"vanilla/controller.go:1075","msg":"ControllerUnpublishVolume: called with args {VolumeId:e5dc0f1b-399b-4839-955e-18a174e4e1ea NodeId:08050d6e39f9-qual-322-0afbae8d Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"b0832a4a-28da-4c22-a869-8cacde7c7592"}
{"level":"info","time":"2022-04-10T17:36:31.928688013Z","caller":"volume/manager.go:753","msg":"DetachVolume: volumeID: \"e5dc0f1b-399b-4839-955e-18a174e4e1ea\", vm: \"VirtualMachine:vm-1013713 [VirtualCenterHost: atl-qual-vc02.anthos, UUID: 4204f473-2987-d995-719b-eb65cfc7b00b, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: atl-qual-vc02.anthos]]\", opId: \"0cfc5ad6\"","TraceId":"b0832a4a-28da-4c22-a869-8cacde7c7592"}
{"level":"info","time":"2022-04-10T17:36:31.928813846Z","caller":"volume/manager.go:788","msg":"DetachVolume: Volume detached successfully. volumeID: \"e5dc0f1b-399b-4839-955e-18a174e4e1ea\", vm: \"0cfc5ad6\", opId: \"VirtualMachine:vm-1013713 [VirtualCenterHost: atl-qual-vc02.anthos, UUID: 4204f473-2987-d995-719b-eb65cfc7b00b, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: atl-qual-vc02.anthos]]\"","TraceId":"b0832a4a-28da-4c22-a869-8cacde7c7592"}
{"level":"info","time":"2022-04-10T17:36:31.92890707Z","caller":"vanilla/controller.go:1159","msg":"ControllerUnpublishVolume successful for volume ID: e5dc0f1b-399b-4839-955e-18a174e4e1ea","TraceId":"b0832a4a-28da-4c22-a869-8cacde7c7592"}
{"level":"info","time":"2022-04-10T17:36:31.938332455Z","caller":"vanilla/controller.go:1075","msg":"ControllerUnpublishVolume: called with args {VolumeId:e5dc0f1b-399b-4839-955e-18a174e4e1ea NodeId:08050d6e39f9-qual-322-0afbae8d Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"7e6019ad-5b7e-4106-981f-a68a87501d93"}
{"level":"info","time":"2022-04-10T17:36:32.091578876Z","caller":"volume/manager.go:753","msg":"DetachVolume: volumeID: \"e5dc0f1b-399b-4839-955e-18a174e4e1ea\", vm: \"VirtualMachine:vm-1013713 [VirtualCenterHost: atl-qual-vc02.anthos, UUID: 4204f473-2987-d995-719b-eb65cfc7b00b, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: atl-qual-vc02.anthos]]\", opId: \"0cfc5ada\"","TraceId":"7e6019ad-5b7e-4106-981f-a68a87501d93"}
{"level":"info","time":"2022-04-10T17:36:32.091902929Z","caller":"volume/util.go:364","msg":"Extract vimfault type: +*types.NotFound  vimFault: +&{{{<nil> []}}} Fault: &{DynamicData:{} Fault:0xc000d7b980 LocalizedMessage:The object or item referred to could not be found.} from resp: +&{{} {{} } 0xc000d7b720}","TraceId":"7e6019ad-5b7e-4106-981f-a68a87501d93"}
{"level":"info","time":"2022-04-10T17:36:32.094496975Z","caller":"volume/manager.go:780","msg":"DetachVolume: volumeID: \"e5dc0f1b-399b-4839-955e-18a174e4e1ea\" not found on vm: VirtualMachine:vm-1013713 [VirtualCenterHost: atl-qual-vc02.anthos, UUID: 4204f473-2987-d995-719b-eb65cfc7b00b, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: atl-qual-vc02.anthos]]. Assuming it is already detached","TraceId":"7e6019ad-5b7e-4106-981f-a68a87501d93"}
{"level":"info","time":"2022-04-10T17:36:32.094536352Z","caller":"vanilla/controller.go:1159","msg":"ControllerUnpublishVolume successful for volume ID: e5dc0f1b-399b-4839-955e-18a174e4e1ea","TraceId":"7e6019ad-5b7e-4106-981f-a68a87501d93"}
{"level":"info","time":"2022-04-10T17:40:09.190591921Z","caller":"node/manager.go:106","msg":"Successfully registered node: \"08050d6e39f9-qual-322-0afbac17\" with nodeUUID \"4204f258-985f-b7f0-0782-e9a78fe37425\"","TraceId":"e9b8c1df-92a4-4c20-a570-e02e52a6e6ff"}
{"level":"info","time":"2022-04-10T17:40:09.190652275Z","caller":"vsphere/virtualmachine.go:147","msg":"Initiating asynchronous datacenter listing with uuid 4204f258-985f-b7f0-0782-e9a78fe37425","TraceId":"e9b8c1df-92a4-4c20-a570-e02e52a6e6ff"}
{"level":"info","time":"2022-04-10T17:40:09.218975792Z","caller":"vsphere/datacenter.go:149","msg":"Publishing datacenter Datacenter [Datacenter: Datacenter:datacenter-3 @ /atl-qual-vc02, VirtualCenterHost: atl-qual-vc02.anthos]","TraceId":"e9b8c1df-92a4-4c20-a570-e02e52a6e6ff"}
{"level":"info","time":"2022-04-10T17:40:09.219055143Z","caller":"vsphere/virtualmachine.go:184","msg":"AsyncGetAllDatacenters with uuid 4204f258-985f-b7f0-0782-e9a78fe37425 sent a dc Datacenter [Datacenter: Datacenter:datacenter-3 @ /atl-qual-vc02, VirtualCenterHost: atl-qual-vc02.anthos]","TraceId":"e9b8c1df-92a4-4c20-a570-e02e52a6e6ff"}
{"level":"info","time":"2022-04-10T17:40:09.224598843Z","caller":"vsphere/virtualmachine.go:198","msg":"Found VM VirtualMachine:vm-1013755 [VirtualCenterHost: atl-qual-vc02.anthos, UUID: 4204f258-985f-b7f0-0782-e9a78fe37425, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3 @ /atl-qual-vc02, VirtualCenterHost: atl-qual-vc02.anthos]] given uuid 4204f258-985f-b7f0-0782-e9a78fe37425 on DC Datacenter [Datacenter: Datacenter:datacenter-3 @ /atl-qual-vc02, VirtualCenterHost: atl-qual-vc02.anthos]","TraceId":"e9b8c1df-92a4-4c20-a570-e02e52a6e6ff"}
{"level":"info","time":"2022-04-10T17:40:09.224641496Z","caller":"vsphere/virtualmachine.go:209","msg":"Returning VM VirtualMachine:vm-1013755 [VirtualCenterHost: atl-qual-vc02.anthos, UUID: 4204f258-985f-b7f0-0782-e9a78fe37425, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3 @ /atl-qual-vc02, VirtualCenterHost: atl-qual-vc02.anthos]] for UUID 4204f258-985f-b7f0-0782-e9a78fe37425","TraceId":"e9b8c1df-92a4-4c20-a570-e02e52a6e6ff"}
{"level":"info","time":"2022-04-10T17:40:09.224664302Z","caller":"node/manager.go:127","msg":"Successfully discovered node with nodeUUID 4204f258-985f-b7f0-0782-e9a78fe37425 in vm VirtualMachine:vm-1013755 [VirtualCenterHost: atl-qual-vc02.anthos, UUID: 4204f258-985f-b7f0-0782-e9a78fe37425, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3 @ /atl-qual-vc02, VirtualCenterHost: atl-qual-vc02.anthos]]","TraceId":"e9b8c1df-92a4-4c20-a570-e02e52a6e6ff"}
{"level":"info","time":"2022-04-10T17:40:09.224677411Z","caller":"node/manager.go:112","msg":"Successfully discovered node: \"08050d6e39f9-qual-322-0afbac17\" with nodeUUID \"4204f258-985f-b7f0-0782-e9a78fe37425\"","TraceId":"e9b8c1df-92a4-4c20-a570-e02e52a6e6ff"}
{"level":"info","time":"2022-04-10T17:40:35.835323129Z","caller":"vanilla/controller.go:951","msg":"ControllerPublishVolume: called with args {VolumeId:e5dc0f1b-399b-4839-955e-18a174e4e1ea NodeId:08050d6e39f9-qual-322-0afbac17 VolumeCapability:mount:<fs_type:\"ext4\" > access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[storage.kubernetes.io/csiProvisionerIdentity:1649603959450-8081-csi.vsphere.vmware.com type:vSphere CNS Block Volume] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"4040d825-2600-4c49-b02d-e89d42ccc2fe"}
{"level":"error","time":"2022-04-10T17:40:35.859169488Z","caller":"volume/manager.go:616","msg":"CNS AttachVolume failed from vCenter \"atl-qual-vc02.anthos\" with err: ServerFaultCode: CNS: Failed to retrieve datastore for vol e5dc0f1b-399b-4839-955e-18a174e4e1ea. (vim.fault.NotFound) {\n   faultCause = (vmodl.MethodFault) null, \n   faultMessage = <unset>\n   msg = \"The vStorageObject (vim.vslm.ID) {\n   dynamicType = null,\n   dynamicProperty = null,\n   id = e5dc0f1b-399b-4839-955e-18a174e4e1ea\n} was not found\"\n}","TraceId":"4040d825-2600-4c49-b02d-e89d42ccc2fe","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/common/cns-lib/volume.(*defaultManager).AttachVolume.func1\n\t/build/pkg/common/cns-lib/volume/manager.go:616\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/common/cns-lib/volume.(*defaultManager).AttachVolume\n\t/build/pkg/common/cns-lib/volume/manager.go:672\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common.AttachVolumeUtil\n\t/build/pkg/csi/service/common/vsphereutil.go:548\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).ControllerPublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:1037\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).ControllerPublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:1050\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerPublishVolume_Handler.func1\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5632\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerPublishVolume\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:120\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:86\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer.func1\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:178\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:218\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:177\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi.(*StoragePlugin).injectContext\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware.go:231\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:106\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerPublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5634\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email 
protected]/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:722"}
{"level":"info","time":"2022-04-10T17:40:35.861356096Z","caller":"volume/util.go:343","msg":"Extract vimfault type: +types.NotFound. SoapFault Info: +&{{http://schemas.xmlsoap.org/soap/envelope/ Fault} ServerFaultCode CNS: Failed to retrieve datastore for vol e5dc0f1b-399b-4839-955e-18a174e4e1ea. (vim.fault.NotFound) {\n   faultCause = (vmodl.MethodFault) null, \n   faultMessage = <unset>\n   msg = \"The vStorageObject (vim.vslm.ID) {\n   dynamicType = null,\n   dynamicProperty = null,\n   id = e5dc0f1b-399b-4839-955e-18a174e4e1ea\n} was not found\"\n} {{{{<nil> []}}}}} from err +ServerFaultCode: CNS: Failed to retrieve datastore for vol e5dc0f1b-399b-4839-955e-18a174e4e1ea. (vim.fault.NotFound) {\n   faultCause = (vmodl.MethodFault) null, \n   faultMessage = <unset>\n   msg = \"The vStorageObject (vim.vslm.ID) {\n   dynamicType = null,\n   dynamicProperty = null,\n   id = e5dc0f1b-399b-4839-955e-18a174e4e1ea\n} was not found\"\n}","TraceId":"4040d825-2600-4c49-b02d-e89d42ccc2fe"}


jingxu97 avatar Apr 12 '22 01:04 jingxu97

cc @xing-yang @divyenpatel

jingxu97 avatar Apr 12 '22 01:04 jingxu97

/remove-lifecycle stale

xing-yang avatar Apr 12 '22 01:04 xing-yang

@jingxu97 What version of the CSI driver are you using? What is the vSphere version? Did it retry when this happened? Also, what triggered the detach from node 1 and the attach to node 2?

xing-yang avatar Apr 12 '22 02:04 xing-yang

vSphere version: 7.0u3, CSI driver: 2.4.0

The test (Anthos storage qualification test) is designed to move the pod to a different node and then access the same volume.

jingxu97 avatar Apr 12 '22 02:04 jingxu97

From our testing, it looks like 6.7u3 is OK, but 7.0 fails very often.

jingxu97 avatar Apr 12 '22 17:04 jingxu97

Another issue related to this test: since the attach failed with a "not found" error, the attach_detach_controller marks the volume as uncertain and tries to detach it when the pod is deleted; however, the detach also fails with a "not found" error.

{"level":"info","time":"2022-04-10T17:44:49.039474284Z","caller":"vanilla/controller.go:951","msg":"ControllerPublishVolume: called with args {VolumeId:e5dc0f1b-399b-4839-955e-18a174e4e1ea NodeId:08050d6e39f9-qual-322-0afbac17 VolumeCapability:mount:<fs_type:\"ext4\" > access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[storage.kubernetes.io/csiProvisionerIdentity:1649603959450-8081-csi.vsphere.vmware.com type:vSphere CNS Block Volume] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"6d250629-8008-41b7-baed-2deb2a6baca9"}
{"level":"error","time":"2022-04-10T17:44:49.072198256Z","caller":"volume/manager.go:616","msg":"CNS AttachVolume failed from vCenter \"atl-qual-vc02.anthos\" with err: ServerFaultCode: CNS: Failed to retrieve datastore for vol e5dc0f1b-399b-4839-955e-18a174e4e1ea. (vim.fault.NotFound) {\n   faultCause = (vmodl.MethodFault) null, \n   faultMessage = <unset>\n   msg = \"The vStorageObject (vim.vslm.ID) {\n   dynamicType = null,\n   dynamicProperty = null,\n   id = e5dc0f1b-399b-4839-955e-18a174e4e1ea\n} was not found\"\n}","TraceId":"6d250629-8008-41b7-baed-2deb2a6baca9","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/common/cns-lib/volume.(*defaultManager).AttachVolume.func1\n\t/build/pkg/common/cns-lib/volume/manager.go:616\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/common/cns-lib/volume.(*defaultManager).AttachVolume\n\t/build/pkg/common/cns-lib/volume/manager.go:672\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common.AttachVolumeUtil\n\t/build/pkg/csi/service/common/vsphereutil.go:548\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).ControllerPublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:1037\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).ControllerPublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:1050\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerPublishVolume_Handler.func1\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5632\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerPublishVolume\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:120\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:86\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer.func1\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:178\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:218\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:177\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi.(*StoragePlugin).injectContext\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware.go:231\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:106\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerPublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5634\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email 
protected]/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:722"}
{"level":"info","time":"2022-04-10T17:44:49.07233599Z","caller":"volume/util.go:343","msg":"Extract vimfault type: +types.NotFound. SoapFault Info: +&{{http://schemas.xmlsoap.org/soap/envelope/ Fault} ServerFaultCode CNS: Failed to retrieve datastore for vol e5dc0f1b-399b-4839-955e-18a174e4e1ea. (vim.fault.NotFound) {\n   faultCause = (vmodl.MethodFault) null, \n   faultMessage = <unset>\n   msg = \"The vStorageObject (vim.vslm.ID) {\n   dynamicType = null,\n   dynamicProperty = null,\n   id = e5dc0f1b-399b-4839-955e-18a174e4e1ea\n} was not found\"\n} {{{{<nil> []}}}}} from err +ServerFaultCode: CNS: Failed to retrieve datastore for vol e5dc0f1b-399b-4839-955e-18a174e4e1ea. (vim.fault.NotFound) {\n   faultCause = (vmodl.MethodFault) null, \n   faultMessage = <unset>\n   msg = \"The vStorageObject (vim.vslm.ID) {\n   dynamicType = null,\n   dynamicProperty = null,\n   id = e5dc0f1b-399b-4839-955e-18a174e4e1ea\n} was not found\"\n}","TraceId":"6d250629-8008-41b7-baed-2deb2a6baca9"}
{"level":"error","time":"2022-04-10T17:44:49.072375655Z","caller":"common/vsphereutil.go:550","msg":"failed to attach disk \"e5dc0f1b-399b-4839-955e-18a174e4e1ea\" with VM: \"VirtualMachine:vm-1013755 [VirtualCenterHost: atl-qual-vc02.anthos, UUID: 4204f258-985f-b7f0-0782-e9a78fe37425, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: atl-qual-vc02.anthos]]\". err: ServerFaultCode: CNS: Failed to retrieve datastore for vol e5dc0f1b-399b-4839-955e-18a174e4e1ea. (vim.fault.NotFound) {\n   faultCause = (vmodl.MethodFault) null, \n   faultMessage = <unset>\n   msg = \"The vStorageObject (vim.vslm.ID) {\n   dynamicType = null,\n   dynamicProperty = null,\n   id = e5dc0f1b-399b-4839-955e-18a174e4e1ea\n} was not found\"\n} faultType \"vim.fault.NotFound\"","TraceId":"6d250629-8008-41b7-baed-2deb2a6baca9","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common.AttachVolumeUtil\n\t/build/pkg/csi/service/common/vsphereutil.go:550\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).ControllerPublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:1037\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).ControllerPublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:1050\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerPublishVolume_Handler.func1\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5632\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerPublishVolume\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:120\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:86\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer.func1\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:178\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:218\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:177\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi.(*StoragePlugin).injectContext\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware.go:231\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:106\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerPublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5634\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email 
protected]/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:722"}
{"level":"info","time":"2022-04-10T17:47:20.066064719Z","caller":"vanilla/controller.go:1075","msg":"ControllerUnpublishVolume: called with args {VolumeId:e5dc0f1b-399b-4839-955e-18a174e4e1ea NodeId:08050d6e39f9-qual-322-0afbac17 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"604cd65a-62ce-49f2-9859-41f6630b7b85"}
{"level":"error","time":"2022-04-10T17:47:20.084717468Z","caller":"vanilla/controller.go:1108","msg":"volumeID \"e5dc0f1b-399b-4839-955e-18a174e4e1ea\" not found in QueryVolume","TraceId":"604cd65a-62ce-49f2-9859-41f6630b7b85","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:1108\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:1162\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler.func1\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5650\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerUnpublishVolume\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:141\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/serialvolume/serial_volume_locker.go:88\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer.func1\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:178\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:218\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware/specvalidator/spec_validator.go:177\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi.(*StoragePlugin).injectContext\n\t/go/pkg/mod/github.com/rexray/[email protected]/middleware.go:231\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2\n\t/go/pkg/mod/github.com/rexray/[email protected]/utils/utils_middleware.go:106\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5652\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:722"}

Should this PR fix the detach issue? https://github.com/kubernetes-sigs/vsphere-csi-driver/pull/840 In this case the volume is not attached to the VM, so detach should succeed.

jingxu97 avatar Apr 12 '22 17:04 jingxu97

@jingxu97 I see that the error says "CNS: Failed to retrieve datastore for vol ...". Is there anything happening to the datastore? Are you doing something such as datastore evacuation?

In #840, if the VM no longer exists in the vCenter inventory, detach will be reported as successful. If the driver fails to check whether the volume is still attached, it still returns a failure.
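
In other words, the post-#840 detach path behaves roughly like this sketch (simplified, with hypothetical helper names; not the actual driver code):

package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the driver's real vCenter lookups, defined
// here only so the sketch is self-contained.
var errVMNotFound = errors.New("vm not found")

func findVM(ctx context.Context, nodeID string) (string, error)             { return "vm-1013755", nil }
func volumeAttached(ctx context.Context, vm, volumeID string) (bool, error) { return false, nil }
func detachDisk(ctx context.Context, vm, volumeID string) error             { return nil }

// controllerUnpublish summarizes the decision flow described above.
func controllerUnpublish(ctx context.Context, volumeID, nodeID string) error {
	vm, err := findVM(ctx, nodeID)
	if errors.Is(err, errVMNotFound) {
		// The VM is gone from the vCenter inventory: nothing can still
		// hold the disk, so report detach as successful.
		return nil
	}
	if err != nil {
		return err
	}
	attached, err := volumeAttached(ctx, vm, volumeID)
	if err != nil {
		// Could not verify the attachment state: keep returning failure
		// so the attach/detach controller retries instead of leaking a disk.
		return fmt.Errorf("failed to verify attachment: %w", err)
	}
	if !attached {
		return nil // already detached
	}
	return detachDisk(ctx, vm, volumeID)
}

func main() {
	fmt.Println(controllerUnpublish(context.Background(), "e5dc0f1b-399b-4839-955e-18a174e4e1ea", "worker3"))
}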

xing-yang avatar Apr 12 '22 23:04 xing-yang

I don't think there is datastore evacuation happening.

This specific flaky test is "an existing volume should be accessible on a new node after cluster scale up" from the Anthos qualification test. I think it only happens in this situation:

  1. pod with a volume created
  2. delete pod
  3. create a new node to join the cluster
  4. create the pod on the new node

There are other tests that detach the volume and attach it to a different node, but only this test fails. The only difference I can see is that it is trying to attach to a new node. In this case, attach fails with "Failed to retrieve datastore for vol". After a timeout, it tries to detach, which also fails with the same error.

jingxu97 avatar Apr 13 '22 18:04 jingxu97

@jingxu97 Can you provide more logs, i.e., complete CSI driver and syncer logs, CSI-attacher sidecar logs, and VC support bundle?

xing-yang avatar Apr 13 '22 19:04 xing-yang

@jingxu97 Have you deployed 2 K8s clusters with the same cluster id? We have seen similar things happening in the past when there are 2 K8s clusters with the same cluster id.
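
(For reference, the cluster id in question is the cluster-id field in the csi-vsphere.conf secret the driver reads; an illustrative snippet with placeholder values follows. It must be unique for every Kubernetes cluster that shares a vCenter.)

# csi-vsphere.conf (placeholder values)
[Global]
cluster-id = "k8s-cluster-a"

[VirtualCenter "vcenter.example.com"]
user = "administrator@vsphere.local"
password = "***"
datacenters = "dc-1"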

xing-yang avatar Apr 13 '22 21:04 xing-yang

vsphere-csi-controller log: https://gist.github.com/jingxu97/842b05c0452d8696d98a96e56793a048
csi-attacher log: https://gist.github.com/jingxu97/0289b2277251b31a01f75b057fd47615

jingxu97 avatar Apr 14 '22 00:04 jingxu97

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 13 '22 18:07 k8s-triage-robot

We are facing the same issue as @jingxu97.

# Disk gets listed
❯ govc disk.ls -ds=XXXX-Datastore | grep 38de0cc1b7d7
3eea43aa-e554-4fc5-89ee-38de0cc1b7d7  af2db7e5-8751-11ec-aeb2-d6a8c0fcbe1c

# But not volume
❯ govc volume.ls | grep 38de0cc1b7d7 | wc -l
0

@jingxu97 - how do you recover from such a situation?

dharapvj avatar Jul 26 '22 11:07 dharapvj

It turned out our backend system (a maintenance activity) did issue volume deletions in certain cases. So it is not an issue on the vCenter side.

jingxu97 avatar Jul 26 '22 14:07 jingxu97

/remove-lifecycle stale

dharapvj avatar Jul 27 '22 09:07 dharapvj

@xing-yang - how do we engage you / this team formally to help resolve this issue? We have already raised a VMware ticket (support request no. 22349608807). How do we get VMware to involve this team in resolving our issues?

dharapvj avatar Jul 28 '22 05:07 dharapvj

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 26 '22 05:10 k8s-triage-robot

/assign @xing-yang

lipingxue avatar Nov 17 '22 22:11 lipingxue

@dharapvj Sorry, I missed your ping. Have you got any update on the SR 22349608807? Usually the support team will investigate first and will open a bug for the CSI/CNS team to investigate further if needed.

xing-yang avatar Nov 17 '22 22:11 xing-yang

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Dec 17 '22 23:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jan 17 '23 00:01 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 17 '23 00:01 k8s-ci-robot

@dharapvj Were you able to recover from your situation? How did you re-register those disks as volumes?

verysonglaa avatar Jan 22 '24 10:01 verysonglaa

VMware identified an issue in vCenter's internal DB where one table had a duplicate entry, for some unknown reason.

After deleting that row, everything started to work properly, without any changes on the Kubernetes side.
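
For anyone else hitting this: the vsanvcmgmtd log earlier in the thread shows CNS create calls that carry a backingDiskId, so in principle an orphaned FCD can be re-registered as a CNS volume through the CNS CreateVolume API. A hedged govmomi sketch (untested; the IDs, names, and datastore reference are all placeholders, and the cluster metadata must match what the CSI driver uses):

package fcdrecover

import (
	"context"

	"github.com/vmware/govmomi/cns"
	cnstypes "github.com/vmware/govmomi/cns/types"
	vimtypes "github.com/vmware/govmomi/vim25/types"
)

// reRegisterFCD asks CNS to create a volume record on top of an existing
// first-class disk, referencing it by its backing disk ID.
func reRegisterFCD(ctx context.Context, cnsClient *cns.Client, ds vimtypes.ManagedObjectReference) error {
	spec := cnstypes.CnsVolumeCreateSpec{
		Name:       "pvc-orphaned-volume", // placeholder PV name
		VolumeType: "BLOCK",
		Datastores: []vimtypes.ManagedObjectReference{ds},
		Metadata: cnstypes.CnsVolumeMetadata{
			ContainerCluster: cnstypes.CnsContainerCluster{
				ClusterType: "KUBERNETES",
				ClusterId:   "k8s-cluster-a", // must match the driver's cluster-id
				VSphereUser: "administrator@vsphere.local",
			},
		},
		BackingObjectDetails: &cnstypes.CnsBlockBackingDetails{
			// The FCD ID that `govc disk.ls` still shows.
			BackingDiskId: "3eea43aa-e554-4fc5-89ee-38de0cc1b7d7",
		},
	}
	task, err := cnsClient.CreateVolume(ctx, []cnstypes.CnsVolumeCreateSpec{spec})
	if err != nil {
		return err
	}
	_, err = task.WaitForResult(ctx, nil)
	return err
}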

dharapvj avatar Jan 23 '24 12:01 dharapvj