vsphere-csi-driver Sometimes vsphere syncer fails to sync metadata with unable to acquire file lock

Open gnufied opened this issue 2 years ago • 9 comments

It looks like some syncs fail with the following error:

{"level":"error","time":"2022-03-14T22:49:04.266550924Z","caller":"volume/manager.go:1103","msg":"failed to update volume. 
updateSpec: \"(*types.CnsVolumeMetadataUpdateSpec)(0xc00107b5f0)({\\n DynamicData: (types.DynamicData) {\\n },\\n 
VolumeId: (types.CnsVolumeId) {\\n  DynamicData: (types.DynamicData) {\\n  },\\n  Id: (string) (len=36) \\\"ba096274-d1ce-41c1-953b-6bda7b74945b\\\"\\n },\\n Metadata: (types.CnsVolumeMetadata) {\\n  DynamicData: (types.DynamicData) {\\n  
},\\n  ContainerCluster: (types.CnsContainerCluster) {\\n   DynamicData: (types.DynamicData) {\\n   },\\n   ClusterType: (string) (len=10) \\\"KUBERNETES\\\",\\n   ClusterId: (string) (len=26) \\\"ci-op-s63trmc3-55b1b-sdvq4\\\",\\n   VSphereUser: (string) 
(len=22) \\\"VSPHERE.LOCAL\\\\\\\\ci_user4\\\",\\n   ClusterFlavor: (string) (len=7) \\\"VANILLA\\\",\\n   ClusterDistribution: (string) \\\"\\\"\\n  },\\n  EntityMetadata: ([]types.BaseCnsEntityMetadata) (len=1 cap=1) {\\n   (*types.CnsKubernetesEntityMetadata)
(0xc0002c7380)({\\n    CnsEntityMetadata: (types.CnsEntityMetadata) {\\n     DynamicData: (types.DynamicData) {\\n     },\\n     EntityName: (string) (len=40) \\\"pvc-5f4aff1b-9af5-418b-9031-9816ab8acb2f\\\",\\n     Labels: ([]types.KeyValue) <nil>,\\n     
Delete: (bool) false,\\n     ClusterID: (string) (len=26) \\\"ci-op-s63trmc3-55b1b-sdvq4\\\"\\n    },\\n    EntityType: (string) (len=17) \\\"PERSISTENT_VOLUME\\\",\\n    Namespace: (string) \\\"\\\",\\n    ReferredEntity: ([]types.CnsKubernetesEntityReference) 
<nil>\\n   })\\n  },\\n  ContainerClusterArray: ([]types.CnsContainerCluster) (len=1 cap=1) {\\n   (types.CnsContainerCluster) {\\n    DynamicData: (types.DynamicData) {\\n    },\\n    ClusterType: (string) (len=10) \\\"KUBERNETES\\\",\\n    ClusterId: (string) 
(len=26) \\\"ci-op-s63trmc3-55b1b-sdvq4\\\",\\n    VSphereUser: (string) (len=22) \\\"[email protected]\\\",\\n    ClusterFlavor: (string) (len=7) \\\"VANILLA\\\",\\n    ClusterDistribution: (string) \\\"\\\"\\n   }\\n  }\\n }\\n})\\n\", fault: 
\"(*types.LocalizedMethodFault)(0xc001c3a060)({\\n DynamicData: (types.DynamicData) {\\n },\\n Fault: (types.CnsFault) {\\n  BaseMethodFault: (types.BaseMethodFault) <nil>,\\n  Reason: (string) (len=560) \\\"(vmodl.fault.SystemError) {\\\\n   faultCause 
= (vmodl.MethodFault) null, \\\\n   faultMessage = <unset>, \\\\n   reason = \\\\\\\"Failed to lock the file: api = DiskLib_Open, _diskPath->CValue() = /vmfs/volumes/vsan:523ea352e875627d-b090c96b526bb79c/bd294161-20a1-00f7-fd05-3cecef1b8ff6
/_0090/e4daa20ac7fa496b833954ba2d923d3c.vmdk\\\\\\\"\\\\n   msg = \\\\\\\"A general system error occurred: Failed to lock the file: api = DiskLib_Open, _diskPath->CValue() = /vmfs/volumes/vsan:523ea352e875627d-b090c96b526bb79c
/bd294161-20a1-00f7-fd05-3cecef1b8ff6/_0090/e4daa20ac7fa496b833954ba2d923d3c.vmdk\\\\\\\"\\\\n}\\\"\\n },\\n LocalizedMessage: (string) (len=576) \\\"CnsFault error: (vmodl.fault.SystemError) {\\\\n   faultCause = (vmodl.MethodFault) null, 
\\\\n   faultMessage = <unset>, \\\\n   reason = \\\\\\\"Failed to lock the file: api = DiskLib_Open, _diskPath->CValue() = /vmfs/volumes/vsan:523ea352e875627d-b090c96b526bb79c/bd294161-20a1-00f7-fd05-3cecef1b8ff6/_0090
/e4daa20ac7fa496b833954ba2d923d3c.vmdk\\\\\\\"\\\\n   msg = \\\\\\\"A general system error occurred: Failed to lock the file: api = DiskLib_Open, _diskPath->CValue() = /vmfs/volumes/vsan:523ea352e875627d-b090c96b526bb79c/bd294161-20a1-00f7-
fd05-3cecef1b8ff6/_0090/e4daa20ac7fa496b833954ba2d923d3c.vmdk\\\\\\\"\\\\n}\\\"\\n})\\n\", opID: \"c8645a92\"","TraceId":"70c5efe3-23ee-40ea-9872-96d62cb707de","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/common/cns-
lib/volume.(*defaultManager).UpdateVolumeMetadata.func1\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/common/cns-lib/volume/manager.go:1103\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/common/cns-lib/volume.
(*defaultManager).UpdateVolumeMetadata\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/common/cns-lib/volume/manager.go:1111\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/syncer.csiPVUpdated\n\t/go/src/git

The key message is:

Failed to lock the file: api = DiskLib_Open, _diskPath->CValue() = /vmfs/volumes/vsan:523ea352e875627d-b090c96b526bb79c/bd294161-20a1-00f7-fd05-3cecef1b8ff6/_0090/e4daa20ac7fa496b833954ba2d923d3c.vmdk

Is the vSphere syncer racy? Is this caused by concurrent operations against the same volume?

cc @RaunakShah @divyenpatel

Mar 15 '22 19:03 gnufied

@gnufied what is the vSphere version you are using?

Mar 18 '22 19:03 divyenpatel

It appears to be 7.0.2 - build 17920168

Mar 18 '22 19:03 gnufied

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jun 16 '22 20:06 k8s-triage-robot

/lifecycle rotten

Jul 16 '22 21:07 k8s-triage-robot

/close

Aug 15 '22 21:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

Aug 15 '22 21:08 k8s-ci-robot

/reopen

Oct 11 '22 19:10 gnufied

@gnufied: Reopened this issue.

Oct 11 '22 19:10 k8s-ci-robot

/remove-lifecycle rotten

Oct 11 '22 19:10 gnufied

/assign

Nov 17 '22 22:11 adikul30

@gnufied Apologies for the delayed response. From the log, it is a CnsFault. You can also see an opID associated with the CNS task. To investigate, CNS usually requires a vCenter (VC) support bundle. Is that available? Additionally, is this a recurring condition?
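For anyone collecting the identifiers mentioned here, the following is a rough sketch of pulling the opID, TraceId, and disk path out of one JSON log line so they can be searched for in a support bundle. It assumes the field names and message layout seen in the log earlier in this thread; the `sample` line is an abbreviated copy of that entry, and `logEntry`, `faultSummary`, and `summarize` are hypothetical names, not driver APIs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// sample is an abbreviated version of the log entry from this issue;
// the elided parts are marked with "...".
const sample = `{"level":"error","caller":"volume/manager.go:1103","msg":"failed to update volume. fault: \"... Failed to lock the file: api = DiskLib_Open, _diskPath->CValue() = /vmfs/volumes/vsan:523ea352e875627d-b090c96b526bb79c/bd294161-20a1-00f7-fd05-3cecef1b8ff6/_0090/e4daa20ac7fa496b833954ba2d923d3c.vmdk ...\", opID: \"c8645a92\"","TraceId":"70c5efe3-23ee-40ea-9872-96d62cb707de"}`

// logEntry covers only the fields this sketch needs.
type logEntry struct {
	Level   string `json:"level"`
	Caller  string `json:"caller"`
	Msg     string `json:"msg"`
	TraceID string `json:"TraceId"`
}

// faultSummary holds the identifiers worth searching for in a support bundle.
type faultSummary struct {
	OpID     string
	TraceID  string
	DiskPath string
}

var (
	// Assumes the opID is logged as: opID: "<hex>"
	opIDRe = regexp.MustCompile(`opID: "([0-9a-fA-F]+)"`)
	// Assumes the path runs, without whitespace, up to the first ".vmdk".
	diskPathRe = regexp.MustCompile(`_diskPath->CValue\(\) = (\S+?\.vmdk)`)
)

// summarize parses one JSON log line and extracts the identifiers.
// Fields that are not found are left empty.
func summarize(line string) (faultSummary, error) {
	var e logEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return faultSummary{}, fmt.Errorf("parse log line: %w", err)
	}
	s := faultSummary{TraceID: e.TraceID}
	if m := opIDRe.FindStringSubmatch(e.Msg); m != nil {
		s.OpID = m[1]
	}
	if m := diskPathRe.FindStringSubmatch(e.Msg); m != nil {
		s.DiskPath = m[1]
	}
	return s, nil
}

func main() {
	s, err := summarize(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("opID:    ", s.OpID)
	fmt.Println("TraceId: ", s.TraceID)
	fmt.Println("diskPath:", s.DiskPath)
}
```

Running the same function over a whole syncer log and counting entries per disk path would also be one way to answer whether the failure recurs.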

Jan 05 '23 20:01 adikul30

/lifecycle stale

Apr 05 '23 21:04 k8s-triage-robot

/lifecycle rotten

May 05 '23 22:05 k8s-triage-robot

/close not-planned

Jun 04 '23 22:06 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Jun 04 '23 22:06 k8s-ci-robot
