Failing unit test: TestTopologyParams

Open jsafrane opened this issue 1 year ago • 3 comments

Unit test TestTopologyParams/provision_with_selected_node,_but_node_does_not_exist fails frequently with the following log:

W1221 13:41:19.810378 1289631 shared_informer.go:544] resyncPeriod 100ms is too small. Changing it to the minimum allowed value of 1s
W1221 13:41:19.810457 1289631 shared_informer.go:544] resyncPeriod 100ms is too small. Changing it to the minimum allowed value of 1s
I1221 13:41:19.810509 1289631 leaderelection.go:248] attempting to acquire leader lease default/foo.bar-baz...
I1221 13:41:19.810621 1289631 leaderelection.go:258] successfully acquired lease default/foo.bar-baz
I1221 13:41:19.810940 1289631 event.go:285] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"default", Name:"foo.bar-baz", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' dhcp129-178.brq.redhat.com_aaaa6be1-6281-4dd1-af24-e5928ad20806 became leader
I1221 13:41:19.810976 1289631 controller.go:811] Starting provisioner controller foo.bar/baz_dhcp129-178.brq.redhat.com_aaaa6be1-6281-4dd1-af24-e5928ad20806!
I1221 13:41:19.810985 1289631 event.go:285] Event(v1.ObjectReference{Kind:"Lease", Namespace:"default", Name:"foo.bar-baz", UID:"", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' dhcp129-178.brq.redhat.com_aaaa6be1-6281-4dd1-af24-e5928ad20806 became leader
I1221 13:41:19.911085 1289631 controller.go:860] Started provisioner controller foo.bar/baz_dhcp129-178.brq.redhat.com_aaaa6be1-6281-4dd1-af24-e5928ad20806!
I1221 13:41:19.911127 1289631 controller.go:1366] provision "default/claim-1" class "class-1": started
I1221 13:41:19.911180 1289631 controller.go:1491] provision "default/claim-1" class "class-1": volume rescheduled because: nodes "node-1" not found
I1221 13:41:19.911183 1289631 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"claim-1", UID:"uid-1-1", APIVersion:"v1", ResourceVersion:"0", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' nodes "node-1" not found
I1221 13:41:19.911204 1289631 controller.go:1366] provision "default/claim-1" class "class-1": started
I1221 13:41:19.911213 1289631 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"claim-1", UID:"uid-1-1", APIVersion:"v1", ResourceVersion:"0", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/claim-1"
I1221 13:41:19.961387 1289631 controller.go:1449] provision "default/claim-1" class "class-1": volume "pvc-uid-1-1" provisioned
I1221 13:41:19.961415 1289631 controller.go:1462] provision "default/claim-1" class "class-1": succeeded
I1221 13:41:19.961440 1289631 volume_store.go:212] Trying to save persistentvolume "pvc-uid-1-1"
I1221 13:41:19.961473 1289631 volume_store.go:219] persistentvolume "pvc-uid-1-1" saved
I1221 13:41:19.961524 1289631 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"claim-1", UID:"uid-1-1", APIVersion:"v1", ResourceVersion:"0", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-uid-1-1
--- FAIL: TestTopologyParams (0.20s)
    --- FAIL: TestTopologyParams/provision_with_selected_node,_but_node_does_not_exist (0.20s)
        controller_test.go:971: did not expect a Provision() call but got at least 1
FAIL

The test is meant to verify that Provision() is not called when volume.kubernetes.io/selected-node refers to a node that does not exist. However, during the test the annotation is (correctly) removed and syncClaim is called again, this time with a PVC that has no selected-node annotation, so provisioning proceeds.

This may actually be a serious bug: the library may provision a topology-unaware volume even though topology was required.
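
A minimal, self-contained Go sketch of that sequence (hypothetical names and stubbed helpers, not the library's actual controller code) illustrates why the test still sees a Provision() call:

package main

import (
	"fmt"
)

// Simplified, hypothetical model of the claim-sync flow described above.
const annSelectedNode = "volume.kubernetes.io/selected-node"

type claim struct {
	annotations map[string]string
}

var existingNodes = map[string]bool{} // "node-1" intentionally does not exist

func getNode(name string) error {
	if !existingNodes[name] {
		return fmt.Errorf("nodes %q not found", name)
	}
	return nil
}

var provisionCalls int

func provision(c *claim) {
	provisionCalls++
	fmt.Println("Provision() called without a selected node")
}

func syncClaim(c *claim) {
	if node, ok := c.annotations[annSelectedNode]; ok {
		if err := getNode(node); err != nil {
			// First sync: the node does not exist, so the annotation is removed
			// and the claim is rescheduled ("volume rescheduled because: ... not found").
			delete(c.annotations, annSelectedNode)
			return
		}
	}
	// Later sync: the annotation is gone, the shouldProvision-style check passes
	// and a topology-unaware volume is provisioned.
	provision(c)
}

func main() {
	c := &claim{annotations: map[string]string{annSelectedNode: "node-1"}}
	syncClaim(c) // annotation removed, no provisioning
	syncClaim(c) // provisioning proceeds without topology
	fmt.Println("Provision() calls:", provisionCalls) // the test expects 0, gets 1
}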

git bisect shows that it is most probably caused by commit 438f6650928e0d957b3a330733cea9c45768a527 (cc @sunnylovestiramisu).

jsafrane · Dec 21 '23

A failed CI run: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_sig-storage-lib-external-provisioner/158/pull-sig-storage-lib-external-provisioner-unit/1718183622202101760

jsafrane · Dec 21 '23

Seems like a concurrency issue? It looks like this error only occurs when the first request updates the cache after the second request has already passed the shouldProvision check.
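
A generic check-then-act sketch of that interleaving (ordering forced with channels for determinism; this is not the controller's actual work-queue code) would look roughly like:

package main

import (
	"fmt"
	"sync"
)

const annSelectedNode = "volume.kubernetes.io/selected-node"

// cache stands in for the shared state the two syncs read and write.
type cache struct {
	mu          sync.Mutex
	annotations map[string]string
}

func (c *cache) has(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.annotations[key]
	return ok
}

func (c *cache) delete(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.annotations, key)
}

func main() {
	c := &cache{annotations: map[string]string{annSelectedNode: "node-1"}}
	checked := make(chan struct{})
	updated := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(2)

	// Second request: passes the shouldProvision-style check while the
	// annotation is still in the cache...
	go func() {
		defer wg.Done()
		fmt.Println("request 2: check passed, selected node present:", c.has(annSelectedNode))
		close(checked)
		<-updated
		// ...and only then acts on that stale view and provisions.
		fmt.Println("request 2: provisioning based on stale claim")
	}()

	// First request: removes the annotation (node not found), but only after
	// the second request has already passed its check.
	go func() {
		defer wg.Done()
		<-checked
		c.delete(annSelectedNode)
		fmt.Println("request 1: removed selected-node annotation from the cache")
		close(updated)
	}()

	wg.Wait()
}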

mowangdk · Dec 25 '23

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Mar 24 '24

/remove-lifecycle stale

xing-yang · Mar 28 '24