azuredisk-csi-driver
Failed to stage volume to node
What happened: When we run several pods on the k8s cluster, two of the pods running on the same node cannot work.
describe pod
root@b972e187-7d97-4933-4479-6f99f744c6a7:/tmp/build/88f1b62d# kubectl describe pod sample-statefulset-1
Name: sample-statefulset-1
Namespace: od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf
Priority: 0
Service Account: stateful-set-sa
Node: 849574ed-eef0-49a3-aef2-a03c1737cd4a/10.0.16.20
Start Time: Tue, 09 Apr 2024 13:35:44 +0000
Labels: app=pv-ex
controller-revision-hash=sample-statefulset-b77598d95
statefulset.kubernetes.io/pod-name=sample-statefulset-1
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/sample-statefulset
Containers:
pv-ex:
Container ID:
Image: gcr.io/cf-pks-releng-a1/pksrelengci/persistent-example:latest
Image ID:
Port: 3000/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment:
LOG_FILE: /usr/share/pv-ex/log.txt
Mounts:
/usr/share/pv-ex from www (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2rs44 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
www:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: www-sample-statefulset-1
ReadOnly: false
kube-api-access-2rs44:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 6m21s (x273 over 18h) kubelet MountVolume.MountDevice failed for volume "pvc-53087397-597f-41aa-bccb-e2d8ce450626" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning FailedMount 97s (x482 over 18h) kubelet Unable to attach or mount volumes: unmounted volumes=[www], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
describe pv:
kubectl describe pv pvc-53087397-597f-41aa-bccb-e2d8ce450626
Name: pvc-53087397-597f-41aa-bccb-e2d8ce450626
Labels: <none>
Annotations: pv.kubernetes.io/migrated-to: disk.csi.azure.com
pv.kubernetes.io/provisioned-by: kubernetes.io/azure-disk
volume.kubernetes.io/provisioner-deletion-secret-name:
volume.kubernetes.io/provisioner-deletion-secret-namespace:
Finalizers: [kubernetes.io/pv-protection external-attacher/disk-csi-azure-com]
StorageClass: ci-storage
Status: Bound
Claim: od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf/www-sample-statefulset-1
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 1Gi
Node Affinity:
Required Terms:
Term 0: topology.disk.csi.azure.com/zone in [westus2-1]
Message:
Source:
Type: AzureDisk (an Azure Data Disk mount on the host and bind mount to the pod)
DiskName: pvc-53087397-597f-41aa-bccb-e2d8ce450626
DiskURI: /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626
Kind: Managed
FSType:
CachingMode: ReadWrite
ReadOnly: false
Events: <none>
kubectl describe volumeattachment csi-5e61ec3524c386fbcdbfa765f2c0b089c359348ef4961e9d4e77ff0134b12aa7
Name: csi-5e61ec3524c386fbcdbfa765f2c0b089c359348ef4961e9d4e77ff0134b12aa7
Namespace:
Labels: <none>
Annotations: csi.alpha.kubernetes.io/node-id: 849574ed-eef0-49a3-aef2-a03c1737cd4a
API Version: storage.k8s.io/v1
Kind: VolumeAttachment
Metadata:
Creation Timestamp: 2024-04-09T13:35:44Z
Finalizers:
external-attacher/disk-csi-azure-com
Resource Version: 102334
UID: 15ff9764-3ab6-4c99-b809-f2a26cb08362
Spec:
Attacher: disk.csi.azure.com
Node Name: 849574ed-eef0-49a3-aef2-a03c1737cd4a
Source:
Persistent Volume Name: pvc-53087397-597f-41aa-bccb-e2d8ce450626
Status:
Attached: true
Attachment Metadata:
LUN: 2
Diskiopsreadwrite: 500
Diskmbpsreadwrite: 60
Networkaccesspolicy: AllowAll
Skuname: Standard_LRS
Events: <none>
csicontroller.log
kubectl logs csi-azuredisk-controller-b5cd56b5f-85wcc -n kube-system | grep pvc-53087397-597f-41aa-bccb-e2d8ce450626
Defaulted container "csi-provisioner" out of: csi-provisioner, csi-attacher, csi-snapshotter, csi-resizer, liveness-probe, azuredisk
I0409 13:35:42.159495 1 controller.go:923] successfully created PV pvc-53087397-597f-41aa-bccb-e2d8ce450626 for PVC www-sample-statefulset-1 and csi volume name /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626
I0409 13:35:42.159599 1 controller.go:1442] provision "od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf/www-sample-statefulset-1" class "ci-storage": volume "pvc-53087397-597f-41aa-bccb-e2d8ce450626" provisioned
I0409 13:35:42.169628 1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf", Name:"www-sample-statefulset-1", UID:"53087397-597f-41aa-bccb-e2d8ce450626", APIVersion:"v1", ResourceVersion:"102231", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-53087397-597f-41aa-bccb-e2d8ce450626
root@b972e187-7d97-4933-4479-6f99f744c6a7:/tmp/build/88f1b62d# kubectl logs csi-azuredisk-controller-b5cd56b5f-85wcc -n kube-system -c azuredisk | grep pvc-53087397-597f-41aa-bccb-e2d8ce450626
I0409 13:35:39.617879 1 utils.go:78] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"topology.disk.csi.azure.com/zone":"westus2-1","topology.kubernetes.io/zone":"westus2-1"}},{"segments":{"topology.disk.csi.azure.com/zone":"westus2-2","topology.kubernetes.io/zone":"westus2-2"}},{"segments":{"topology.disk.csi.azure.com/zone":"westus2-3","topology.kubernetes.io/zone":"westus2-3"}}],"requisite":[{"segments":{"topology.disk.csi.azure.com/zone":"westus2-3","topology.kubernetes.io/zone":"westus2-3"}},{"segments":{"topology.disk.csi.azure.com/zone":"westus2-2","topology.kubernetes.io/zone":"westus2-2"}},{"segments":{"topology.disk.csi.azure.com/zone":"westus2-1","topology.kubernetes.io/zone":"westus2-1"}}]},"capacity_range":{"required_bytes":1073741824},"name":"pvc-53087397-597f-41aa-bccb-e2d8ce450626","parameters":{"csi.storage.k8s.io/pv/name":"pvc-53087397-597f-41aa-bccb-e2d8ce450626","csi.storage.k8s.io/pvc/name":"www-sample-statefulset-1","csi.storage.k8s.io/pvc/namespace":"od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf","kind":"managed","storageaccounttype":"Standard_LRS"},"volume_capabilities":[{"AccessType":{"Mount":{}},"access_mode":{"mode":7}}]}
I0409 13:35:39.618148 1 controllerserver.go:198] begin to create azure disk(pvc-53087397-597f-41aa-bccb-e2d8ce450626) account type(Standard_LRS) rg(obsidian-flier) location(westus2) size(1) diskZone(westus2-1) maxShares(0)
I0409 13:35:39.618168 1 azure_managedDiskController.go:98] azureDisk - creating new managed Name:pvc-53087397-597f-41aa-bccb-e2d8ce450626 StorageAccountType:Standard_LRS Size:1
I0409 13:35:42.158743 1 azure_managedDiskController.go:288] azureDisk - created new MD Name:pvc-53087397-597f-41aa-bccb-e2d8ce450626 StorageAccountType:Standard_LRS Size:1
I0409 13:35:42.158780 1 controllerserver.go:294] create azure disk(pvc-53087397-597f-41aa-bccb-e2d8ce450626) account type(Standard_LRS) rg(obsidian-flier) location(westus2) size(1) tags(map[kubernetes.io-created-for-pv-name:pvc-53087397-597f-41aa-bccb-e2d8ce450626 kubernetes.io-created-for-pvc-name:www-sample-statefulset-1 kubernetes.io-created-for-pvc-namespace:od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf]) successfully
I0409 13:35:42.158843 1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=2.540643032 request="azuredisk_csi_driver_controller_create_volume" resource_group="obsidian-flier" subscription_id="066b2cf2-eca1-4e23-98b4-ce94c72b99b3" source="disk.csi.azure.com" volumeid="/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626" result_code="succeeded"
I0409 13:35:42.158867 1 utils.go:84] GRPC response: {"volume":{"accessible_topology":[{"segments":{"topology.disk.csi.azure.com/zone":"westus2-1"}}],"capacity_bytes":1073741824,"content_source":{"Type":null},"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-53087397-597f-41aa-bccb-e2d8ce450626","csi.storage.k8s.io/pvc/name":"www-sample-statefulset-1","csi.storage.k8s.io/pvc/namespace":"od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf","kind":"Managed","requestedsizegib":"1","storageaccounttype":"Standard_LRS"},"volume_id":"/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626"}}
I0409 13:35:44.853034 1 utils.go:78] GRPC request: {"node_id":"849574ed-eef0-49a3-aef2-a03c1737cd4a","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"cachingmode":"ReadWrite","fstype":"","kind":"Managed"},"volume_id":"/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626"}
I0409 13:35:44.891972 1 controllerserver.go:401] GetDiskLun returned: cannot find Lun for disk pvc-53087397-597f-41aa-bccb-e2d8ce450626. Initiating attaching volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 to node 849574ed-eef0-49a3-aef2-a03c1737cd4a (vmState Succeeded).
I0409 13:35:44.892007 1 controllerserver.go:426] Trying to attach volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 to node 849574ed-eef0-49a3-aef2-a03c1737cd4a
I0409 13:35:44.892032 1 azure_controller_common.go:255] wait 1000ms for more requests on node 849574ed-eef0-49a3-aef2-a03c1737cd4a, current disk attach: /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626
I0409 13:35:45.892758 1 azure_controller_common.go:269] Trying to attach volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 lun 2 to node 849574ed-eef0-49a3-aef2-a03c1737cd4a, diskMap len:1, map[/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourcegroups/obsidian-flier/providers/microsoft.compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626:%!s(*provider.AttachDiskOptions=&{ReadWrite pvc-53087397-597f-41aa-bccb-e2d8ce450626 false 2})]
I0409 13:35:45.892866 1 azure_controller_standard.go:103] azureDisk - update(obsidian-flier): vm(849574ed-eef0-49a3-aef2-a03c1737cd4a) - attach disk list(map[/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourcegroups/obsidian-flier/providers/microsoft.compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626:%!s(*provider.AttachDiskOptions=&{ReadWrite pvc-53087397-597f-41aa-bccb-e2d8ce450626 false 2})])
I0409 13:35:46.399029 1 azure_controller_standard.go:116] azureDisk - update(obsidian-flier): vm(849574ed-eef0-49a3-aef2-a03c1737cd4a) - attach disk list(map[/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourcegroups/obsidian-flier/providers/microsoft.compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626:%!s(*provider.AttachDiskOptions=&{ReadWrite pvc-53087397-597f-41aa-bccb-e2d8ce450626 false 2})]) returned with <nil>
I0409 13:35:56.607788 1 azure_controller_common.go:558] azureDisk - find disk: lun 2 name pvc-53087397-597f-41aa-bccb-e2d8ce450626 uri /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626
I0409 13:35:56.607819 1 controllerserver.go:436] Attach operation successful: volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 attached to node 849574ed-eef0-49a3-aef2-a03c1737cd4a.
I0409 13:35:56.607841 1 controllerserver.go:456] attach volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 to node 849574ed-eef0-49a3-aef2-a03c1737cd4a successfully
I0409 13:35:56.607904 1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=11.715902039 request="azuredisk_csi_driver_controller_publish_volume" resource_group="obsidian-flier" subscription_id="066b2cf2-eca1-4e23-98b4-ce94c72b99b3" source="disk.csi.azure.com" volumeid="/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626" node="849574ed-eef0-49a3-aef2-a03c1737cd4a" result_code="succeeded"
root@b972e187-7d97-4933-4479-6f99f744c6a7:/tmp/build/88f1b62d# kubectl logs csi-azuredisk-node-4ts6c -n kube-system | grep pvc-53087397-597f-41aa-bccb-e2d8ce450626
Defaulted container "liveness-probe" out of: liveness-probe, node-driver-registrar, azuredisk
root@b972e187-7d97-4933-4479-6f99f744c6a7:/tmp/build/88f1b62d# kubectl logs csi-azuredisk-node-4ts6c -n kube-system -c azuredisk| grep pvc-53087397-597f-41aa-bccb-e2d8ce450626
CSInode.log
I0410 06:48:42.302185 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 1:0:0:0 directly
I0410 06:48:42.302208 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:0 directly
I0410 06:48:42.302213 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:1 directly
I0410 06:48:42.302218 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:2 directly
I0410 06:48:42.302222 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 4:0:1:0 directly
I0410 06:48:42.302226 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 5:0:0:0 directly
I0410 06:48:43.302264 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 1:0:0:0 directly
I0410 06:48:43.302287 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:0 directly
I0410 06:48:43.302292 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:1 directly
I0410 06:48:43.302296 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:2 directly
I0410 06:48:43.302300 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 4:0:1:0 directly
I0410 06:48:43.302304 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 5:0:0:0 directly
I0410 06:48:44.302389 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 1:0:0:0 directly
I0410 06:48:44.302414 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:0 directly
I0410 06:48:44.302419 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:1 directly
I0410 06:48:44.302423 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:2 directly
I0410 06:48:44.302426 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 4:0:1:0 directly
I0410 06:48:44.302430 1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 5:0:0:0 directly
I0410 06:48:44.729773 1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I0410 06:48:44.729791 1 utils.go:78] GRPC request: {"publish_context":{"LUN":"2","diskiopsreadwrite":"500","diskmbpsreadwrite":"60","networkaccesspolicy":"AllowAll","skuname":"Standard_LRS"},"staging_target_path":"/var/vcap/data/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ad2407d46e6a6d65ad8f32685e88d02e2ff433ef4fb076f20ad79eefded4d947/globalmount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volue_context":{"cachingmode":"ReadWrite","fstype":"","kind":"Managed"},"volume_id":"/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626"}
What you expected to happen: The pod should be running.
How to reproduce it: Happens occasionally; recreating the worker node usually resolves the issue.
Anything else we need to know?:
Environment:
- CSI Driver version:
kubectl get po -n kube-system -o yaml | grep mcr | grep azuredisk
image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
imageID: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi@sha256:d001cbb89e1e56eeed62575bcc9558fc00044fefb0177ea52b6311c152e5239c
image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
imageID: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi@sha256:d001cbb89e1e56eeed62575bcc9558fc00044fefb0177ea52b6311c152e5239c
image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
imageID: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi@sha256:d001cbb89e1e56eeed62575bcc9558fc00044fefb0177ea52b6311c152e5239c
image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
imageID: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi@sha256:d001cbb89e1e56eeed62575bcc9558fc00044fefb0177ea52b6311c152e5239c
- Kubernetes version (use kubectl version):
kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.12", GitCommit:"12031002905c0410706974560cbdf2dad9278919", GitTreeState:"clean", BuildDate:"2024-03-15T02:15:31Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.12+vmware.1", GitCommit:"2eb649beaf2d31b1873b8ff68a1bffe984f8ec05", GitTreeState:"clean", BuildDate:"2024-03-27T07:10:53Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. from /etc/os-release):
worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:~$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
- Kernel (e.g. uname -a):
Linux 849574ed-eef0-49a3-aef2-a03c1737cd4a 5.15.0-101-generic #111-Ubuntu SMP Tue Mar 5 20:16:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
- Others:
SSH to the worker node: we can see the devices (sdd, sde) are there, but not staged.
worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
fd0 2:0 1 4K 0 disk
sda 8:0 0 403G 0 disk
├─sda1 8:1 0 4.8G 0 part /home
│ /
└─sda2 8:2 0 398.2G 0 part /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/host-var-log-antrea/antrea-ovs/1
/var/vcap/data/kubelet/pods/9f022aaf-d7b0-477a-b153-9c59404a403f/volume-subpaths/antrea-config/antrea-controller/0
/var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-config/antrea-agent/0
/var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-config/install-cni/0
/var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-agent-tweaker-config/antrea-agent-tweaker/0
/var/tmp
/tmp
/opt
/var/opt
/var/log
/var/vcap/data
sdb 8:16 0 64G 0 disk
└─sdb1 8:17 0 64G 0 part
sdc 8:32 0 50G 0 disk
└─sdc1 8:33 0 50G 0 part /var/vcap/store
sdd 8:48 0 1G 0 disk
sde 8:64 0 1G 0 disk
sr0 11:0 1 648K 0 rom
worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:/sys/bus/scsi/devices$ ls -al
total 0
drwxr-xr-x 2 root root 0 Apr 9 08:53 .
drwxr-xr-x 4 root root 0 Apr 9 08:53 ..
lrwxrwxrwx 1 root root 0 Apr 9 08:53 1:0:0:0 -> ../../../devices/pci0000:00/0000:00:07.1/ata2/host1/target1:0:0/1:0:0:0
lrwxrwxrwx 1 root root 0 Apr 9 09:01 2:0:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2/target2:0:0/2:0:0:0
lrwxrwxrwx 1 root root 0 Apr 9 13:35 2:0:0:1 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2/target2:0:0/2:0:0:1
lrwxrwxrwx 1 root root 0 Apr 9 13:35 2:0:0:2 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2/target2:0:0/2:0:0:2
lrwxrwxrwx 1 root root 0 Apr 9 08:53 4:0:1:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host4/target4:0:1/4:0:1:0
lrwxrwxrwx 1 root root 0 Apr 9 08:53 5:0:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host5/target5:0:0/5:0:0:0
lrwxrwxrwx 1 root root 0 Apr 9 08:53 host0 -> ../../../devices/pci0000:00/0000:00:07.1/ata1/host0
lrwxrwxrwx 1 root root 0 Apr 9 08:53 host1 -> ../../../devices/pci0000:00/0000:00:07.1/ata2/host1
lrwxrwxrwx 1 root root 0 Apr 9 08:53 host2 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2
lrwxrwxrwx 1 root root 0 Apr 9 08:53 host3 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781a-1e82-4818-a1c3-63d806ec15bb/host3
lrwxrwxrwx 1 root root 0 Apr 9 08:53 host4 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host4
lrwxrwxrwx 1 root root 0 Apr 9 08:53 host5 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host5
lrwxrwxrwx 1 root root 0 Apr 9 08:53 target1:0:0 -> ../../../devices/pci0000:00/0000:00:07.1/ata2/host1/target1:0:0
lrwxrwxrwx 1 root root 0 Apr 10 07:21 target2:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2/target2:0:0
lrwxrwxrwx 1 root root 0 Apr 9 08:53 target4:0:1 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host4/target4:0:1
lrwxrwxrwx 1 root root 0 Apr 9 08:53 target5:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host5/target5:0:0
azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 1:0:0:0 directly
From the error message, it seems you are missing such a udev rule on the node. Follow the guide here to generate the udev rule on the node: https://github.com/kubernetes-sigs/azuredisk-csi-driver/issues/1631#issuecomment-1372299254. Azure VMs should have such a built-in rule by default; is your node type special? @xiaoping8385
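For reference, a quick way to check whether the Azure storage udev rule and the device links it creates are present on a node (illustrative commands only; exact file names and paths can vary by distribution):
# look for an Azure storage rule shipped with the image (name may vary)
ls /etc/udev/rules.d/ /usr/lib/udev/rules.d/ 2>/dev/null | grep -i azure
# when the rule is active, data disks get stable symlinks keyed by LUN,
# e.g. /dev/disk/azure/scsi1/lun2 pointing at the disk attached at LUN 2
ls -l /dev/disk/azure/scsi1/ 2>/dev/null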
There is no such rule, and not all the nodes have the same issue. For example, we found one of the nodes can mount the PV successfully, and there is also no such built-in rule there. On the normal node:
> resize2fs /dev/sdd
resize2fs 1.46.5 (30-Dec-2021)
The filesystem is already 262144 (4k) blocks long. Nothing to do!
worker/722ad0af-9e98-4c46-b47d-44ba282a5960:~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
fd0 2:0 1 4K 0 disk
sda 8:0 0 64G 0 disk
└─sda1 8:1 0 64G 0 part
sdb 8:16 0 403G 0 disk
├─sdb1 8:17 0 4.8G 0 part /home
│ /
└─sdb2 8:18 0 398.2G 0 part /var/vcap/data/kubelet/pods/5986f253-7cd5-4b7a-ad6b-f6e8c8c0f681/volume-subpaths/host-var-log-antrea/antrea-ovs/1
/var/vcap/data/kubelet/pods/5986f253-7cd5-4b7a-ad6b-f6e8c8c0f681/volume-subpaths/antrea-config/antrea-agent/0
/var/vcap/data/kubelet/pods/5986f253-7cd5-4b7a-ad6b-f6e8c8c0f681/volume-subpaths/antrea-config/install-cni/0
/var/vcap/data/kubelet/pods/5986f253-7cd5-4b7a-ad6b-f6e8c8c0f681/volume-subpaths/antrea-agent-tweaker-config/antrea-agent-tweaker/0
/var/tmp
/tmp
/opt
/var/opt
/var/log
/var/vcap/data
sdc 8:32 0 50G 0 disk
└─sdc1 8:33 0 50G 0 part /var/vcap/store
sdd 8:48 0 1G 0 disk /var/vcap/data/kubelet/pods/fa2d11c4-ef95-47ef-b2b8-7b5fbf8ed279/volumes/kubernetes.io~csi/pvc-684f4e73-681d-4164-97d7-5440bafc950b/mount
/var/vcap/data/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/c4d484f06606c7650a1b33528b67e8221f3c74b68b45db9bd301ad189c62a2eb/globalmount
sr0 11:0 1 648K 0 rom
worker/722ad0af-9e98-4c46-b47d-44ba282a5960:~$ cd /etc/udev/rules.d/
worker/722ad0af-9e98-4c46-b47d-44ba282a5960:/etc/udev/rules.d$ ls
worker/722ad0af-9e98-4c46-b47d-44ba282a5960:/etc/udev/rules.d$ ls -al
total 8
drwxr-xr-x 2 root root 4096 Apr 7 2022 .
drwxr-xr-x 4 root root 4096 Mar 18 12:14 ..
worker/722ad0af-9e98-4c46-b47d-44ba282a5960:/etc/udev/rules.d$
On the problematic node,
worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
fd0 2:0 1 4K 0 disk
sda 8:0 0 403G 0 disk
├─sda1 8:1 0 4.8G 0 part /home
│ /
└─sda2 8:2 0 398.2G 0 part /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/host-var-log-antrea/antrea-ovs/1
/var/vcap/data/kubelet/pods/9f022aaf-d7b0-477a-b153-9c59404a403f/volume-subpaths/antrea-config/antrea-controller/0
/var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-config/antrea-agent/0
/var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-config/install-cni/0
/var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-agent-tweaker-config/antrea-agent-tweaker/0
/var/tmp
/tmp
/opt
/var/opt
/var/log
/var/vcap/data
sdb 8:16 0 64G 0 disk
└─sdb1 8:17 0 64G 0 part
sdc 8:32 0 50G 0 disk
└─sdc1 8:33 0 50G 0 part /var/vcap/store
sdd 8:48 0 1G 0 disk
sde 8:64 0 1G 0 disk
worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:/var/vcap/bosh_ssh/bosh_79bac66d3614401# resize2fs /dev/sdd
resize2fs 1.46.5 (30-Dec-2021)
resize2fs: Bad magic number in super-block while trying to open /dev/sdd
Couldn't find valid filesystem superblock.
worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:/etc/udev/rules.d# resize2fs /dev/sde
resize2fs 1.46.5 (30-Dec-2021)
resize2fs: Bad magic number in super-block while trying to open /dev/sde
Couldn't find valid filesystem superblock.
Both nodes are created from the same template.
Here is the data disk parsing logic; the reliable way is creating that udev rule on the node, which would create device links under /dev/disk/azure:
https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/99a117cfe0ff682d364215f6e6ab9e056a5b6339/pkg/azuredisk/azure_common_linux.go#L120-L194
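When /dev/disk/azure is not populated, that code falls back to resolving the data disk by walking the SCSI device tree by LUN. Roughly the same information can be inspected by hand (a sketch for illustration, not the driver's exact code path):
# each entry under /sys/bus/scsi/devices is named host:channel:target:lun;
# for the data disk attached at LUN 2 on the VMBus SCSI controller:
cat /sys/bus/scsi/devices/2:0:0:2/vendor   # expected to show Msft for an Azure data disk
ls /sys/bus/scsi/devices/2:0:0:2/block/    # expected to show the kernel block device name, e.g. sde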
Who should be responsible for creating the udev rule? I thought the VM is created by the Azure cloud, and it cannot explain why some of the nodes work and some don't.
Can you try creating the udev rule on one node and check whether it works? @xiaoping8385 If it's working, then I could provide you a daemonset to create the udev rule on every node.
So I need to create a rules file (66-azure-storage.rules) as described in the #1631 comment and apply that?
@xiaoping8385 Just create that rules file and it would work automatically; then reschedule the pod with the disk volume onto that node.
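For example (illustrative; the StatefulSet controller recreates the pod after deletion, which reschedules it):
kubectl delete pod sample-statefulset-1 -n od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf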
@andyzhangx Still not working. I deployed the rules file and deleted the pod, and also restarted csi-azuredisk-node, but the pod is still not running. Is it possible for us to have a Zoom call or chat?
@xiaoping8385 Can you run the following command on the node and try again?
This command will reload the udev rules and trigger the creation of any new device nodes that match the rules.
sudo udevadm control --reload-rules && sudo udevadm trigger
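Putting the steps together, a minimal sketch of what this looks like on a node (assuming the rule content from the linked #1631 comment has already been saved locally as 66-azure-storage.rules):
# install the rule file (content taken from the #1631 comment referenced above)
sudo cp 66-azure-storage.rules /etc/udev/rules.d/66-azure-storage.rules
# reload the rules and re-trigger device events so already-attached disks get their symlinks
sudo udevadm control --reload-rules && sudo udevadm trigger
# verify the data-disk symlinks now exist before rescheduling the pod
ls -l /dev/disk/azure/scsi1/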
It works finally, thanks. One more question: how should we prevent this issue in the future, or do we need an extra deployment? What's the root cause of this issue?
@xiaoping8385 Per this doc, the Linux VM should have udev rules: https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/linux/troubleshoot-device-names-problems#solution
What is your k8s cluster? Are you manually setting up a k8s cluster based on Azure Ubuntu VMs?
@andyzhangx Sorry for the late reply. Our k8s cluster is actually created by the BOSH director, which has a CPI to talk with the Azure platform to create VMs. I don't think there are udev rules in our envs; as I mentioned before, we have two nodes without such a rule, but on one node the volume can be staged successfully while on the other it can't. Is there any way other than udev rules that helps stage the volume?
@xiaoping8385 Using udev rules is the most reliable way to detect a data disk correctly. Are you able to create a daemonset to write udev rules on every node?
Sometimes, yes, since this issue happens occasionally.
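A rough sketch of the kind of daemonset mentioned above (illustrative only: the names here are made up, the actual rule text from the #1631 comment still has to be pasted into the ConfigMap, and this is not an official manifest from the driver repo):
# hypothetical manifest; adjust names, image, and rule content as needed
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: azure-udev-rules
  namespace: kube-system
data:
  66-azure-storage.rules: |
    # paste the rule content from the #1631 comment here
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: azure-udev-rule-installer
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: azure-udev-rule-installer
  template:
    metadata:
      labels:
        app: azure-udev-rule-installer
    spec:
      hostPID: true
      containers:
      - name: installer
        image: busybox:1.36
        securityContext:
          privileged: true
        command:
        - /bin/sh
        - -c
        - |
          # copy the rule onto the host and reload udev in the host namespaces
          cp /rules/66-azure-storage.rules /host-udev/66-azure-storage.rules
          nsenter -t 1 -m -u -i -n -- udevadm control --reload-rules
          nsenter -t 1 -m -u -i -n -- udevadm trigger
          sleep infinity
        volumeMounts:
        - name: rules
          mountPath: /rules
        - name: host-udev
          mountPath: /host-udev
      volumes:
      - name: rules
        configMap:
          name: azure-udev-rules
      - name: host-udev
        hostPath:
          path: /etc/udev/rules.d
EOF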
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.