azuredisk-csi-driver
Mounting disks under an NVMe disk controller on Windows fails
What happened: Trying to mount a managed disk on a VM with an NVMe disk controller fails.
I0620 07:48:36.892166 6464 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I0620 07:48:36.892166 6464 utils.go:78] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\disk.csi.azure.com\\3a07bbd56bedf026817504b649086872043fb4a71d1a81b17de2e82d86563b52\\globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ntfs"}},"access_mode":{"mode":7}},"volume_context":{"cachingMode":"ReadOnly","csi.storage.k8s.io/pv/name":"pvc-dcdeeaa3-cd7a-40ff-8e4e-3c3bd2430d7b","csi.storage.k8s.io/pvc/name":"mypod","csi.storage.k8s.io/pvc/namespace":"myns","fsType":"ntfs","kind":"Managed","requestedsizegib":"512","skuName":"Premium_LRS","storage.kubernetes.io/csiProvisionerIdentity":"1718807269317-6827-disk.csi.azure.com"},"volume_id":"/subscriptions/<subscription>/resourceGroups/myrg/providers/Microsoft.Compute/disks/pvc-dcdeeaa3-cd7a-40ff-8e4e-3c3bd2430d7b"}
Warning FailedMount 4m49s (x49 over 89m) kubelet MountVolume.MountDevice failed for volume "pvc-dcdeeaa3-cd7a-40ff-8e4e-3c3bd2430d7b" : rpc error: code = Internal desc = failed to find disk on lun 0. azureDisk - findDiskByLun(0) failed with error(could not find disk id for lun: 0)
What you expected to happen: The volume is staged and the PVC is provided to the pod.
How to reproduce it:
Attach an Azure disk to a Windows Kubernetes node of type Standard_D4alds_v6.
Anything else we need to know?:
Environment:
- CSI Driver version: v1.29.2
- Kubernetes version (use kubectl version): v1.28.5
- OS (e.g. from /etc/os-release): Windows Server 2019/2022
- Others: csi-proxy 1.1.2
Could it always repro on the Standard_D4alds_v6 Windows VM SKU?
hey @andyzhangx, I've tried it 4-5 times with different machines in a VMSS. I think there have been some changes in how managed disks are attached to those VMs. Maybe this helps:
Managed disk on Standard_D96ads_v5:
get-disk
Number Friendly Name Serial Number HealthStatus OperationalStatus Total Size Partition
Style
------ ------------- ------------- ------------ ----------------- ---------- ----------
...
11 Msft Virtual Disk Healthy Online 512 GB GPT
...
ConvertTo-Json @(Get-Disk | select Number, Location)
[
...
{
"Number": 11,
"Location": "Integrated : Adapter 3 : Port 0 : Target 0 : LUN 0"
},
...
On the Standard_D96alds_v6:
Get-Disk
Number Friendly Name Serial Number HealthStatus OperationalStatus Total Size Partition
Style
------ ------------- ------------- ------------ ----------------- ---------- ----------
...
12 MSFT NVMe Accelerator v1.0 B91B_DB34_FB4F_48EE_AC80_7234... Healthy Online 512 GB GPT
ConvertTo-Json @(Get-Disk | select Number, Location)
[
...
{
"Number": 12,
"Location": "Integrated : Adapter 0"
}
...
I've removed the unrelated entries to keep it simple and replaced them with ...
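This lines up with the findDiskByLun error above: as far as I can tell, the Windows code path (through csi-proxy) resolves a LUN to a disk number by matching the "LUN <n>" part of exactly these Location strings, so an NVMe-attached disk that reports only "Integrated : Adapter 0" can never match. Below is a minimal Go sketch of that kind of lookup, using the two locations from the output above as input; it is a simplified illustration, not the driver's actual code.

package main

import (
	"fmt"
	"regexp"
)

var lunRE = regexp.MustCompile(`LUN (\d+)$`)

// lunFromLocation extracts the LUN from a Get-Disk Location string, e.g.
// "Integrated : Adapter 3 : Port 0 : Target 0 : LUN 0" -> "0", true.
// NVMe-attached disks report "Integrated : Adapter 0", which carries no LUN.
func lunFromLocation(location string) (string, bool) {
	m := lunRE.FindStringSubmatch(location)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Disk numbers and Location strings taken from the Get-Disk output above.
	locations := map[int]string{
		11: "Integrated : Adapter 3 : Port 0 : Target 0 : LUN 0", // Standard_D96ads_v5 (SCSI)
		12: "Integrated : Adapter 0",                             // Standard_D96alds_v6 (NVMe)
	}
	for num, loc := range locations {
		if lun, ok := lunFromLocation(loc); ok {
			fmt.Printf("disk %d exposes LUN %s\n", num, lun)
		} else {
			fmt.Printf("disk %d exposes no LUN -> lookup by LUN fails\n", num)
		}
	}
}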
@Flask so on Standard_D96alds_v6, is disk num 12 a managed data disk? The Friendly Name of that disk is MSFT NVMe Accelerator v1.0, and that disk does not have a LUN number mapping like Standard_D96ads_v5 does, e.g. "Location": "Integrated : Adapter 3 : Port 0 : Target 0 : LUN 0"
Exactly. The StorageClass is the same in both cases:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-ntfs
parameters:
  cachingMode: ReadOnly
  fsType: ntfs
  kind: managed
  skuName: Premium_LRS
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
@Flask I think there is something wrong with the Windows VM internal config for this VM SKU. Can you file a support ticket with the Azure Windows VM team? thx
On Linux, there should be a udev rule to detect the data disk automatically: https://github.com/kubernetes-sigs/azuredisk-csi-driver/issues/2034#issuecomment-1854095537. I think the Windows VM should also have a similar mechanism on this VM SKU.
FYI, NVMe disks are already supported on Linux nodes as of the v1.30.3 release; we still need to figure out how to get the <lun, disk-num> mapping on Windows nodes.
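If it helps, one possible building block for that mapping (an assumption on my part, not something the driver does on Windows today): as I understand it, the Azure remote-NVMe layout that the Linux side relies on treats namespace 1 on the MSFT NVMe Accelerator controller as the OS disk and the data disk at LUN n as namespace n + 2. If the namespace ID for a given Windows disk number could be read (which is the open question), the translation itself would be trivial; a sketch under that assumption:

package main

import (
	"errors"
	"fmt"
)

// lunFromNamespaceID maps an NVMe namespace ID on the "MSFT NVMe Accelerator"
// controller to a data-disk LUN, assuming the layout described above:
// namespace 1 is the OS disk and the data disk at LUN n is namespace n+2.
// This is an assumption for illustration only; reading the namespace ID for a
// given Windows disk number is the piece that is still missing.
func lunFromNamespaceID(nsid uint32) (uint32, error) {
	if nsid < 2 {
		return 0, errors.New("namespace is not a data disk")
	}
	return nsid - 2, nil
}

func main() {
	for _, nsid := range []uint32{1, 2, 3} {
		lun, err := lunFromNamespaceID(nsid)
		if err != nil {
			fmt.Printf("nsid %d: %v\n", nsid, err)
			continue
		}
		fmt.Printf("nsid %d -> LUN %d\n", nsid, lun)
	}
}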
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale