azuredisk-csi-driver

Mounting disks under the NVMe disk controller on Windows fails

Open Flask opened this issue 1 year ago • 7 comments

What happened: Mounting a managed disk on a VM that uses the NVMe disk controller fails.

I0620 07:48:36.892166    6464 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I0620 07:48:36.892166    6464 utils.go:78] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\disk.csi.azure.com\\3a07bbd56bedf026817504b649086872043fb4a71d1a81b17de2e82d86563b52\\globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ntfs"}},"access_mode":{"mode":7}},"volume_context":{"cachingMode":"ReadOnly","csi.storage.k8s.io/pv/name":"pvc-dcdeeaa3-cd7a-40ff-8e4e-3c3bd2430d7b","csi.storage.k8s.io/pvc/name":"mypod","csi.storage.k8s.io/pvc/namespace":"myns","fsType":"ntfs","kind":"Managed","requestedsizegib":"512","skuName":"Premium_LRS","storage.kubernetes.io/csiProvisionerIdentity":"1718807269317-6827-disk.csi.azure.com"},"volume_id":"/subscriptions/<subscription>/resourceGroups/myrg/providers/Microsoft.Compute/disks/pvc-dcdeeaa3-cd7a-40ff-8e4e-3c3bd2430d7b"}

Warning FailedMount 4m49s (x49 over 89m) kubelet MountVolume.MountDevice failed for volume "pvc-dcdeeaa3-cd7a-40ff-8e4e-3c3bd2430d7b" : rpc error: code = Internal desc = failed to find disk on lun 0. azureDisk - findDiskByLun(0) failed with error(could not find disk id for lun: 0)

What you expected to happen: The disk is mounted and the PVC is made available to the pod.

How to reproduce it: Try to attach an Azure managed disk to a Windows Kubernetes node of type Standard_D4alds_v6.

Anything else we need to know?:

Environment:

  • CSI Driver version: v1.29.2
  • Kubernetes version (use kubectl version): v1.28.5
  • OS (e.g. from /etc/os-release): Windows Server 2019/2022
  • Others: csi-proxy 1.1.2

Flask avatar Jun 20 '24 09:06 Flask

Does it always reproduce on the Standard_D4alds_v6 Windows VM SKU?

andyzhangx avatar Jun 20 '24 15:06 andyzhangx

Hey @andyzhangx, I've tried it 4-5 times with different machines in a VMSS. I think there have been some changes in how managed disks are attached to those VMs. Maybe this helps:

Managed disk on Standard_D96ads_v5:

Get-Disk

Number Friendly Name     Serial Number HealthStatus OperationalStatus Total Size Partition Style
------ -------------     ------------- ------------ ----------------- ---------- ---------------
...
11     Msft Virtual Disk               Healthy      Online            512 GB     GPT
...
ConvertTo-Json @(Get-Disk | select Number, Location)
[
    ...
    {
        "Number":  11,
        "Location":  "Integrated : Adapter 3 : Port 0 : Target 0 : LUN 0"
    },
    ...

Managed disk on Standard_D96alds_v6:

Get-Disk

Number Friendly Name              Serial Number                    HealthStatus OperationalStatus Total Size Partition Style
------ -------------              -------------                    ------------ ----------------- ---------- ---------------
...
12     MSFT NVMe Accelerator v1.0 B91B_DB34_FB4F_48EE_AC80_7234... Healthy      Online            512 GB     GPT

ConvertTo-Json @(Get-Disk | select Number, Location)
[
    ...
    {
        "Number":  12,
        "Location":  "Integrated : Adapter 0"
    }
    ...

I've removed the unrelated entries to keep it simple and replaced them with "...".

Flask avatar Jun 20 '24 15:06 Flask

@Flask so on Standard_D96alds_v6, is disk number 12 a managed data disk? The Friendly Name of that disk is MSFT NVMe Accelerator v1.0, and its Location has no LUN mapping, unlike on Standard_D96ads_v5, e.g. "Location": "Integrated : Adapter 3 : Port 0 : Target 0 : LUN 0".
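
To illustrate why a LUN lookup comes up empty for the NVMe disk, here is a minimal Go sketch. It assumes the LUN is recovered by parsing the trailing "LUN <n>" segment of the Location string; this is an illustration only, not necessarily how the driver or csi-proxy implements findDiskByLun. The names diskInfo, parseLUN and lunToDiskNumber are hypothetical, and the input combines the two Get-Disk entries quoted above (which come from different VMs) into one list for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
)

// diskInfo mirrors one entry of `ConvertTo-Json @(Get-Disk | select Number, Location)`.
type diskInfo struct {
	Number   int    `json:"Number"`
	Location string `json:"Location"`
}

// lunPattern matches a trailing "LUN <n>" segment of a SCSI-style Location string.
var lunPattern = regexp.MustCompile(`LUN (\d+)\s*$`)

// parseLUN (hypothetical helper) returns the LUN encoded in a Location string,
// or false when the string carries no LUN segment.
func parseLUN(location string) (int, bool) {
	m := lunPattern.FindStringSubmatch(location)
	if m == nil {
		return 0, false
	}
	lun, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return lun, true
}

// lunToDiskNumber (hypothetical helper) builds a LUN -> disk number map,
// skipping every disk whose Location has no LUN segment.
func lunToDiskNumber(raw []byte) (map[int]int, error) {
	var disks []diskInfo
	if err := json.Unmarshal(raw, &disks); err != nil {
		return nil, err
	}
	out := make(map[int]int)
	for _, d := range disks {
		if lun, ok := parseLUN(d.Location); ok {
			out[lun] = d.Number
		}
	}
	return out, nil
}

func main() {
	// Entries copied from the two outputs above; merged into one list for illustration.
	raw := []byte(`[
		{"Number": 11, "Location": "Integrated : Adapter 3 : Port 0 : Target 0 : LUN 0"},
		{"Number": 12, "Location": "Integrated : Adapter 0"}
	]`)
	m, err := lunToDiskNumber(raw)
	if err != nil {
		panic(err)
	}
	// Only disk 11 appears in the map; disk 12 exposes no LUN to match against.
	fmt.Println(m)
}
```

Under that assumption, any lookup keyed on the LUN has nothing to match on the v6 SKU, which is consistent with the "could not find disk id for lun: 0" error in the report.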

andyzhangx avatar Jun 23 '24 07:06 andyzhangx

Exactly. The storage class is the same in both cases:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-ntfs
parameters:
  cachingMode: ReadOnly
  fsType: ntfs
  kind: managed
  skuName: Premium_LRS
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

Flask avatar Jun 24 '24 06:06 Flask

@Flask I think there is something wrong with the Windows VM internal config for this VM SKU. Can you file a support ticket with the Azure Windows VM team? Thanks.

andyzhangx avatar Jun 28 '24 01:06 andyzhangx

On Linux, there should be a udev rule to detect data disks automatically: https://github.com/kubernetes-sigs/azuredisk-csi-driver/issues/2034#issuecomment-1854095537 I think Windows VMs should also have a similar udev-like rule on this VM SKU.

andyzhangx avatar Jul 08 '24 01:07 andyzhangx

FYI, NVMe disks are already supported on Linux nodes as of the v1.30.3 release. We still need to figure out how to get the <lun, disk-num> mapping on Windows nodes.

andyzhangx avatar Aug 01 '24 09:08 andyzhangx

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 30 '24 09:10 k8s-triage-robot