
[BUG] Slow ephemeral OS disk compared to regular VM

Open skycaptain opened this issue 1 year ago • 21 comments

Describe the bug

We are using an AKS cluster with KEDA to run our Azure Pipelines jobs. As our build jobs are quite IO-intensive, we chose Standard_D32ads_v5 machines with ephemeral OS disks. This lets us place the entire workload on the machines' temporary disk and simplifies our setup.
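
For reference, a node pool with this layout can be created roughly as follows (an illustrative sketch; resource names, pool name, and disk size are placeholders, not our exact setup):

# Hypothetical node pool with the ephemeral OS disk placed on the VM's local/resource disk
$ az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name buildpool \
    --node-vm-size Standard_D32ads_v5 \
    --node-osdisk-type Ephemeral \
    --node-osdisk-size 1200 \
    --node-count 1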

While conducting benchmarks with the fio commands from the docs to verify IOPS performance, we observed unexpected behavior with the AKS nodes:

$ fio --name=test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
$ fio --name=test --rw=randread --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
$ fio --name=test --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting

On average, we only achieve 80k IOPS instead of the specified 150k IOPS. This result made us curious, so we re-ran the same commands while directly SSH-ing into one of our nodes to eliminate any pipeline/container/Kubernetes overhead, and we obtained the same results. We then repeated the same tests on a separate Azure VM with the same settings (Standard_D32ads_v5 on Ephemeral OS disk in the same region), and we actually achieved the specified 150k IOPS. It appears that the Ephemeral OS disks on AKS-managed VMs are not as fast as those on regular Azure VMs. Are we missing something?
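
For the comparison, the standalone VM with an ephemeral OS disk can be created along these lines (a sketch with placeholder names; exact flags may vary by CLI version):

# Hypothetical standalone VM with the ephemeral OS disk placed on the resource (temporary) disk
$ az vm create \
    --resource-group myResourceGroup \
    --name fio-test-vm \
    --image Ubuntu2204 \
    --size Standard_D32ads_v5 \
    --ephemeral-os-disk true \
    --ephemeral-os-disk-placement ResourceDisk \
    --os-disk-caching ReadOnly \
    --os-disk-size-gb 1200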

To Reproduce Steps to reproduce the behavior:

  1. Set up an AKS cluster with Standard_D32ads_v5 machines and ephemeral OS disks
  2. Run the fio commands above
  3. Check the average IOPS

Expected behavior I expect to achieve the same level of performance on AKS nodes as on regular Azure VMs with ephemeral OS disks.

Screenshots N/A.

Environment (please complete the following information):

  • CLI Version: 2.50.0
  • Kubernetes version: 1.25.6

Additional context N/A.

skycaptain avatar Jul 13 '23 14:07 skycaptain

hmmm. you're not missing something.

what's the exact configuration when you run the test? are you writing into the pod root overlayfs, host mounting the disk, emptyDir, etc?

alexeldeib avatar Jul 13 '23 18:07 alexeldeib

what's the exact configuration when you run the test? are you writing into the pod root overlayfs, host mounting the disk, emptyDir, etc?

We use emptyDir for our workload. However, since we initially suspected that the problem might be with overlayfs or other drivers, we used kubectl-exec to get a shell directly on the node to eliminate any overhead.
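
For completeness, the emptyDir benchmark pod looks roughly like this (an illustrative sketch; the image, pod name, and mount path are placeholders, not our exact workload):

# Hypothetical pod running fio against an emptyDir volume backed by the node's local disk
$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: fio-emptydir
spec:
  restartPolicy: Never
  containers:
  - name: fio
    image: ubuntu:22.04
    command:
    - bash
    - -c
    - |
      apt-get update && apt-get install -y fio
      fio --name=test --filename=/scratch/test --rw=randrw --bs=4k --direct=1 \
          --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir: {}
EOF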

These are the commands for a minimal reproducible example:

# Create AKS cluster in West Europe
$ az aks create --name myAKSCluster --resource-group myResourceGroup -s Standard_D32ads_v5 --node-osdisk-type Ephemeral --node-osdisk-size 1200
# Get credentials for kubectl
$ az aks get-credentials --name myAKSCluster --resource-group myResourceGroup

Then, use kubectl-exec to get a shell directly on the node:

# Use kubectl-exec to get a shell on the node
$ ./kubectl-exec
root@aks-nodepool1-26530957-vmss000000:/# apt update && apt install fio
root@aks-nodepool1-26530957-vmss000000:/# fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.28
Starting 4 processes
Jobs: 4 (f=4): [m(4)][100.0%][r=160MiB/s,w=158MiB/s][r=40.9k,w=40.4k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=8515: Fri Jul 14 08:30:54 2023
  read: IOPS=40.8k, BW=159MiB/s (167MB/s)(4784MiB/30013msec)
    slat (nsec): min=1402, max=24327k, avg=35905.57, stdev=301878.58
    clat (usec): min=101, max=92076, avg=12429.66, stdev=9985.28
     lat (usec): min=108, max=92079, avg=12465.66, stdev=10006.77
    clat percentiles (usec):
     |  1.00th=[  955],  5.00th=[ 1631], 10.00th=[ 2147], 20.00th=[ 3458],
     | 30.00th=[ 5342], 40.00th=[ 7570], 50.00th=[10028], 60.00th=[12780],
     | 70.00th=[15795], 80.00th=[20055], 90.00th=[26346], 95.00th=[32113],
     | 99.00th=[43779], 99.50th=[48497], 99.90th=[58983], 99.95th=[64750],
     | 99.99th=[72877]
   bw (  KiB/s): min=96016, max=328424, per=100.00%, avg=163342.64, stdev=12166.36, samples=236
   iops        : min=24004, max=82106, avg=40835.66, stdev=3041.59, samples=236
  write: IOPS=40.8k, BW=159MiB/s (167MB/s)(4786MiB/30013msec); 0 zone resets
    slat (nsec): min=1502, max=25595k, avg=36374.11, stdev=303592.46
    clat (usec): min=62, max=92076, avg=12580.36, stdev=10015.75
     lat (usec): min=68, max=92079, avg=12616.83, stdev=10037.27
    clat percentiles (usec):
     |  1.00th=[  963],  5.00th=[ 1647], 10.00th=[ 2212], 20.00th=[ 3589],
     | 30.00th=[ 5538], 40.00th=[ 7767], 50.00th=[10290], 60.00th=[12911],
     | 70.00th=[16057], 80.00th=[20055], 90.00th=[26608], 95.00th=[32113],
     | 99.00th=[43779], 99.50th=[48497], 99.90th=[58983], 99.95th=[63701],
     | 99.99th=[72877]
   bw (  KiB/s): min=95680, max=326320, per=100.00%, avg=163413.15, stdev=12149.27, samples=236
   iops        : min=23920, max=81580, avg=40853.29, stdev=3037.32, samples=236
  lat (usec)   : 100=0.01%, 250=0.04%, 500=0.17%, 750=0.33%, 1000=0.57%
  lat (msec)   : 2=7.11%, 4=14.49%, 10=26.71%, 20=30.52%, 50=19.65%
  lat (msec)   : 100=0.40%
  cpu          : usr=2.32%, sys=8.72%, ctx=800877, majf=0, minf=70
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1224732,1225234,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), io=4784MiB (5017MB), run=30013-30013msec
  WRITE: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), io=4786MiB (5019MB), run=30013-30013msec

Disk stats (read/write):
  sda: ios=1219919/1220277, merge=2/87, ticks=11318790/11224205, in_queue=22543101, util=99.73%

skycaptain avatar Jul 14 '23 08:07 skycaptain

AKS with ephemeral OS disks uses VMSS with Standard SSD disks. That could be the cause of the performance issue.

daalse avatar Jul 17 '23 06:07 daalse

Hmm. Could you please provide a reference? I have not found any mention of this limitation in the documentation and expected it to run on the same SSDs as regular Azure VMs. In fact, they mention this in their blog:

[...] In addition, the ephemeral OS disk will share the IOPS with the temporary storage disk as per the VM size you selected. Ephemeral disks also require that the VM size supports Premium storage. The sizes usually have an s in the name, like DSv2 and EsV3. For more information, see Azure VM sizes for details around which sizes support Premium storage.

[...] This VM Series supports both VM cache and temporary storage SSD. High Scale VMs like DSv2-series that leverage Azure Premium Storage have a multi-tier caching technology called BlobCache. BlobCache uses a combination of the host RAM and local SSD for caching. This cache is available for the Premium Storage persistent disks and VM local disks. The VM cache can be used for hosting an ephemeral OS disk. When a VM series supports the VM cache, its size depends on the VM series and VM size. The VM cache size is indicated in parentheses next to IO throughput ("cache size in GiB").

skycaptain avatar Jul 17 '23 08:07 skycaptain

Not sure if it is explained somewhere, but I have 2 different AKS clusters deployed, one with an ephemeral OS disk. Checking its VMSS configuration, I can see that the attached OS disk type is: Standard HDD LRS

On the other hand, for the other AKS cluster, the OS disk type attached is: Premium SSD LRS.
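
If you have read access to the AKS-managed node resource group (the MC_* group), the effective OS disk settings can be checked along these lines (a sketch; the resource group and scale set names are placeholders):

# Inspect the OS disk profile of the node pool's scale set
$ az vmss show \
    --resource-group MC_myResourceGroup_myAKSCluster_westeurope \
    --name aks-nodepool1-26530957-vmss \
    --query "virtualMachineProfile.storageProfile.osDisk" \
    --output json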

daalse avatar Jul 17 '23 08:07 daalse

Unfortunately, I don't have access to the internal resource group created by AKS, because our IT department limits access permissions, so I can only see the AKS resource in my own resource group. However, I believe this might just be a rendering issue, as I see the same value on regular VMs: it also says "Standard HDD LRS" but at the same time "150000 IOPS".

[Screenshot: Azure portal showing the OS disk type as "Standard HDD LRS" with 150000 IOPS]

skycaptain avatar Jul 17 '23 09:07 skycaptain

the ssd thing is an implementation detail, the backing image is Standard HDD pulled to local disk

this is on my to-do list to repro but dealing with cgroupv2 issues a bit first :)

alexeldeib avatar Jul 17 '23 22:07 alexeldeib

specifically

We use emptyDir for our workload

this is surprising(ly bad performance for empty dir)

alexeldeib avatar Jul 17 '23 22:07 alexeldeib

We have some AKS clusters with even worse ephemeral disk performance.

# fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.1
Starting 4 processes
test: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [m(4)][100.0%][r=97.3MiB/s,w=97.5MiB/s][r=24.9k,w=24.0k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=3753471: Fri Feb  9 16:56:17 2024
   read: IOPS=27.0k, BW=109MiB/s (115MB/s)(3280MiB/30016msec)
    slat (nsec): min=1400, max=85393k, avg=36894.18, stdev=430688.70
    clat (usec): min=101, max=404282, avg=18366.00, stdev=17828.60
     lat (usec): min=113, max=404284, avg=18403.03, stdev=17831.82
    clat percentiles (usec):
     |  1.00th=[  1074],  5.00th=[  3163], 10.00th=[  5014], 20.00th=[  7832],
     | 30.00th=[  9896], 40.00th=[ 11731], 50.00th=[ 13829], 60.00th=[ 16319],
     | 70.00th=[ 19792], 80.00th=[ 25035], 90.00th=[ 35390], 95.00th=[ 47449],
     | 99.00th=[ 85459], 99.50th=[106431], 99.90th=[179307], 99.95th=[252707],
     | 99.99th=[371196]
   bw (  KiB/s): min=10024, max=38168, per=25.00%, avg=27975.48, stdev=4753.45, samples=240
   iops        : min= 2506, max= 9542, avg=6993.83, stdev=1188.36, samples=240
  write: IOPS=28.0k, BW=109MiB/s (115MB/s)(3286MiB/30016msec)
    slat (nsec): min=1600, max=80643k, avg=65559.22, stdev=532054.65
    clat (usec): min=37, max=404190, avg=18097.22, stdev=17720.12
     lat (usec): min=61, max=404201, avg=18162.95, stdev=17718.90
    clat percentiles (usec):
     |  1.00th=[   938],  5.00th=[  2933], 10.00th=[  4752], 20.00th=[  7570],
     | 30.00th=[  9634], 40.00th=[ 11469], 50.00th=[ 13566], 60.00th=[ 16057],
     | 70.00th=[ 19530], 80.00th=[ 24773], 90.00th=[ 34866], 95.00th=[ 47449],
     | 99.00th=[ 85459], 99.50th=[104334], 99.90th=[179307], 99.95th=[248513],
     | 99.99th=[371196]
   bw (  KiB/s): min=10008, max=37992, per=25.01%, avg=28029.28, stdev=4702.18, samples=240
   iops        : min= 2502, max= 9498, avg=7007.27, stdev=1175.56, samples=240
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.09%, 500=0.25%, 750=0.30%
  lat (usec)   : 1000=0.34%
  lat (msec)   : 2=1.70%, 4=4.71%, 10=23.97%, 20=39.31%, 50=24.95%
  lat (msec)   : 100=3.79%, 250=0.54%, 500=0.05%
  cpu          : usr=2.26%, sys=18.58%, ctx=1159484, majf=0, minf=8305
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwt: total=839618,841158,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=109MiB/s (115MB/s), 109MiB/s-109MiB/s (115MB/s-115MB/s), io=3280MiB (3439MB), run=30016-30016msec
  WRITE: bw=109MiB/s (115MB/s), 109MiB/s-109MiB/s (115MB/s-115MB/s), io=3286MiB (3445MB), run=30016-30016msec

Disk stats (read/write):
  sda: ios=834701/836459, merge=2/156, ticks=10581685/10425060, in_queue=17674896, util=99.71%

fsniper avatar Feb 12 '24 11:02 fsniper

Thanks for opening this issue. We are experiencing the same and could reproduce it on standalone VMs.

Azure Support Request ID: 2403070050000924

Windows

Windows VM (instance type Standard_D96ads_v5):

                "storageProfile": {
                    "osDisk": {
                        "createOption": "fromImage",
                        "diskSizeGB": 2040,
                        "managedDisk": {
                            "storageAccountType": "Standard_LRS"
                        },
                        "caching": "ReadOnly",
                        "diffDiskSettings": {
                            "option": "Local",
                            "placement": "ResourceDisk"
                        },
                        "deleteOption": "Delete"
                    },
                    "imageReference": {
                        "publisher": "MicrosoftWindowsServer",
                        "offer": "WindowsServer",
                        "sku": "2019-datacenter-gensecond",
                        "version": "latest"
                    }
                },

Result on OS disk drive (C:):

PS C:\Users\redacted>  fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=windowsaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=windowsaio, iodepth=256
...
fio-3.36
Starting 4 threads
test: Laying out IO files (2 files / total 30720MiB)
Jobs: 4 (f=8): [m(4)][100.0%][r=159MiB/s,w=157MiB/s][r=40.7k,w=40.3k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=3460: Wed Mar 6 16:04:01 2024
  read: IOPS=37.5k, BW=147MiB/s (154MB/s)(4397MiB/30013msec)
    slat (nsec): min=3700, max=90600, avg=6934.13, stdev=1319.90
    clat (msec): min=2, max=836, avg=12.93, stdev= 2.65
     lat (msec): min=2, max=836, avg=12.93, stdev= 2.65
    clat percentiles (msec):
     |  1.00th=[   11],  5.00th=[   12], 10.00th=[   12], 20.00th=[   13],
     | 30.00th=[   13], 40.00th=[   13], 50.00th=[   13], 60.00th=[   14],
     | 70.00th=[   14], 80.00th=[   14], 90.00th=[   14], 95.00th=[   15],
     | 99.00th=[   15], 99.50th=[   16], 99.90th=[   17], 99.95th=[   22],
     | 99.99th=[  124]
   bw (  KiB/s): min=112901, max=165761, per=100.00%, avg=158036.82, stdev=2210.54, samples=224
   iops        : min=28223, max=41439, avg=39508.04, stdev=552.64, samples=224
  write: IOPS=37.6k, BW=147MiB/s (154MB/s)(4402MiB/30013msec); 0 zone resets
    slat (usec): min=4, max=472, avg= 7.32, stdev= 1.42
    clat (msec): min=2, max=835, avg=12.87, stdev= 2.74
     lat (msec): min=2, max=835, avg=12.88, stdev= 2.74
    clat percentiles (msec):
     |  1.00th=[   11],  5.00th=[   12], 10.00th=[   12], 20.00th=[   13],
     | 30.00th=[   13], 40.00th=[   13], 50.00th=[   13], 60.00th=[   13],
     | 70.00th=[   14], 80.00th=[   14], 90.00th=[   14], 95.00th=[   15],
     | 99.00th=[   15], 99.50th=[   16], 99.90th=[   18], 99.95th=[   24],
     | 99.99th=[  124]
   bw (  KiB/s): min=116413, max=165865, per=100.00%, avg=158249.00, stdev=2121.27, samples=224
   iops        : min=29102, max=41464, avg=39561.23, stdev=530.30, samples=224
  lat (msec)   : 4=0.01%, 10=0.10%, 20=99.83%, 50=0.01%, 250=0.05%
  lat (msec)   : 1000=0.01%
  cpu          : usr=0.00%, sys=12.50%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=0.1%
     issued rwts: total=1125614,1127004,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=147MiB/s (154MB/s), 147MiB/s-147MiB/s (154MB/s-154MB/s), io=4397MiB (4611MB), run=30013-30013msec
  WRITE: bw=147MiB/s (154MB/s), 147MiB/s-147MiB/s (154MB/s-154MB/s), io=4402MiB (4616MB), run=30013-30013msec

Result on tmp disk drive (D:):

PS D:\>  fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=windowsaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=windowsaio, iodepth=256
...
fio-3.36
Starting 4 threads
test: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [m(4)][100.0%][r=659MiB/s,w=663MiB/s][r=169k,w=170k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=7160: Wed Mar 6 16:05:24 2024
  read: IOPS=138k, BW=538MiB/s (565MB/s)(15.8GiB/30003msec)
    slat (usec): min=2, max=5876, avg= 8.82, stdev=49.05
    clat (usec): min=76, max=155071, avg=3126.87, stdev=8496.17
     lat (usec): min=114, max=155490, avg=3135.70, stdev=8529.38
    clat percentiles (usec):
     |  1.00th=[   979],  5.00th=[  1385], 10.00th=[  1532], 20.00th=[  1811],
     | 30.00th=[  2057], 40.00th=[  2212], 50.00th=[  2409], 60.00th=[  2671],
     | 70.00th=[  2802], 80.00th=[  2933], 90.00th=[  3228], 95.00th=[  5211],
     | 99.00th=[ 10945], 99.50th=[ 15008], 99.90th=[149947], 99.95th=[149947],
     | 99.99th=[152044]
   bw (  KiB/s): min=  572, max=940577, per=100.00%, avg=569406.32, stdev=72308.44, samples=228
   iops        : min=  140, max=235144, avg=142350.51, stdev=18077.13, samples=228
  write: IOPS=138k, BW=538MiB/s (564MB/s)(15.8GiB/30003msec); 0 zone resets
    slat (usec): min=2, max=6530, avg= 9.46, stdev=50.19
    clat (usec): min=8, max=155215, avg=3079.78, stdev=8620.81
     lat (usec): min=42, max=156215, avg=3089.24, stdev=8654.57
    clat percentiles (usec):
     |  1.00th=[   914],  5.00th=[  1319], 10.00th=[  1483], 20.00th=[  1745],
     | 30.00th=[  1991], 40.00th=[  2147], 50.00th=[  2343], 60.00th=[  2606],
     | 70.00th=[  2737], 80.00th=[  2868], 90.00th=[  3163], 95.00th=[  5145],
     | 99.00th=[ 10945], 99.50th=[ 15270], 99.90th=[149947], 99.95th=[149947],
     | 99.99th=[152044]
   bw (  KiB/s): min=  708, max=942222, per=100.00%, avg=569302.63, stdev=72263.71, samples=228
   iops        : min=  174, max=235555, avg=142324.65, stdev=18065.98, samples=228
  lat (usec)   : 10=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=0.02%
  lat (usec)   : 750=0.22%, 1000=1.08%
  lat (msec)   : 2=27.76%, 4=64.10%, 10=5.53%, 20=0.91%, 50=0.02%
  lat (msec)   : 100=0.02%, 250=0.33%
  cpu          : usr=6.67%, sys=28.33%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.7%, >=64=99.1%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.2%, 8=0.5%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=0.1%
     issued rwts: total=4135353,4134254,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=538MiB/s (565MB/s), 538MiB/s-538MiB/s (565MB/s-565MB/s), io=15.8GiB (16.9GB), run=30003-30003msec
  WRITE: bw=538MiB/s (564MB/s), 538MiB/s-538MiB/s (564MB/s-564MB/s), io=15.8GiB (16.9GB), run=30003-30003msec

Linux

Same behaviour on Linux nodes with the same configuration (slightly higher throughput, but still well below the documented limits).

Linux VM /dev/sda:

$ fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.16
Starting 4 processes
test: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [m(4)][100.0%][r=165MiB/s,w=166MiB/s][r=42.2k,w=42.4k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=5083: Mon Mar 11 14:39:24 2024
  read: IOPS=41.3k, BW=161MiB/s (169MB/s)(4844MiB/30013msec)
    slat (nsec): min=1794, max=7176.0k, avg=3447.61, stdev=14679.02
    clat (usec): min=95, max=61056, avg=12404.20, stdev=5252.66
     lat (usec): min=99, max=61062, avg=12407.73, stdev=5252.56
    clat percentiles (usec):
     |  1.00th=[ 1942],  5.00th=[ 4113], 10.00th=[ 6063], 20.00th=[ 8586],
     | 30.00th=[ 9896], 40.00th=[10945], 50.00th=[11994], 60.00th=[12911],
     | 70.00th=[14091], 80.00th=[16188], 90.00th=[19268], 95.00th=[22152],
     | 99.00th=[27919], 99.50th=[30016], 99.90th=[34866], 99.95th=[36963],
     | 99.99th=[42206]
   bw (  KiB/s): min=126600, max=218720, per=99.99%, avg=165255.52, stdev=4775.38, samples=240
   iops        : min=31650, max=54680, avg=41313.80, stdev=1193.85, samples=240
  write: IOPS=41.3k, BW=162MiB/s (169MB/s)(4848MiB/30013msec); 0 zone resets
    slat (nsec): min=1864, max=10598k, avg=3637.47, stdev=15942.32
    clat (usec): min=51, max=67756, avg=12358.53, stdev=5258.65
     lat (usec): min=58, max=67765, avg=12362.26, stdev=5258.56
    clat percentiles (usec):
     |  1.00th=[ 1893],  5.00th=[ 4047], 10.00th=[ 5997], 20.00th=[ 8455],
     | 30.00th=[ 9765], 40.00th=[10945], 50.00th=[11994], 60.00th=[12780],
     | 70.00th=[14091], 80.00th=[16057], 90.00th=[19268], 95.00th=[22152],
     | 99.00th=[27657], 99.50th=[29754], 99.90th=[34866], 99.95th=[36439],
     | 99.99th=[41681]
   bw (  KiB/s): min=125984, max=218256, per=99.99%, avg=165388.25, stdev=4803.97, samples=240
   iops        : min=31496, max=54564, avg=41346.98, stdev=1200.99, samples=240
  lat (usec)   : 100=0.01%, 250=0.03%, 500=0.08%, 750=0.10%, 1000=0.13%
  lat (msec)   : 2=0.76%, 4=3.70%, 10=26.61%, 20=60.16%, 50=8.42%
  lat (msec)   : 100=0.01%
  cpu          : usr=2.56%, sys=9.09%, ctx=855077, majf=0, minf=362
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1240047,1241056,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=161MiB/s (169MB/s), 161MiB/s-161MiB/s (169MB/s-169MB/s), io=4844MiB (5079MB), run=30013-30013msec
  WRITE: bw=162MiB/s (169MB/s), 162MiB/s-162MiB/s (169MB/s-169MB/s), io=4848MiB (5083MB), run=30013-30013msec

Disk stats (read/write):
  sda: ios=1230258/1233558, merge=0/59, ticks=15245610/15213443, in_queue=30459065, util=99.74%

Linux VM on /dev/sdb:

$ sudo fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.16
Starting 4 processes
test: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [m(4)][100.0%][r=884MiB/s,w=884MiB/s][r=226k,w=226k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=5315: Mon Mar 11 14:44:04 2024
  read: IOPS=227k, BW=888MiB/s (931MB/s)(26.0GiB/30006msec)
    slat (nsec): min=1903, max=165258, avg=3956.31, stdev=2428.61
    clat (usec): min=101, max=38970, avg=2278.11, stdev=1649.00
     lat (usec): min=106, max=38974, avg=2282.14, stdev=1648.94
    clat percentiles (usec):
     |  1.00th=[ 1020],  5.00th=[ 1647], 10.00th=[ 1729], 20.00th=[ 1827],
     | 30.00th=[ 1893], 40.00th=[ 1958], 50.00th=[ 2008], 60.00th=[ 2073],
     | 70.00th=[ 2147], 80.00th=[ 2245], 90.00th=[ 2474], 95.00th=[ 3523],
     | 99.00th=[ 7701], 99.50th=[15926], 99.90th=[22938], 99.95th=[23462],
     | 99.99th=[24249]
   bw (  KiB/s): min=678033, max=1066016, per=100.00%, avg=909662.47, stdev=17331.74, samples=240
   iops        : min=169508, max=266504, avg=227415.52, stdev=4332.94, samples=240
  write: IOPS=227k, BW=888MiB/s (931MB/s)(26.0GiB/30006msec); 0 zone resets
    slat (nsec): min=1993, max=297325, avg=4342.53, stdev=2561.83
    clat (usec): min=48, max=38642, avg=2216.51, stdev=1645.97
     lat (usec): min=52, max=38645, avg=2220.93, stdev=1645.91
    clat percentiles (usec):
     |  1.00th=[  938],  5.00th=[ 1582], 10.00th=[ 1680], 20.00th=[ 1778],
     | 30.00th=[ 1844], 40.00th=[ 1893], 50.00th=[ 1942], 60.00th=[ 2008],
     | 70.00th=[ 2089], 80.00th=[ 2180], 90.00th=[ 2409], 95.00th=[ 3458],
     | 99.00th=[ 7635], 99.50th=[15795], 99.90th=[22938], 99.95th=[23462],
     | 99.99th=[24249]
   bw (  KiB/s): min=680023, max=1069920, per=100.00%, avg=909058.05, stdev=17212.36, samples=240
   iops        : min=170005, max=267480, avg=227264.42, stdev=4303.11, samples=240
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.10%, 500=0.22%, 750=0.31%
  lat (usec)   : 1000=0.41%
  lat (msec)   : 2=52.39%, 4=42.75%, 10=3.12%, 20=0.34%, 50=0.36%
  cpu          : usr=9.50%, sys=50.41%, ctx=2411385, majf=0, minf=408
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=6823830,6819254,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=888MiB/s (931MB/s), 888MiB/s-888MiB/s (931MB/s-931MB/s), io=26.0GiB (27.9GB), run=30006-30006msec
  WRITE: bw=888MiB/s (931MB/s), 888MiB/s-888MiB/s (931MB/s-931MB/s), io=26.0GiB (27.9GB), run=30006-30006msec

Disk stats (read/write):
  sdb: ios=6820239/6815740, merge=0/236, ticks=14524924/14052098, in_queue=28577021, util=99.73%

Our assumption would be that the speeds of these two disks should be similar, if not identical, since the OS disk is placed on the temp disk.

Attached is an ARM deployment template which can be used to reproduce it. I attached the Linux one; if you like, I can also attach the Windows one. Make sure to look at parameters.json to adjust it, then apply with:

# make sure to adjust resource group and use an existing one.
az deployment group create --resource-group ephemeraltest --template-file template.json --parameters @parameters.json

parameters.json template.json

mweibel avatar Mar 11 '24 15:03 mweibel

FYI, we have been in contact with Azure support about this issue for a few weeks now. No update yet - we haven't had the easiest time convincing support that this actually is an issue.

mweibel avatar Apr 19 '24 07:04 mweibel

Issue needing attention of @Azure/aks-leads
