[BUG] Slow ephemeral OS disk compared to regular VM
Describe the bug
We are using an AKS cluster with KEDA to run our Azure Pipelines jobs. As our build jobs are quite IO-intensive, we chose to use Standard_D32ads_v5 machines with Ephemeral OS disks. This allows us to put all workload on the temporary disk of the machines and simplifies our setup.
While conducting benchmarks with the fio commands from the docs to verify IOPS performance, we observed unexpected behavior with the AKS nodes:
$ fio --name=test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
$ fio --name=test --rw=randread --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
$ fio --name=test --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
On average, we only achieve 80k IOPS instead of the specified 150k IOPS. This result made us curious, so we re-ran the same commands while directly SSH-ing into one of our nodes to eliminate any pipeline/container/Kubernetes overhead, and we obtained the same results. We then repeated the same tests on a separate Azure VM with the same settings (Standard_D32ads_v5 on Ephemeral OS disk in the same region), and we actually achieved the specified 150k IOPS. It appears that the Ephemeral OS disks on AKS-managed VMs are not as fast as those on regular Azure VMs. Are we missing something?
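For reference, the standalone comparison VM can be created roughly like this (a sketch, not the exact command we ran; resource group, VM name, and image are placeholders):

$ az vm create \
    --resource-group myResourceGroup \
    --name ephemeral-test-vm \
    --location westeurope \
    --size Standard_D32ads_v5 \
    --image Ubuntu2204 \
    --ephemeral-os-disk true \
    --ephemeral-os-disk-placement ResourceDisk \
    --os-disk-caching ReadOnly \
    --admin-username azureuser \
    --generate-ssh-keys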
To Reproduce
Steps to reproduce the behavior:
- Set up an AKS cluster with Standard_D32ads_v5 machines and ephemeral OS disks
- Run the above fio commands
- Check avg. IOPS
Expected behavior
I expect to achieve the same level of performance on AKS nodes as on regular Azure VMs with ephemeral OS disks.
Screenshots
N/A.
Environment (please complete the following information):
- CLI Version: 2.50.0
- Kubernetes version: 1.25.6
Additional context
N/A.
hmmm. you're not missing something.
what's the exact configuration when you run the test? are you writing into the pod root overlayfs, host mounting the disk, emptyDir, etc?
> what's the exact configuration when you run the test? are you writing into the pod root overlayfs, host mounting the disk, emptyDir, etc?
We use emptyDir for our workload. However, since we initially suspected that the problem might be with overlayfs or other drivers, we used kubectl-exec to get a shell directly on the node to eliminate any overhead.
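For completeness, the workload pods mount a plain emptyDir, roughly like this (a minimal sketch, not our actual spec; the pod name and image are placeholders, and fio is installed in the container at runtime):

$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: fio-emptydir
spec:
  restartPolicy: Never
  containers:
  - name: fio
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch   # fio runs with --filename=/scratch/test
  volumes:
  - name: scratch
    emptyDir: {}            # a directory under the kubelet root dir on the node's OS disk
EOF

Since an emptyDir is just a directory under the kubelet root (/var/lib/kubelet by default), it sits on the ephemeral OS disk, so we expected it to see the full temp-disk IOPS.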
These are the commands for a minimal reproducible example:
# Create AKS cluster in West Europe
$ az aks create --name myAKSCluster --resource-group myResourceGroup -s Standard_D32ads_v5 --node-osdisk-type Ephemeral --node-osdisk-size 1200
# Get credentials for kubectl
$ az aks get-credentials --name myAKSCluster --resource-group myResourceGroup
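Optionally, the node pool settings can be sanity-checked at this point (assuming the default pool name nodepool1; it should report Ephemeral):
# Verify that the node pool uses an ephemeral OS disk
$ az aks nodepool show --cluster-name myAKSCluster --resource-group myResourceGroup --name nodepool1 --query osDiskType -o tsv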
Then, use kubectl-exec to get a shell directly on the node:
# Use kubectl-exec to get a shell on the node
$ ./kubectl-exec
root@aks-nodepool1-26530957-vmss000000:/# apt update && apt install fio
root@aks-nodepool1-26530957-vmss000000:/# fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.28
Starting 4 processes
Jobs: 4 (f=4): [m(4)][100.0%][r=160MiB/s,w=158MiB/s][r=40.9k,w=40.4k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=8515: Fri Jul 14 08:30:54 2023
read: IOPS=40.8k, BW=159MiB/s (167MB/s)(4784MiB/30013msec)
slat (nsec): min=1402, max=24327k, avg=35905.57, stdev=301878.58
clat (usec): min=101, max=92076, avg=12429.66, stdev=9985.28
lat (usec): min=108, max=92079, avg=12465.66, stdev=10006.77
clat percentiles (usec):
| 1.00th=[ 955], 5.00th=[ 1631], 10.00th=[ 2147], 20.00th=[ 3458],
| 30.00th=[ 5342], 40.00th=[ 7570], 50.00th=[10028], 60.00th=[12780],
| 70.00th=[15795], 80.00th=[20055], 90.00th=[26346], 95.00th=[32113],
| 99.00th=[43779], 99.50th=[48497], 99.90th=[58983], 99.95th=[64750],
| 99.99th=[72877]
bw ( KiB/s): min=96016, max=328424, per=100.00%, avg=163342.64, stdev=12166.36, samples=236
iops : min=24004, max=82106, avg=40835.66, stdev=3041.59, samples=236
write: IOPS=40.8k, BW=159MiB/s (167MB/s)(4786MiB/30013msec); 0 zone resets
slat (nsec): min=1502, max=25595k, avg=36374.11, stdev=303592.46
clat (usec): min=62, max=92076, avg=12580.36, stdev=10015.75
lat (usec): min=68, max=92079, avg=12616.83, stdev=10037.27
clat percentiles (usec):
| 1.00th=[ 963], 5.00th=[ 1647], 10.00th=[ 2212], 20.00th=[ 3589],
| 30.00th=[ 5538], 40.00th=[ 7767], 50.00th=[10290], 60.00th=[12911],
| 70.00th=[16057], 80.00th=[20055], 90.00th=[26608], 95.00th=[32113],
| 99.00th=[43779], 99.50th=[48497], 99.90th=[58983], 99.95th=[63701],
| 99.99th=[72877]
bw ( KiB/s): min=95680, max=326320, per=100.00%, avg=163413.15, stdev=12149.27, samples=236
iops : min=23920, max=81580, avg=40853.29, stdev=3037.32, samples=236
lat (usec) : 100=0.01%, 250=0.04%, 500=0.17%, 750=0.33%, 1000=0.57%
lat (msec) : 2=7.11%, 4=14.49%, 10=26.71%, 20=30.52%, 50=19.65%
lat (msec) : 100=0.40%
cpu : usr=2.32%, sys=8.72%, ctx=800877, majf=0, minf=70
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=1224732,1225234,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), io=4784MiB (5017MB), run=30013-30013msec
WRITE: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), io=4786MiB (5019MB), run=30013-30013msec
Disk stats (read/write):
sda: ios=1219919/1220277, merge=2/87, ticks=11318790/11224205, in_queue=22543101, util=99.73%
AKS clusters with ephemeral OS disks use VMSS with Standard SSD disks. That could be the cause of the performance issue.
Hmm. Could you please provide a reference? I have not found any mention of this limitation in the documentation and was expecting that it would run on the same SSDs as regular Azure VMs. In fact, they mention this in their blog:
[...] In addition, the ephemeral OS disk will share the IOPS with the temporary storage disk as per the VM size you selected. Ephemeral disks also require that the VM size supports Premium storage. The sizes usually have an s in the name, like DSv2 and EsV3. For more information, see Azure VM sizes for details around which sizes support Premium storage.
[...] This VM Series supports both VM cache and temporary storage SSD. High Scale VMs like DSv2-series that leverage Azure Premium Storage have a multi-tier caching technology called BlobCache. BlobCache uses a combination of the host RAM and local SSD for caching. This cache is available for the Premium Storage persistent disks and VM local disks. The VM cache can be used for hosting an ephemeral OS disk. When a VM series supports the VM cache, its size depends on the VM series and VM size. The VM cache size is indicated in parentheses next to IO throughput ("cache size in GiB").
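As a side note, the documented limits for the size can be pulled from the SKU metadata, e.g. (a sketch; the exact capability names, such as UncachedDiskIOPS or the combined temp-disk/cache IOPS entries, vary by VM series):
$ az vm list-skus --location westeurope --size Standard_D32ads_v5 --query "[0].capabilities" -o table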
Not sure if it is explained somewhere, but I have two different AKS clusters deployed, one of them with an ephemeral OS disk. Checking its VMSS configuration, I can see that the attached OS disk type is: Standard HDD LRS.
For the other AKS cluster, the attached OS disk type is: Premium SSD LRS.
Unfortunately, I don't have access to the internal resource group created by AKS (our IT department limits access permissions, so I can only see the AKS resource in my own resource group). However, I believe this might just be a rendering issue, as I can see the same values for regular VMs: the portal also shows "Standard HDD LRS" but at the same time "150000 IOPS".
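If someone with access to the node resource group (the MC_* one) can check, the effective OS disk settings should be visible with something along these lines (resource group, VMSS name, and region are placeholders based on the example above):
$ az vmss show --resource-group MC_myResourceGroup_myAKSCluster_westeurope --name aks-nodepool1-26530957-vmss --query "virtualMachineProfile.storageProfile.osDisk"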
the ssd thing is an implementation detail, the backing image is Standard HDD pulled to local disk
this is on my to-do list to repro but dealing with cgroupv2 issues a bit first :)
specifically
> We use emptyDir for our workload
this is surprising(ly bad performance for empty dir)
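One thing that could help narrow it down: on the node itself, check which block device actually backs the kubelet root dir (and therefore every emptyDir). A sketch, assuming the default kubelet root of /var/lib/kubelet:
root@aks-node:/# findmnt --target /var/lib/kubelet
root@aks-node:/# lsblk --output NAME,SIZE,TYPE,MOUNTPOINT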
We see even worse ephemeral disk performance on some of our AKS clusters.
# fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.1
Starting 4 processes
test: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [m(4)][100.0%][r=97.3MiB/s,w=97.5MiB/s][r=24.9k,w=24.0k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=3753471: Fri Feb 9 16:56:17 2024
read: IOPS=27.0k, BW=109MiB/s (115MB/s)(3280MiB/30016msec)
slat (nsec): min=1400, max=85393k, avg=36894.18, stdev=430688.70
clat (usec): min=101, max=404282, avg=18366.00, stdev=17828.60
lat (usec): min=113, max=404284, avg=18403.03, stdev=17831.82
clat percentiles (usec):
| 1.00th=[ 1074], 5.00th=[ 3163], 10.00th=[ 5014], 20.00th=[ 7832],
| 30.00th=[ 9896], 40.00th=[ 11731], 50.00th=[ 13829], 60.00th=[ 16319],
| 70.00th=[ 19792], 80.00th=[ 25035], 90.00th=[ 35390], 95.00th=[ 47449],
| 99.00th=[ 85459], 99.50th=[106431], 99.90th=[179307], 99.95th=[252707],
| 99.99th=[371196]
bw ( KiB/s): min=10024, max=38168, per=25.00%, avg=27975.48, stdev=4753.45, samples=240
iops : min= 2506, max= 9542, avg=6993.83, stdev=1188.36, samples=240
write: IOPS=28.0k, BW=109MiB/s (115MB/s)(3286MiB/30016msec)
slat (nsec): min=1600, max=80643k, avg=65559.22, stdev=532054.65
clat (usec): min=37, max=404190, avg=18097.22, stdev=17720.12
lat (usec): min=61, max=404201, avg=18162.95, stdev=17718.90
clat percentiles (usec):
| 1.00th=[ 938], 5.00th=[ 2933], 10.00th=[ 4752], 20.00th=[ 7570],
| 30.00th=[ 9634], 40.00th=[ 11469], 50.00th=[ 13566], 60.00th=[ 16057],
| 70.00th=[ 19530], 80.00th=[ 24773], 90.00th=[ 34866], 95.00th=[ 47449],
| 99.00th=[ 85459], 99.50th=[104334], 99.90th=[179307], 99.95th=[248513],
| 99.99th=[371196]
bw ( KiB/s): min=10008, max=37992, per=25.01%, avg=28029.28, stdev=4702.18, samples=240
iops : min= 2502, max= 9498, avg=7007.27, stdev=1175.56, samples=240
lat (usec) : 50=0.01%, 100=0.01%, 250=0.09%, 500=0.25%, 750=0.30%
lat (usec) : 1000=0.34%
lat (msec) : 2=1.70%, 4=4.71%, 10=23.97%, 20=39.31%, 50=24.95%
lat (msec) : 100=3.79%, 250=0.54%, 500=0.05%
cpu : usr=2.26%, sys=18.58%, ctx=1159484, majf=0, minf=8305
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwt: total=839618,841158,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=109MiB/s (115MB/s), 109MiB/s-109MiB/s (115MB/s-115MB/s), io=3280MiB (3439MB), run=30016-30016msec
WRITE: bw=109MiB/s (115MB/s), 109MiB/s-109MiB/s (115MB/s-115MB/s), io=3286MiB (3445MB), run=30016-30016msec
Disk stats (read/write):
sda: ios=834701/836459, merge=2/156, ticks=10581685/10425060, in_queue=17674896, util=99.71%
Thanks for opening this issue. We experience the same and could reproduce it on standard VMs.
Azure Support Request ID: 2403070050000924
Windows
Windows VM (instance type Standard_D96ads_v5):
"storageProfile": {
"osDisk": {
"createOption": "fromImage",
"diskSizeGB": 2040,
"managedDisk": {
"storageAccountType": "Standard_LRS"
},
"caching": "ReadOnly",
"diffDiskSettings": {
"option": "Local",
"placement": "ResourceDisk"
},
"deleteOption": "Delete"
},
"imageReference": {
"publisher": "MicrosoftWindowsServer",
"offer": "WindowsServer",
"sku": "2019-datacenter-gensecond",
"version": "latest"
}
},
Result on the OS disk drive (C:):
PS C:\Users\redacted> fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=windowsaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=windowsaio, iodepth=256
...
fio-3.36
Starting 4 threads
test: Laying out IO files (2 files / total 30720MiB)
Jobs: 4 (f=8): [m(4)][100.0%][r=159MiB/s,w=157MiB/s][r=40.7k,w=40.3k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=3460: Wed Mar 6 16:04:01 2024
read: IOPS=37.5k, BW=147MiB/s (154MB/s)(4397MiB/30013msec)
slat (nsec): min=3700, max=90600, avg=6934.13, stdev=1319.90
clat (msec): min=2, max=836, avg=12.93, stdev= 2.65
lat (msec): min=2, max=836, avg=12.93, stdev= 2.65
clat percentiles (msec):
| 1.00th=[ 11], 5.00th=[ 12], 10.00th=[ 12], 20.00th=[ 13],
| 30.00th=[ 13], 40.00th=[ 13], 50.00th=[ 13], 60.00th=[ 14],
| 70.00th=[ 14], 80.00th=[ 14], 90.00th=[ 14], 95.00th=[ 15],
| 99.00th=[ 15], 99.50th=[ 16], 99.90th=[ 17], 99.95th=[ 22],
| 99.99th=[ 124]
bw ( KiB/s): min=112901, max=165761, per=100.00%, avg=158036.82, stdev=2210.54, samples=224
iops : min=28223, max=41439, avg=39508.04, stdev=552.64, samples=224
write: IOPS=37.6k, BW=147MiB/s (154MB/s)(4402MiB/30013msec); 0 zone resets
slat (usec): min=4, max=472, avg= 7.32, stdev= 1.42
clat (msec): min=2, max=835, avg=12.87, stdev= 2.74
lat (msec): min=2, max=835, avg=12.88, stdev= 2.74
clat percentiles (msec):
| 1.00th=[ 11], 5.00th=[ 12], 10.00th=[ 12], 20.00th=[ 13],
| 30.00th=[ 13], 40.00th=[ 13], 50.00th=[ 13], 60.00th=[ 13],
| 70.00th=[ 14], 80.00th=[ 14], 90.00th=[ 14], 95.00th=[ 15],
| 99.00th=[ 15], 99.50th=[ 16], 99.90th=[ 18], 99.95th=[ 24],
| 99.99th=[ 124]
bw ( KiB/s): min=116413, max=165865, per=100.00%, avg=158249.00, stdev=2121.27, samples=224
iops : min=29102, max=41464, avg=39561.23, stdev=530.30, samples=224
lat (msec) : 4=0.01%, 10=0.10%, 20=99.83%, 50=0.01%, 250=0.05%
lat (msec) : 1000=0.01%
cpu : usr=0.00%, sys=12.50%, ctx=0, majf=0, minf=0
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=0.1%
issued rwts: total=1125614,1127004,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=147MiB/s (154MB/s), 147MiB/s-147MiB/s (154MB/s-154MB/s), io=4397MiB (4611MB), run=30013-30013msec
WRITE: bw=147MiB/s (154MB/s), 147MiB/s-147MiB/s (154MB/s-154MB/s), io=4402MiB (4616MB), run=30013-30013msec
Result on the temp disk drive (D:):
PS D:\> fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=windowsaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=windowsaio, iodepth=256
...
fio-3.36
Starting 4 threads
test: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [m(4)][100.0%][r=659MiB/s,w=663MiB/s][r=169k,w=170k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=7160: Wed Mar 6 16:05:24 2024
read: IOPS=138k, BW=538MiB/s (565MB/s)(15.8GiB/30003msec)
slat (usec): min=2, max=5876, avg= 8.82, stdev=49.05
clat (usec): min=76, max=155071, avg=3126.87, stdev=8496.17
lat (usec): min=114, max=155490, avg=3135.70, stdev=8529.38
clat percentiles (usec):
| 1.00th=[ 979], 5.00th=[ 1385], 10.00th=[ 1532], 20.00th=[ 1811],
| 30.00th=[ 2057], 40.00th=[ 2212], 50.00th=[ 2409], 60.00th=[ 2671],
| 70.00th=[ 2802], 80.00th=[ 2933], 90.00th=[ 3228], 95.00th=[ 5211],
| 99.00th=[ 10945], 99.50th=[ 15008], 99.90th=[149947], 99.95th=[149947],
| 99.99th=[152044]
bw ( KiB/s): min= 572, max=940577, per=100.00%, avg=569406.32, stdev=72308.44, samples=228
iops : min= 140, max=235144, avg=142350.51, stdev=18077.13, samples=228
write: IOPS=138k, BW=538MiB/s (564MB/s)(15.8GiB/30003msec); 0 zone resets
slat (usec): min=2, max=6530, avg= 9.46, stdev=50.19
clat (usec): min=8, max=155215, avg=3079.78, stdev=8620.81
lat (usec): min=42, max=156215, avg=3089.24, stdev=8654.57
clat percentiles (usec):
| 1.00th=[ 914], 5.00th=[ 1319], 10.00th=[ 1483], 20.00th=[ 1745],
| 30.00th=[ 1991], 40.00th=[ 2147], 50.00th=[ 2343], 60.00th=[ 2606],
| 70.00th=[ 2737], 80.00th=[ 2868], 90.00th=[ 3163], 95.00th=[ 5145],
| 99.00th=[ 10945], 99.50th=[ 15270], 99.90th=[149947], 99.95th=[149947],
| 99.99th=[152044]
bw ( KiB/s): min= 708, max=942222, per=100.00%, avg=569302.63, stdev=72263.71, samples=228
iops : min= 174, max=235555, avg=142324.65, stdev=18065.98, samples=228
lat (usec) : 10=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=0.02%
lat (usec) : 750=0.22%, 1000=1.08%
lat (msec) : 2=27.76%, 4=64.10%, 10=5.53%, 20=0.91%, 50=0.02%
lat (msec) : 100=0.02%, 250=0.33%
cpu : usr=6.67%, sys=28.33%, ctx=0, majf=0, minf=0
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.7%, >=64=99.1%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=99.2%, 8=0.5%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=0.1%
issued rwts: total=4135353,4134254,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=538MiB/s (565MB/s), 538MiB/s-538MiB/s (565MB/s-565MB/s), io=15.8GiB (16.9GB), run=30003-30003msec
WRITE: bw=538MiB/s (564MB/s), 538MiB/s-538MiB/s (564MB/s-564MB/s), io=15.8GiB (16.9GB), run=30003-30003msec
Linux
Same behaviour on Linux nodes with the same configuration (slightly higher speeds, but still well below the specified IOPS).
Linux VM, /dev/sda:
$ fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.16
Starting 4 processes
test: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [m(4)][100.0%][r=165MiB/s,w=166MiB/s][r=42.2k,w=42.4k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=5083: Mon Mar 11 14:39:24 2024
read: IOPS=41.3k, BW=161MiB/s (169MB/s)(4844MiB/30013msec)
slat (nsec): min=1794, max=7176.0k, avg=3447.61, stdev=14679.02
clat (usec): min=95, max=61056, avg=12404.20, stdev=5252.66
lat (usec): min=99, max=61062, avg=12407.73, stdev=5252.56
clat percentiles (usec):
| 1.00th=[ 1942], 5.00th=[ 4113], 10.00th=[ 6063], 20.00th=[ 8586],
| 30.00th=[ 9896], 40.00th=[10945], 50.00th=[11994], 60.00th=[12911],
| 70.00th=[14091], 80.00th=[16188], 90.00th=[19268], 95.00th=[22152],
| 99.00th=[27919], 99.50th=[30016], 99.90th=[34866], 99.95th=[36963],
| 99.99th=[42206]
bw ( KiB/s): min=126600, max=218720, per=99.99%, avg=165255.52, stdev=4775.38, samples=240
iops : min=31650, max=54680, avg=41313.80, stdev=1193.85, samples=240
write: IOPS=41.3k, BW=162MiB/s (169MB/s)(4848MiB/30013msec); 0 zone resets
slat (nsec): min=1864, max=10598k, avg=3637.47, stdev=15942.32
clat (usec): min=51, max=67756, avg=12358.53, stdev=5258.65
lat (usec): min=58, max=67765, avg=12362.26, stdev=5258.56
clat percentiles (usec):
| 1.00th=[ 1893], 5.00th=[ 4047], 10.00th=[ 5997], 20.00th=[ 8455],
| 30.00th=[ 9765], 40.00th=[10945], 50.00th=[11994], 60.00th=[12780],
| 70.00th=[14091], 80.00th=[16057], 90.00th=[19268], 95.00th=[22152],
| 99.00th=[27657], 99.50th=[29754], 99.90th=[34866], 99.95th=[36439],
| 99.99th=[41681]
bw ( KiB/s): min=125984, max=218256, per=99.99%, avg=165388.25, stdev=4803.97, samples=240
iops : min=31496, max=54564, avg=41346.98, stdev=1200.99, samples=240
lat (usec) : 100=0.01%, 250=0.03%, 500=0.08%, 750=0.10%, 1000=0.13%
lat (msec) : 2=0.76%, 4=3.70%, 10=26.61%, 20=60.16%, 50=8.42%
lat (msec) : 100=0.01%
cpu : usr=2.56%, sys=9.09%, ctx=855077, majf=0, minf=362
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=1240047,1241056,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=161MiB/s (169MB/s), 161MiB/s-161MiB/s (169MB/s-169MB/s), io=4844MiB (5079MB), run=30013-30013msec
WRITE: bw=162MiB/s (169MB/s), 162MiB/s-162MiB/s (169MB/s-169MB/s), io=4848MiB (5083MB), run=30013-30013msec
Disk stats (read/write):
sda: ios=1230258/1233558, merge=0/59, ticks=15245610/15213443, in_queue=30459065, util=99.74%
Linux VM, /dev/sdb:
$ sudo fio --name=test --filename=./test --rw=randrw --bs=4k --direct=1 --ioengine=libaio --iodepth=256 --size=30G --runtime=30 --numjobs=4 --group_reporting
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.16
Starting 4 processes
test: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [m(4)][100.0%][r=884MiB/s,w=884MiB/s][r=226k,w=226k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=5315: Mon Mar 11 14:44:04 2024
read: IOPS=227k, BW=888MiB/s (931MB/s)(26.0GiB/30006msec)
slat (nsec): min=1903, max=165258, avg=3956.31, stdev=2428.61
clat (usec): min=101, max=38970, avg=2278.11, stdev=1649.00
lat (usec): min=106, max=38974, avg=2282.14, stdev=1648.94
clat percentiles (usec):
| 1.00th=[ 1020], 5.00th=[ 1647], 10.00th=[ 1729], 20.00th=[ 1827],
| 30.00th=[ 1893], 40.00th=[ 1958], 50.00th=[ 2008], 60.00th=[ 2073],
| 70.00th=[ 2147], 80.00th=[ 2245], 90.00th=[ 2474], 95.00th=[ 3523],
| 99.00th=[ 7701], 99.50th=[15926], 99.90th=[22938], 99.95th=[23462],
| 99.99th=[24249]
bw ( KiB/s): min=678033, max=1066016, per=100.00%, avg=909662.47, stdev=17331.74, samples=240
iops : min=169508, max=266504, avg=227415.52, stdev=4332.94, samples=240
write: IOPS=227k, BW=888MiB/s (931MB/s)(26.0GiB/30006msec); 0 zone resets
slat (nsec): min=1993, max=297325, avg=4342.53, stdev=2561.83
clat (usec): min=48, max=38642, avg=2216.51, stdev=1645.97
lat (usec): min=52, max=38645, avg=2220.93, stdev=1645.91
clat percentiles (usec):
| 1.00th=[ 938], 5.00th=[ 1582], 10.00th=[ 1680], 20.00th=[ 1778],
| 30.00th=[ 1844], 40.00th=[ 1893], 50.00th=[ 1942], 60.00th=[ 2008],
| 70.00th=[ 2089], 80.00th=[ 2180], 90.00th=[ 2409], 95.00th=[ 3458],
| 99.00th=[ 7635], 99.50th=[15795], 99.90th=[22938], 99.95th=[23462],
| 99.99th=[24249]
bw ( KiB/s): min=680023, max=1069920, per=100.00%, avg=909058.05, stdev=17212.36, samples=240
iops : min=170005, max=267480, avg=227264.42, stdev=4303.11, samples=240
lat (usec) : 50=0.01%, 100=0.01%, 250=0.10%, 500=0.22%, 750=0.31%
lat (usec) : 1000=0.41%
lat (msec) : 2=52.39%, 4=42.75%, 10=3.12%, 20=0.34%, 50=0.36%
cpu : usr=9.50%, sys=50.41%, ctx=2411385, majf=0, minf=408
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=6823830,6819254,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=888MiB/s (931MB/s), 888MiB/s-888MiB/s (931MB/s-931MB/s), io=26.0GiB (27.9GB), run=30006-30006msec
WRITE: bw=888MiB/s (931MB/s), 888MiB/s-888MiB/s (931MB/s-931MB/s), io=26.0GiB (27.9GB), run=30006-30006msec
Disk stats (read/write):
sdb: ios=6820239/6815740, merge=0/236, ticks=14524924/14052098, in_queue=28577021, util=99.73%
Our assumption would be that the speeds of those two disks should be similar, if not identical, since the OS disk is placed on the temp disk.
Attached is an ARM deployment template that can be used to reproduce this. I attached the Linux one; if you like, I can also attach the Windows one. Make sure to look at parameters.json to adjust it, then apply with:
# make sure to adjust resource group and use an existing one.
az deployment group create --resource-group ephemeraltest --template-file template.json --parameters @parameters.json
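After deployment, the ephemeral placement can be double-checked with something like this (the VM name is whatever the template assigns; adjust accordingly):
$ az vm show --resource-group ephemeraltest --name <vm-name> --query "storageProfile.osDisk.diffDiskSettings"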
FYI, we have been in contact with Azure support for a few weeks about this issue. No update yet; we haven't had the easiest time convincing support that this actually is an issue.
Issue needing attention of @Azure/aks-leads