Disastrous ZOS performance on PCIe 4 NVMe SSD
Hello, following this forum post with no news for a month, I thought it would be better to create this issue here.
Quick summary: ZOS has a terrible PCIe 4 SSD performance issue. Here are some fio test results on the current ZOS:
- Random read 4k blocks: 12.4 MB/s, 4142 IOPS
- Random write 4k blocks: 13.3 MB/s, 4489 IOPS
- Sequential read 2MB blocks: 1316 MB/s, 864 IOPS
- Sequential write 2MB blocks: 2326 MB/s, 1528 IOPS
I made the same tests on the same machine with Ubuntu 20 and kernel 5.4: same results.
Fortunately, performance is very good on Ubuntu when switching to a 5.10.x kernel :+1:
- Random read 4k blocks: 1855 MB/s, 488 000 IOPS
- Random write 4k blocks: 563 MB/s, 144 000 IOPS
- Sequential read 2MB blocks: 6728 MB/s, 3360 IOPS
- Sequential write 2MB blocks: 6271 MB/s, 3132 IOPS
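For reference, numbers of this shape usually come from fio invocations along the following lines; the exact parameters (queue depth, file size, runtime, target directory) are an assumption here, since the original job files are not shown:

```bash
# Random read / write, 4k blocks (aimed at the filesystem under test, e.g. /data)
fio --name=randread  --directory=/data --rw=randread  --bs=4k --size=2G \
    --ioengine=libaio --iodepth=64 --direct=1 --runtime=60 --time_based
fio --name=randwrite --directory=/data --rw=randwrite --bs=4k --size=2G \
    --ioengine=libaio --iodepth=64 --direct=1 --runtime=60 --time_based

# Sequential read / write, 2MB blocks
fio --name=seqread   --directory=/data --rw=read  --bs=2M --size=4G \
    --ioengine=libaio --iodepth=16 --direct=1 --runtime=60 --time_based
fio --name=seqwrite  --directory=/data --rw=write --bs=2M --size=4G \
    --ioengine=libaio --iodepth=16 --direct=1 --runtime=60 --time_based
```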
This is the answer I was given:
"It’s not kernel related, if you run fio on your root filesystem of the container, you hit 0-fs , which is not made to be fast, specially for random read/write.
I got it, 0-fs is not meant to be fast, but being this slow would still be a big problem for a computer that only has one container running... I tried to deploy a flist with a data container mounted at /data and ran fio again; the results were strictly identical. I'm pretty sure the issue is kernel related.
Could you have a look, please? I cannot start hosting production workloads with such terrible IO performance...
New tests made on zos v3.0.1-rc3: better, but still way below what I should get.
Tests done on the rootfs of an Ubuntu zMachine:
- Random read 4k blocks: 790 MB/s, 200 000 IOPS
- Random write 4k blocks: 116 MB/s, 30 000 IOPS
- Sequential read 2MB blocks: 1850 MB/s, 900 IOPS
- Sequential write 2MB blocks: 900 MB/s, 450 IOPS
Note the performance regression on sequential write...
Tests on disks added to the zMachine and mounted on /data:
- Random read 4k blocks: 630 MB/s, 160 000 IOPS
- Random write 4k blocks: 190 MB/s, 50 000 IOPS
- Sequential read 2MB blocks: 1200 MB/s, 600 IOPS
- Sequential write 2MB blocks: 290 MB/s, 140 IOPS
It doesn't make sense! If an added disk is supposed to get native NVMe SSD performance, there is clearly a problem somewhere! Could someone please explain how the storage framework in zos v3 works?
@maxux please take a look at it
I tried to deploy a flist with a data container mounted at /data and ran fio again; the results were strictly identical. I'm pretty sure the issue is kernel related.
Just to be clear about this part, what you mean is that you mounted a volume under the container at /data, then ran the fio tests on this location /data?
For V3 all container workloads are virtualized, which means all IO actually goes through the virtio driver. This explains the drop in performance.
What happens behind the scenes for V3:
- When creating a zmount (or a disk), a raw disk file is allocated on SSD (the disk is formatted as btrfs)
- The disk is then attached to the cloud-hypervisor process as a raw disk
- The disk is auto-mounted on the configured location in your deployment
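A minimal sketch of that flow, with illustrative paths, sizes and options (these are not the exact commands ZOS runs internally):

```bash
# On the host (ZOS): the storage pool is a btrfs filesystem on the NVMe SSD;
# the zmount is just a raw file allocated on it and formatted as btrfs.
truncate -s 50G /mnt/pool/vm-volume.raw
mkfs.btrfs -f /mnt/pool/vm-volume.raw

# The raw file is handed to cloud-hypervisor as a virtio-blk disk; inside the
# zmachine it shows up as /dev/vdX and gets mounted on the configured path.
cloud-hypervisor \
    --kernel /path/to/vmlinux \
    --cmdline "console=ttyS0 root=/dev/vda" \
    --disk path=/mnt/pool/rootfs.raw path=/mnt/pool/vm-volume.raw \
    --cpus boot=2 --memory size=2048M

# Inside the zmachine (e.g. at /data, per the deployment):
#   mount /dev/vdb /data
```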
So IO operations go through the following steps:
- In the VM (the cloud-hypervisor process), the operation is handled by the btrfs module in the VM kernel
- The disk IO operation then goes through the VirtIO driver to the host machine (ZOS)
- The write operation is then handled again by the btrfs driver on the host
- Then, at the end, it is written to the physical disk
Of course there is a lot of room for improvement, for example using logical volumes on the host so write operations on the host are sent directly to the physical disk, not to another btrfs layer.
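For what it's worth, a rough sketch of what that logical-volume variant could look like on the host (device and volume-group names are made up; this is only the idea, not an existing ZOS feature):

```bash
# Carve a logical volume straight out of the NVMe device instead of
# allocating a raw file on top of the host btrfs pool.
pvcreate /dev/nvme0n1
vgcreate zos-pool /dev/nvme0n1
lvcreate -L 50G -n vm-volume zos-pool

# Hand the block device directly to the VM, so guest writes skip the
# host-side btrfs layer entirely.
cloud-hypervisor \
    --kernel /path/to/vmlinux \
    --cmdline "console=ttyS0 root=/dev/vda" \
    --disk path=/mnt/pool/rootfs.raw path=/dev/zos-pool/vm-volume \
    --cpus boot=2 --memory size=2048M
```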
Just to be clear about this part, what you mean is that you mounted a volume under the container at /data, then ran the fio tests on this location /data?
Yes, you got it
Of course there is a lot of room for improvement, for example using logical volumes on the host so write operations on the host are sent directly to the physical disk, not to another btrfs layer.
Thanks for the explanation. Indeed, the architectural choice you made is not the best for IO performance! It would be great to allow logical volume creation and mounting inside the VMs (at least for power users who'd like to get all the performance from their hardware). I would be glad to be a tester for this use case!
For V3 all container workloads are virtualized, which means all IO actually goes through the virtio driver. This explains the drop in performance.
If I get it correctly, every ZOS deployment in v3 will be a VM (like k3s), and containers should be deployed inside the virtualized k3s?
If I get it correctly, every ZOS deployment in v3 will be a VM (like k3s), and containers should be deployed inside the virtualized k3s?
Yes, ZOS has a unified workload type called ZMACHINE, which is always (under the hood) a VM. If your flist is a container (let's say an ubuntu flist), we inject a custom-built kernel+initramfs and still start the "container" as a full VM. This ensures 100% separation from the ZOS host, and control over the amount of CPU+memory allocated to your resource. The user can still perfectly access and run his processes inside this "container" normally.
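As a quick sanity check (not something mentioned above), from inside such a "container" you can confirm that it really is a full VM:

```bash
# Run inside the zmachine:
uname -r               # shows the injected kernel build, not the host's kernel
systemd-detect-virt    # should report a hypervisor (typically "kvm"), not a container runtime
```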
When you start a k8s node on zos, it's basically a well-crafted "flist" with k8s properly configured and ready to start. For ZOS it's just another VM that it runs the same way as a container (this makes the code much simpler).
Which image do you run exactly? Default zos runs a 5.4 kernel; there is also a 5.10 available. Can you give me the node id?
My first post was done with kernel 5.4 on grid v2. My second post was done with the latest zos for grid v3; I saw kernel 5.12 inside the VMs.
node id is 68, IP is 2a02:842a:84c8:c601:d250:99ff:fedf:924d (ICMP is blocked, but IPv6 firewall allows everything else)
I confirm, your node is running the 5.10.55 kernel, which is the latest we officially support. The limitation is probably the VM, like Azmy said, yep.
FYI I automated my fio tests and launched them simultaneously on X Ubuntu VMs.
With 4 VMs, each one gets exactly the same results as a run with only 1 VM.
I see per-VM performance degradation when I launch the test on 8 VMs.
My guess is that it's a virtio limitation; it would be good to know if you make some performance tweaks someday.
Still, sequential write is disastrous with virtio, and I don't have a clue why...
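The automation is essentially a loop that fires the same fio job on every VM at once, roughly like this (hostnames and job parameters are placeholders, not the exact script):

```bash
#!/bin/bash
# Launch the same fio job on several zmachines in parallel and wait for all of them.
VMS=(vm1 vm2 vm3 vm4 vm5 vm6 vm7 vm8)   # placeholder hostnames / IPs

for vm in "${VMS[@]}"; do
    ssh "root@${vm}" \
        "fio --name=randwrite --directory=/data --rw=randwrite --bs=4k --size=2G \
             --ioengine=libaio --iodepth=64 --direct=1 --runtime=60 --time_based \
             --output-format=json --output=/root/fio-result.json" &
done
wait   # all VMs run the benchmark simultaneously
```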
This will have to wait, we have other things to do first.
Hello Team, can we have an update on this, please?