
ZOS disastrous performance on PCIe 4 NVMe SSD

Open archit3kt opened this issue 4 years ago • 12 comments

Hello, following this forum post, which has had no news for a month, I thought it would be a better idea to create this issue here.

Quick summary: ZOS has a terrible PCIe 4 SSD performance issue. Here are some fio test results on the current ZOS:

  • Random read 4k blocks: 12.4 MB/s, 4142 IOPS
  • Random write 4k blocks: 13.3 MB/s, 4489 IOPS
  • Sequential read 2MB blocks: 1316 MB/s, 864 IOPS
  • Sequential write 2MB blocks: 2326 MB/s, 1528 IOPS

I ran the same tests on the same machine with Ubuntu 20 and kernel 5.4: same results.

Fortunately, performance is very good on Ubuntu when switching to 5.10.x kernels :+1:

  • Random read 4k blocks: 1855 MB/s, 488 000 IOPS
  • Random write 4k blocks: 563 MB/s, 144 000 IOPS
  • Sequential read 2MB blocks: 6728 MB/s, 3360 IOPS
  • Sequential write 2MB blocks: 6271 MB/s, 3132 IOPS
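
For reference, the block sizes above map onto fio invocations along these lines (a sketch only; the exact job options used in these runs are not shown in the thread, and /data/fio.test is a placeholder target):

    # 4k random read / write (illustrative options)
    fio --name=randread  --filename=/data/fio.test --rw=randread  --bs=4k --size=4G \
        --ioengine=libaio --iodepth=64 --direct=1 --runtime=60 --time_based
    fio --name=randwrite --filename=/data/fio.test --rw=randwrite --bs=4k --size=4G \
        --ioengine=libaio --iodepth=64 --direct=1 --runtime=60 --time_based

    # 2MB sequential read / write
    fio --name=seqread  --filename=/data/fio.test --rw=read  --bs=2M --size=4G \
        --ioengine=libaio --iodepth=8 --direct=1 --runtime=60 --time_based
    fio --name=seqwrite --filename=/data/fio.test --rw=write --bs=2M --size=4G \
        --ioengine=libaio --iodepth=8 --direct=1 --runtime=60 --time_based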

This answer was given to me:

"It's not kernel related. If you run fio on the root filesystem of the container, you hit 0-fs, which is not made to be fast, especially for random read/write."

I get it, 0-fs is not meant to be fast, but being this slow would still be a big problem for a machine that only has one container running... I tried to deploy a flist with a data container mounted at /data and ran fio again; the results were strictly identical. I'm pretty sure the issue is kernel related.

Could you have a look, please? I cannot start hosting production workloads with such terrible IO performance...

archit3kt avatar Nov 03 '21 12:11 archit3kt

New tests made on zos v3.0.1-rc3: better, but still way below what I should get.

Tests were done on the rootfs of an Ubuntu zMachine:

  • Random read 4k blocks: 790 MB/s, 200 000 IOPS
  • Random write 4k blocks: 116 MB/s, 30 000 IOPS
  • Sequential read 2MB blocks: 1850 MB/s, 900 IOPS
  • Sequential write 2MB blocks: 900 MB/s, 450 IOPS

Note the performance regression on sequential write...

Tests on disks added to the zMachine and mounted at /data:

  • Random read 4k blocks: 630 MB/s, 160 000 IOPS
  • Random write 4k blocks: 190 MB/s, 50 000 IOPS
  • Sequential read 2MB blocks: 1200 MB/s, 600 IOPS
  • Sequential write 2MB blocks: 290 MB/s, 140 IOPS

It doesn't make sense! If an added disk is supposed to get the native NVMe SSD performance, there is clearly a problem somewhere! Could someone please explain how the storage framework in zos v3 works?

archit3kt avatar Nov 07 '21 10:11 archit3kt

@maxux please take a look at it

xmonader avatar Nov 07 '21 10:11 xmonader

I tried to deploy a flist with a data container mounted at /data and ran fio again; the results were strictly identical. I'm pretty sure the issue is kernel related.

Just to be clear about this part: do you mean that you mounted a volume at /data inside the container and then ran the fio tests on that /data location?

muhamadazmy avatar Nov 08 '21 08:11 muhamadazmy

In V3 all container workloads are virtualized, which means all IO actually goes through the virtio driver. This explains the drop in performance.

What happens behind the scenes in V3:

  • When creating a zmount (or a disk), a raw disk file is allocated on the SSD (the disk is formatted as btrfs)
  • The disk is then attached to the cloud-hypervisor process as a raw disk
  • The disk is auto-mounted at the location configured in your deployment
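
A rough shell sketch of what that provisioning amounts to on the host (paths, sizes and the --disk wiring are illustrative only; zos drives this from its own code, not from a shell script):

    # on the ZOS host: allocate a raw disk file on the btrfs-formatted SSD pool
    truncate -s 100G /mnt/ssd-pool/vm-disks/disk-0.raw
    # attach it to the zmachine's cloud-hypervisor process as a raw block device
    cloud-hypervisor ... --disk path=/mnt/ssd-pool/vm-disks/disk-0.raw
    # inside the VM the guest kernel sees the virtual disk (e.g. /dev/vdX),
    # puts btrfs on it and auto-mounts it at the configured location (e.g. /data)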

So IO operations go through these steps:

  • In the VM (the cloud-hypervisor process), the operation is handled by the btrfs module in the VM kernel
  • The disk IO operation then goes through the VirtIO driver to the host machine (ZOS)
  • The write operation is then handled again by the btrfs driver on the host
  • Finally, it is written to the physical disk

Of course there is a lot of room for improvement, for example using logical volumes on the host so that write operations are sent directly to the physical disk instead of through another btrfs layer.
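
A minimal sketch of that suggestion, assuming LVM on the host (hypothetical; zos does not do this today, and the volume group / LV names are made up). The logical volume is handed straight to cloud-hypervisor, so guest writes bypass the host btrfs layer:

    # hypothetical host-side setup, for illustration only
    pvcreate /dev/nvme0n1
    vgcreate vmpool /dev/nvme0n1
    lvcreate -L 100G -n disk-0 vmpool
    # pass the LV to the zmachine's VM as its raw disk instead of a file on btrfs
    cloud-hypervisor ... --disk path=/dev/vmpool/disk-0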

muhamadazmy avatar Nov 08 '21 08:11 muhamadazmy

Just to be clear about this part: do you mean that you mounted a volume at /data inside the container and then ran the fio tests on that /data location?

Yes, you got it

Of course there is a lot of room for improvement, for example using logical volumes on the host so that write operations are sent directly to the physical disk instead of through another btrfs layer.

Thanks for the explanation. Indeed, the architectural choice you made is not the best for IO performance! It would be great to allow logical volume creation and mounting inside the VMs (at least for power users who'd like to get all the performance out of their hardware). I would be glad to be a tester for this use case!

In V3 all container workloads are virtualized, which means all IO actually goes through the virtio driver. This explains the drop in performance.

If I understand correctly, every ZOS deployment will be a VM in v3 (like k3s), and containers should be deployed inside the virtualized k3s?

archit3kt avatar Nov 08 '21 10:11 archit3kt

If I understand correctly, every ZOS deployment will be a VM in v3 (like k3s), and containers should be deployed inside the virtualized k3s?

Yes, ZOS has a unified workload type called ZMACHINE, which is always (under the hood) a VM. If your flist is a container (let's say an ubuntu flist), we inject a custom-built kernel+initramfs and still start the "container" as a full VM. This ensures 100% separation from the ZOS host, and control over the amount of CPU and memory allocated to your resource. The user can still perfectly access and run his processes inside this "container" normally.

When you start a k8s node on zos, it's basically a well-crafted "flist" with k8s configured and ready to start. For ZOS it's just another VM that it runs the same way as a container (this makes the code much simpler).

muhamadazmy avatar Nov 08 '21 10:11 muhamadazmy

Which image do you run exactly? The default zos runs a 5.4 kernel; there is also a 5.10 available. Can you give me the node id?

maxux avatar Nov 08 '21 12:11 maxux

My first post was done with kernel 5.4 on grid v2. My second post was done with the latest zos for grid v3; I saw kernel 5.12 inside the VMs.

node id is 68, IP is 2a02:842a:84c8:c601:d250:99ff:fedf:924d (ICMP is blocked, but IPv6 firewall allows everything else)

archit3kt avatar Nov 08 '21 13:11 archit3kt

I confirm, your node is running the 5.10.55 kernel, which is the latest we officially support. The limitation is probably the VM, like Azmy said.

maxux avatar Nov 08 '21 15:11 maxux

FYI, I automated my fio tests and launched them simultaneously on X Ubuntu VMs.

With 4 VMs, each VM gets exactly the same results as a run with only 1 VM.

I see per-VM performance degradation when I launch the test on 8 VMs.

My guess is that it is a virtio limitation; it would be good to know if you make some performance tweaks someday.

Still, sequential write is disastrous with virtio, and I don't have a clue why...
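
For what it's worth, a sketch of how such a simultaneous run can be scripted (assuming SSH access to each VM, fio installed in the guests, and placeholder hostnames; this is not the exact automation used):

    # run the same fio job on several VMs at once, then wait for all of them
    for vm in vm1 vm2 vm3 vm4; do
        ssh "$vm" fio --name=seqwrite --filename=/data/fio.test --rw=write \
            --bs=2M --size=4G --ioengine=libaio --iodepth=8 --direct=1 &
    done
    wait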

archit3kt avatar Nov 24 '21 19:11 archit3kt

This will have to wait; we have other things to do first.

despiegk avatar Feb 07 '22 04:02 despiegk

Hello Team, can we have an update on this, please?

amandacaster avatar Mar 01 '23 07:03 amandacaster