
K8s is not working when zvolumes are used

Open ashraffouda opened this issue 1 year ago • 3 comments

Describe the bug

Deployment of a k8s cluster is broken when zvolumes are used, while it works properly when zmounts are used. It gives this error:

[+] k3s: time="2024-05-12T12:01:06Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: \"overlayfs\" snapshotter cannot be enabled for \"/mnt/data/agent/containerd\", try using \"fuse-overlayfs\" or \"native\": failed to mount overlay: invalid argument"

To Reproduce

Deploy a k8s cluster with zvolumes.
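Alternatively, the failing mount can be reproduced directly inside an affected VM. Below is a minimal sketch that attempts the same overlay mount containerd does; it assumes the volume is mounted at /mnt/data (an illustrative path taken from the log above):

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

func main() {
	// Assumed path: /mnt/data is the virtiofs-backed volume inside the VM.
	base := "/mnt/data/overlay-test"
	lower := filepath.Join(base, "lower")
	upper := filepath.Join(base, "upper")
	work := filepath.Join(base, "work")
	merged := filepath.Join(base, "merged")
	for _, d := range []string{lower, upper, work, merged} {
		if err := os.MkdirAll(d, 0o755); err != nil {
			panic(err)
		}
	}
	opts := fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", lower, upper, work)
	// With a virtiofs upper layer this mount fails with EINVAL, which
	// containerd surfaces as "failed to mount overlay: invalid argument".
	if err := unix.Mount("overlay", merged, "overlay", 0, opts); err != nil {
		fmt.Println("overlay mount failed:", err)
		return
	}
	fmt.Println("overlay mount succeeded")
	unix.Unmount(merged, 0)
}

On a zmount-backed disk the same mount succeeds, which matches the observed behavior.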

ashraffouda avatar May 12 '24 14:05 ashraffouda

If the k8s flist uses the obsolete raw image, the first attached "disk" to the VM MUST be a zmount, not a volume. Extra "volumes" can be added to the VM. Then you can only mount them if the k8s image has the virtiofs module.

The right way to do this now is actually to modify the k8s image to use the preferred flist style with individual files.

muhamadazmy avatar May 14 '24 11:05 muhamadazmy

The k3s image is not a VM, as it doesn't ship a kernel. However, it turned out that overlayfs has some issues with virtiofs as the upper layer, and since container runtimes usually use overlayfs, basically none of them will work with the new Volumes.

There are kernel patches for running virtiofs with overlayfs, but I believe requiring these patches will make it harder for users to create custom images. So, we might need to revise the way Volumes work.

AbdelrahmanElawady avatar May 15 '24 14:05 AbdelrahmanElawady

So the incompatibility between virtiofs and overlayfs has been understood for a while in the context of running Docker inside micro VMs and trying to use the virtiofs-based rootfs for Docker's data dir. Docker tends to automatically fall back to the vfs driver and continue operating, but performance is very bad. Placing Docker's data dir on a disk (raw image type) fixes this (and conforms to the intended design of storing user data on a disk/volume). If we intend to deprecate that form in favor of the new virtiofs-based volume, then we won't have this workaround.

As suggested in the error message in the original post, using the fuse-overlayfs driver is another alternative. That probably performs better than vfs but is still going to be a performance hit compared to a non-FUSE driver. Maybe this is acceptable for many use cases, since performance-sensitive data can be stored in a volume attached to the container (container volumes don't use the same storage driver as the container rootfs).
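To illustrate that fallback, here is a hedged sketch of driving fuse-overlayfs from Go via its CLI. It assumes the fuse-overlayfs binary is installed in the VM image, and all paths are placeholders:

package main

import (
	"fmt"
	"os/exec"
)

// mountFuseOverlay shells out to the fuse-overlayfs binary, which performs
// the overlay merge in userspace and so sidesteps the kernel's restriction
// on virtiofs upper layers.
func mountFuseOverlay(lower, upper, work, merged string) error {
	opts := fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", lower, upper, work)
	out, err := exec.Command("fuse-overlayfs", "-o", opts, merged).CombinedOutput()
	if err != nil {
		return fmt.Errorf("fuse-overlayfs: %v: %s", err, out)
	}
	return nil
}

For k3s specifically, the error message already hints at the switch: k3s exposes a --snapshotter flag that can be set to fuse-overlayfs so its embedded containerd uses the FUSE driver instead of the kernel one.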

I reviewed the discussions around improving compatibility for virtiofs and overlayfs. For reference, this issue contains the best overview of the situation.

There are kernel patches for running virtiofs with overlayfs

It seems that these patches were merged into the mainline kernel as of 5.7. What we're missing are the other pieces of the puzzle, mentioned in this comment on the issue linked above:

# we absolutely need xattr and sys_admin cap
# allow_direct_io just seems sensible but is not required
# we had been using -o writeback which improved performance however users were reporting problems so removed it
virtio_fs_extra_args = ["-o", "xattr", "-o", "modcaps=+sys_admin", "-o", "allow_direct_io"]
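For context, a rough sketch of launching the C virtiofsd with those same flags from a Go supervisor; the binary location, socket path, and source directory are placeholders, and only the -o options mirror the quoted config:

package main

import (
	"log"
	"os/exec"
)

func main() {
	// Placeholder paths; the -o flags come from the Kata config quoted above.
	cmd := exec.Command("/usr/libexec/virtiofsd",
		"--socket-path=/run/virtiofsd.sock",
		"-o", "source=/var/lib/vm-volume",
		"-o", "xattr",              // required for overlayfs
		"-o", "modcaps=+sys_admin", // required for overlayfs; see the CAP_SYS_ADMIN question below
		"-o", "allow_direct_io",
	)
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	log.Println("virtiofsd started, pid", cmd.Process.Pid)
}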

Also:

One thing we've figured out (again with help from RHers above) is that to create an overlayfs in virtiofs your bottom layer must not also be overlay -- (e.g. it needs to be ext4, xfs, etc).

Based on my read of https://github.com/threefoldtech/zos/issues/1564, that suggests that our current implementation of rootfs is ruled out, since it's virtiofs backed by overlayfs, but we should be able to get this working for volumes, assuming they are just btrfs underneath.
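As a sanity check on that assumption, something like the following (run on the host side, where the volume source lives) reports the backing filesystem via statfs. The magic numbers come from the kernel's magic.h:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

const (
	overlayfsSuperMagic = 0x794c7630 // OVERLAYFS_SUPER_MAGIC
	btrfsSuperMagic     = 0x9123683e // BTRFS_SUPER_MAGIC
)

// backingFS reports whether path is served from overlayfs or btrfs, which
// is the distinction that matters per the comment quoted above.
func backingFS(path string) (string, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return "", err
	}
	switch uint32(st.Type) {
	case overlayfsSuperMagic:
		return "overlayfs (ruled out as a virtiofs source for overlay)", nil
	case btrfsSuperMagic:
		return "btrfs (should work as a bottom layer)", nil
	default:
		return fmt.Sprintf("other (0x%x)", st.Type), nil
	}
}

func main() {
	fs, err := backingFS("/") // placeholder path; point at the volume source
	if err != nil {
		panic(err)
	}
	fmt.Println(fs)
}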

One question then is whether it's acceptable to give CAP_SYS_ADMIN to virtiofsd.
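To see what a running virtiofsd actually holds, a small sketch like this reads its effective capability mask from procfs; the PID is a placeholder, and CAP_SYS_ADMIN is bit 21 in the kernel's capability ABI:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// hasSysAdmin checks bit 21 (CAP_SYS_ADMIN) in the CapEff line of
// /proc/<pid>/status.
func hasSysAdmin(pid int) (bool, error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return false, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		line := s.Text()
		if strings.HasPrefix(line, "CapEff:") {
			mask, err := strconv.ParseUint(
				strings.TrimSpace(strings.TrimPrefix(line, "CapEff:")), 16, 64)
			if err != nil {
				return false, err
			}
			const capSysAdmin = 21
			return mask&(1<<capSysAdmin) != 0, nil
		}
	}
	return false, s.Err()
}

func main() {
	ok, err := hasSysAdmin(1234) // placeholder PID of virtiofsd
	if err != nil {
		panic(err)
	}
	fmt.Println("CAP_SYS_ADMIN:", ok)
}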

scottyeager avatar Jul 30 '24 23:07 scottyeager

Has anyone played with this? Does anyone know the impact of CAP_SYS_ADMIN as noted above? Elsewhere it is implied that even after all this there is slower performance. I don't recall seeing an explanation for why, or for what can be done about it.

Is there anywhere better to start a discussion or to understand the state of this? A clear problem area or path to making this all work would help. Otherwise, trying to tackle this requires even knowing where to start looking.

boombatower avatar Sep 03 '25 02:09 boombatower

Hi @boombatower, so far we are recommending that anyone wanting good performance for overlayfs use a "disk", which just exposes a virtual block device to the VM. See my guide here for an example.
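A condensed sketch of that workaround, for reference: format the attached virtual block device and mount it at the runtime's data dir, so overlayfs sits on a regular filesystem rather than virtiofs. Device and paths are placeholders; the guide covers the full flow.

package main

import (
	"log"
	"os"
	"os/exec"

	"golang.org/x/sys/unix"
)

func main() {
	dev := "/dev/vda"             // placeholder: first attached disk in the VM
	dataDir := "/var/lib/docker"  // placeholder: runtime data dir

	// Format the disk (destructive; run once on a fresh disk only).
	if out, err := exec.Command("mkfs.ext4", "-q", dev).CombinedOutput(); err != nil {
		log.Fatalf("mkfs: %v: %s", err, out)
	}
	if err := os.MkdirAll(dataDir, 0o755); err != nil {
		log.Fatal(err)
	}
	// Mount it where the container runtime keeps its layers.
	if err := unix.Mount(dev, dataDir, "ext4", 0, ""); err != nil {
		log.Fatal(err)
	}
	log.Println("data dir backed by a block device; overlayfs will work here")
}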

I'm not aware of any further work on the virtiofs angle.

scottyeager avatar Sep 03 '25 06:09 scottyeager