K8s is not working when zvolumes are used
Describe the bug
Deployment of a k8s cluster is broken when zvolumes are used, while it works properly when zmounts are used. It gives this error:
[+] k3s: time="2024-05-12T12:01:06Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: \"overlayfs\" snapshotter cannot be enabled for \"/mnt/data/agent/containerd\", try using \"fuse-overlayfs\" or \"native\": failed to mount overlay: invalid argument"
To Reproduce
Deploy k8s cluster with zvolumes
If the k8s flist uses the obsolete raw image, the first "disk" attached to the VM MUST be a zmount, not a volume. Extra "volumes" can still be added to the VM, but you can only mount them if the k8s image's kernel has the virtiofs module.
The right way to do this now is actually to modify the k8s image to use the preferred flist style with individual files.
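For reference, checking whether the guest kernel has virtiofs support (needed before any extra volume can be mounted, as noted above) looks roughly like this; the tag name vol0 and the mount point are placeholders, not what zos actually assigns:

# virtiofs may be built into the guest kernel or available as a module.
grep -qw virtiofs /proc/filesystems || modprobe virtiofs
# If support is there, a volume exposed by the host under a tag can be mounted:
mkdir -p /mnt/vol0
mount -t virtiofs vol0 /mnt/vol0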
The k3s image is not a VM image, as it doesn't ship a kernel. However, it turned out that overlayfs has some issues with virtiofs as the upper layer, and since container runtimes usually use overlayfs, basically none of them will work with the new Volumes.
There are kernel patches for running virtiofs with overlayfs, but I believe requiring these patches will make it harder for users to create custom images. So, we might need to revise the way Volumes work.
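For illustration, a rough way to reproduce the failure from inside the guest, assuming the virtiofs-backed volume is mounted at /mnt/data (the paths are placeholders):

# Prepare overlay directories on the virtiofs-backed volume.
mkdir -p /mnt/data/lower /mnt/data/upper /mnt/data/work /mnt/data/merged
# Without the newer kernel and virtiofsd pieces discussed further down, this
# fails with "invalid argument", which is exactly what the containerd
# overlayfs snapshotter runs into.
mount -t overlay overlay \
  -o lowerdir=/mnt/data/lower,upperdir=/mnt/data/upper,workdir=/mnt/data/work \
  /mnt/data/merged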
So the incompatibility between virtiofs and overlayfs has been understood for a while in the context of running Docker inside micro VMs and trying to use the virtiofs-based rootfs for Docker's data dir. Docker tends to automatically fall back to the vfs driver and continue operating, but performance is very bad. Placing Docker's data dir on a disk (raw image type) fixes this (and conforms to the intended design of storing user data on a disk/volume). If we intend to deprecate that form in favor of the new virtiofs-based volume, then we won't have this workaround.
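As a sketch of that workaround, assuming the attached disk shows up in the guest as /dev/vdb and Docker uses its default data dir (both are assumptions here):

# Format the disk and put Docker's data dir on it, so overlay2 runs on a
# normal block-backed filesystem instead of virtiofs.
mkfs.ext4 /dev/vdb
mkdir -p /var/lib/docker
mount /dev/vdb /var/lib/docker
systemctl restart docker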
As suggested in the error message in the original post, using the fuse-overlayfs driver could be another alternative. That probably has better performance than vfs, but it is still going to be a performance hit compared to a non-FUSE driver. Maybe this is acceptable for many use cases, since performance-sensitive data can be stored in a volume attached to the container (container volumes don't use the same storage driver as the container rootfs).
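For k3s specifically, switching the snapshotter is just a flag, as the error message hints; a sketch assuming the fuse-overlayfs binary is available in the image and that /mnt/data is the data dir from the original report:

# FUSE-based copy-on-write instead of kernel overlayfs:
k3s server --data-dir /mnt/data --snapshotter fuse-overlayfs
# Or plain copies with no copy-on-write at all (slower, more disk usage):
k3s server --data-dir /mnt/data --snapshotter native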
I reviewed the discussions around improving compatibility for virtiofs and overlayfs. For reference, this issue contains the best overview of the situation.
There are kernel patches for running virtiofs with overlayfs
It seems that these patches were merged into the mainline kernel as of 5.7. What we're missing are the other pieces of the puzzle mentioned in this comment on the issue linked above:
# we absolutely need xattr and sys_admin cap
# allow_direct_io just seems sensible but is not required
# we had been using -o writeback which improved performance however users were reporting problems so removed it
virtio_fs_extra_args = ["-o", "xattr", "-o", "modcaps=+sys_admin", "-o", "allow_direct_io"]
Also:
One thing we've figured out (again with help from RHers above) is that to create an overlayfs in virtiofs your bottom layer must not also be overlay -- (e.g. it needs to be ext4, xfs, etc).
Based on my reading of https://github.com/threefoldtech/zos/issues/1564, that suggests our current implementation of rootfs is ruled out, since it's virtiofs backed by overlayfs, but we should be able to get this working for volumes, assuming they are just btrfs underneath.
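A quick way to sanity-check both sides of that assumption (the paths are placeholders):

# On the zos host: the directory backing the volume should be btrfs, not
# another overlay.
findmnt -n -o FSTYPE --target /path/to/volume/backing/dir
# Inside the guest: the attached volume should show up as virtiofs.
findmnt -n -o FSTYPE --target /mnt/vol0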
One question then is whether it's acceptable to give CAP_SYS_ADMIN to virtiofsd.
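Concretely, granting that would look something like the following virtiofsd invocation (a sketch assuming the C/QEMU virtiofsd that the -o style options above belong to; the socket path and shared directory are placeholders):

# Share the volume with xattr support and CAP_SYS_ADMIN added to the
# daemon's capability set, plus direct I/O as in the quoted config.
virtiofsd --socket-path=/run/virtiofsd-vol0.sock \
  -o source=/path/to/volume \
  -o xattr -o modcaps=+sys_admin -o allow_direct_io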
Has anyone played with this? Does anyone know the impact of CAP_SYS_ADMIN as noted above? Elsewhere it is implied that even after all this, performance is still slower. I don't recall seeing an explanation for why, or what can be done about it.
Is there anywhere better to start a discussion or to understand the state of this? A clear problem statement or a path to making this all work would help. Otherwise, it's hard to even know where to start looking.
Hi @boombatower, so far we are recommending that anyone wanting good performance for overlayfs use a "disk" which just exposes a virtual block device to the VM. See my guide here for example.
I'm not aware of any further work on the virtiofs angle.