ZFS module/units fail to load on boot with Beta 4426.1.0
Description
Unlike the current Stable, Beta 4426.1.0 fails to load the ZFS kernel module on boot (when pool devices are available), resulting in ZFS units being skipped and no pools mounted.
Impact
ZFS pools won't be re-mounted automatically after a reboot, which is unexpected behavior.
Environment and steps to reproduce
- Given the following Butane config to create a ZFS pool and dataset on first boot:
```yaml
variant: flatcar
version: 1.1.0
storage:
  files:
    - path: /etc/flatcar/enabled-sysext.conf
      mode: 0644
      contents:
        inline: |
          zfs
systemd:
  units:
    - name: format-zfs.service
      enabled: true
      contents: |
        [Unit]
        ConditionFirstBoot=1
        Before=first-boot-complete.target
        Wants=first-boot-complete.target
        [Service]
        Type=oneshot
        ExecStart=zpool create zdata /dev/vdb
        ExecStart=zfs create -o mountpoint=/zfs-test zdata/zfs-test
        [Install]
        WantedBy=multi-user.target
```
- Compile to `ignition.json` and test using QEMU:
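The compile step isn't shown in the report; a minimal sketch, assuming the Butane config above is saved as `config.bu` and the `butane` binary is installed (the guard just makes the snippet a no-op where it isn't):

```shell
# Compile the Butane config (saved here as config.bu) to Ignition JSON.
# --strict makes butane fail on unknown or misindented fields.
if command -v butane >/dev/null 2>&1; then
  butane --strict config.bu > ignition.json
fi
```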
```shell
# Create an extra blank file device:
qemu-img create -f qcow2 zfs-disk.qcow2 100M
# Start Flatcar
flatcar_production_qemu.sh -i ignition.json -- \
  -nographic \
  -drive file=zfs-disk.qcow2,if=virtio,format=qcow2
```
- Both Stable and Beta 4426.1.0 will create the `zdata` pool with the `zfs-test` dataset mounted at `/zfs-test` on first boot, as the ZFS module is dynamically loaded when `zfs`/`zpool` commands are run. This can be verified with:

```shell
zfs list
df /zfs-test
```
- Now reboot the Flatcar VM:
  - Stable will load the ZFS module automatically via udev seeing ZFS devices, and the ZFS units run, resulting in `/zfs-test` being mounted.
  - Beta 4426.1.0 will fail to load the ZFS module, with the following logged:

```
Oct 08 17:25:18 localhost (udev-worker)[1754]: vdb1: Process '/sbin/modprobe zfs' failed with exit.
```

All ZFS units will then be skipped due to an unmet condition (`ConditionPathIsDirectory=/sys/module/zfs`), so the ZFS module is not loaded and `/zfs-test` is not automatically mounted. You can verify with `df /zfs-test`, or see the failures via `journalctl | grep zfs`. Running a manual `zfs list` will trigger dynamic loading of the ZFS module and mount the dataset again.
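The condition the ZFS units gate on can be checked directly, sketched as plain shell (works on any Linux host; `/sys/module/zfs` only exists while the module is loaded):

```shell
# The ZFS units use ConditionPathIsDirectory=/sys/module/zfs; this
# directory is present exactly when the zfs kernel module is loaded.
if [ -d /sys/module/zfs ]; then
  echo "zfs module: loaded"
else
  echo "zfs module: not loaded"
fi
```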
Expected behavior
Beta should automatically load the ZFS module and run the ZFS units when ZFS devices are present.
Additional information
It's strange: it won't load on boot, and the error reappears if you run `sudo udevadm trigger -s block`, but running `/sbin/modprobe zfs` manually works fine.
Figured it out. It's the `SystemCallFilter=` on the `systemd-udevd.service` unit. Adding this drop-in works around it:

```ini
# /etc/systemd/system/systemd-udevd.service.d/syscall.conf
[Service]
SystemCallFilter=
```
We need to figure out what syscalls are required for doing our modules overlay trick.
Thank you both for the report and investigation (and beta testing)!
I've just tested it, and we need to add the `@mount` syscall group to make it work (in addition to the sets already allowed in `systemd-udevd.service`: `@system-service @module @raw-io bpf`).
It's probably cumulative, so I think you can just add `@mount` in a drop-in.

I've just opened a PR: https://github.com/flatcar/scripts/pull/3367. I've done local testing and it seems to work; we should also add a test case.
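For reference, the `@mount` group can be expanded into its member syscalls with `systemd-analyze` (assuming a host with systemd; the exact list varies by systemd version, so no output is shown):

```shell
# Print the syscalls contained in the @mount seccomp group.
# Guarded so the snippet is a no-op where systemd-analyze is unavailable.
if command -v systemd-analyze >/dev/null 2>&1; then
  systemd-analyze syscall-filter @mount
fi
```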
Hi, maybe we are missing something that needs to happen after an upgrade?
```
coreos-prd1-mysql-px-c ~ # cat /etc/os-release
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=4459.2.0
VERSION_ID=4459.2.0
BUILD_ID=2025-11-10-1432
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 4459.2.0 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar.org/"
BUG_REPORT_URL="https://issues.flatcar.org"
FLATCAR_BOARD="amd64-usr"
CPE_NAME="cpe:2.3:o:flatcar-linux:flatcar_linux:4459.2.0:*:*:*:*:*:*:*"
coreos-prd1-mysql-px-c ~ # ls -al /usr/lib/systemd/system/systemd-udevd.service.d/
total 2
drwxr-xr-x. 2 root root 34 Nov 10 15:32 .
drwxr-xr-x. 1 root root 44 Jan  1  1970 ..
-rw-r--r--. 1 root root 36 Nov 10 15:32 10-zfs.conf
coreos-prd1-mysql-px-c ~ #
```
The `flatcar.conf` file in `/usr/lib/systemd/system/systemd-udevd.service.d/` from PR https://github.com/flatcar/scripts/pull/3367 is missing, and therefore we are missing our ZFS pools in production.
Do we need to run something after an upgrade to get that sysext up to date?
Thanks Rainer
The fix hasn't been backported to stable (or beta) yet. I will do that today for the next release.
So we need to halt the rollout of the "stable" release, roll back to the previous version, and disable auto-updates forever; is this the correct way to move forward?
This contradicts https://www.flatcar.org/releases:
> The Stable channel is intended for use in production clusters. Versions of Flatcar Container Linux have been tested as they move through Alpha and Beta channels before being promoted to stable.
Would it be safer for us to move to LTS for production systems?
This fix should have been backported before the last set of releases, and I am also disappointed that this didn't happen. Lesson learned.
Our test suite does cover ZFS for every release, but unfortunately, it didn't catch this particular issue. We will be extending the test suite so that it does.
I have now backported the fix to 4459 for the next stable and beta releases. We have decided to expedite these releases, so you should see them on Monday 24th.
I don't know exactly how you manage your deployments, but if you have already rolled back, you can pause automatic updates until then. If you don't want to wait that long, you can manually add the following at `/etc/systemd/system/systemd-udevd.service.d/zfs-udevd-hotfix.conf` before rebooting into the broken release. You can remove it after upgrading to the next release, but you don't have to.
```ini
[Service]
SystemCallFilter=@mount
```
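After adding the drop-in (and after a `systemctl daemon-reload` or a reboot), one way to check that udevd picked it up is to read the unit's effective `SystemCallFilter` property; a sketch, assuming a running systemd:

```shell
# Show the effective syscall filter of the udevd unit; once the drop-in is
# applied it should include @mount. Guarded for hosts without systemd.
if command -v systemctl >/dev/null 2>&1; then
  systemctl show systemd-udevd.service -p SystemCallFilter || true
fi
```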
You are free to switch to LTS if you'd prefer, but what you have seen is not the level of stability we strive for.
Now I am just interested in whether we are the only ones affected by this issue using ZFS with Flatcar in production. If so, we might just have to transition to Btrfs because of its likely higher popularity...
Anyone else affected by this issue please use the thumbs up reaction button on this post.
@stumbaumr we aren't in production yet, but will be using ZFS. We do plan on having dev/staging servers here on beta though to keep an eye on things.
@Codelica the issue was caught in Beta, but the process took too long and it was not fixed before the Stable release went out. So your plan would not have helped you here... We used to run Beta as well on a test cluster, but only by using it can you identify issues. We are just a small group of people; there is no one available to operate/fight systems on a Beta level.