core icon indicating copy to clipboard operation
core copied to clipboard

zpool cache not respected by opnsense

Open nt1 opened this issue 1 year ago • 6 comments

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

  • [x] I have read the contributing guide lines at https://github.com/opnsense/src/blob/master/CONTRIBUTING.md
  • [x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/src/issues?q=is%3Aissue

Describe the bug

zpools other than zroot present in zpool cache (/etc/zfs/zpool.cache or /boot/zfs/zpool.cache) are not imported at boot time in OPNsense 24.1.9.

To Reproduce

Steps to reproduce the behavior:

  1. Access command line of device and get a root shell.
  2. Create a new zpool.
  3. Export the zpool (zpool export <NAME>)
  4. Import the zpool (zpool import <NAME>)
  5. Verify the zpool exists in zpool cache (zdb -U /etc/zfs/zpool.cache)
  6. Reboot the system.
  7. When the system is back up, access the command line and check if the new zpool was imported (zpool list). Notice only zroot exists.

Expected behavior

Upon reboot, the newly created zpool was imported.

Describe alternatives you considered

It is possible to manually import the zpool using zpool import after the reboot, but this doesn't really scale.

I was able to get the pool to import as expected by adding zfs_enable="YES" to /etc/rc.conf which allowed /etc/rc.d/zfs to start at boot per expectations. I notice opnsense sets this variable in /usr/local/etc/rc.loader.d/20-zfs, perhaps this variable isn't making it to freebsd's rc.d?

Screenshots

If applicable, add screenshots to help explain your problem.

Relevant log files

If applicable, information from log files supporting your claim.

Additional context

Add any other context about the problem here.

Environment

Software version used and hardware type if relevant, e.g.:

OPNsense 24.1.9

nt1 avatar Jun 23 '24 18:06 nt1

The code you may want to improve is this:

https://github.com/opnsense/core/blob/0f73da02ad205aba1be2d7928c72fb8805b34eb4/src/etc/rc#L153-L160

Do not confuse _load with _enable vars. We don’t use the RC subsystem that much as it tends to interfere with the boot sequence.

fichtner avatar Jun 23 '24 20:06 fichtner

Adding -c CACHE_FILE to the referenced zpool import and removing changes to /etc/rc.conf fixes my issue. However, I'm fairly certain that adding additional pools worked in a previous release of OPNSense with no changes.

The behavior of zpool import -a changed when OpenZFS was adopted in FreeBSD 13. In prior releases, -a would default to searching for pools in /dev: https://github.com/freebsd/freebsd-src/blob/b5ad6b488d9e62d820fe90fdce4aee4f4d3d7162/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c#L2634

This changes in OpenZFS, used in 13+, the environment variable ZPOOL_IMPORT_PATH is used as the search path: https://github.com/freebsd/freebsd-src/blob/5fe9c9de03ef3191d216964bc4d8e427d5ed5720/sys/contrib/openzfs/cmd/zpool/zpool_main.c#L3492

nt1 avatar Jun 24 '24 10:06 nt1

Thanks for looking into this. Feel free to provide a PR, but I won't be able to review and commit in the next two weeks.

fichtner avatar Jun 24 '24 11:06 fichtner

(I'd say that ZPOOL_IMPORT_PATH might be the better path forward as I've avoided -c in the past for portability.)

fichtner avatar Jun 24 '24 11:06 fichtner

I'll open a PR to fix this this evening. I can work around this issue for now, so no rush on the review.

nt1 avatar Jun 24 '24 23:06 nt1

I will propose a change that imports pools from zpool cache at boot (just like vanilla FreeBSD) rather than allowing the implementation of zpool import to dictate what's imported.

Pros of using zpool cache

  1. It's explicit, fstab like behavior.
  2. Consistency across reboots, the system boots with the same pools it had at shutdown.
  3. Prevents importing potentially problematic pools that are connected to the system, but have never been successfully imported.
  4. On appliances, may help deter/prevent tampering with root fs.

Cons of using zpool cache

  1. The existence and purpose of zpool cache does not appear to be widely known or extensively documented. This may surprise some people.
  2. Boot behavior is changing. The end user may see a change in available pools after booting this release (easily resolved by importing missing pools and exporting those that shouldn't be there).
  3. The cache file should be present in every existing install. If, for any reason, the cache file is missing, the system will not boot properly.
  4. The path of the cachefile could change in future FreeBSD/OpenZFS releases (which would require a fix to rc).

nt1 avatar Jun 26 '24 18:06 nt1

Experiment is 701dff45b2f, likely in 24.7.2.

fichtner avatar Aug 13 '24 13:08 fichtner

Adding here for reference - this did not go particularly well for some users (as in - unbootable box due to kernel panic) and I'm wondering in which other corner cases this will rear its ugly head.

https://forum.opnsense.org/index.php?topic=42373.msg209269#msg209269 https://forum.opnsense.org/index.php?topic=42387.msg209391#msg209391

doktornotor avatar Aug 22 '24 14:08 doktornotor

I'm not sure about the second one. I'm happy to debug this. Though the peripheral nature of leading into panics is a bit astonishing. I still don't know why zpool-import -a doesn't consider looking for devices to be part of "-a". That seems like a harmful oversight.

fichtner avatar Aug 22 '24 14:08 fichtner

Well, guess I'm out of this race. In hunt for the most vintage HW in the closet, I found this Fujitsu server but that one has ATI ES1000 32MB RAM PCI on board and no AGP slot. 😢 😭

And yeah, I'm also absolutely puzzled as for what does zpool import have to do with AGP graphics.

doktornotor avatar Aug 22 '24 15:08 doktornotor

It’s a working theory. I don’t mind reverting if necessary but we also need to reproduce this remotely without fatalities. In theory we can install the 24.7.1 debug kernel in a working 24.7.1 and then do a tainted zpool import to get a vmcore….

fichtner avatar Aug 22 '24 16:08 fichtner

Meh, OK - the AGP theory is apparently confirmed. Even set hint.agp.0.disabled=1 in loader prompt got some people back into bootable system - 1, 2

doktornotor avatar Aug 24 '24 13:08 doktornotor

Good idea with the device hints helping people to overcome this manually.

fichtner avatar Aug 24 '24 15:08 fichtner