`nvlist_lookup_string ("path"): Cannot allocate memory` when running `grub-install` while resilvering

Open bitonic opened this issue 3 years ago • 4 comments

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | NixOS |
| Distribution Version | 22.05 |
| Kernel Version | 5.15.63 |
| Architecture | x86_64 |
| OpenZFS Version | 2.1.5-1 |

Describe the problem you're observing

I don't know if this is a bug, but I think it's worth reporting. One of my drives failed, so I ran zpool replace to replace it, and the resilvering process started. At the same time, I started installing GRUB on the same new disk, through NixOS's "mirrored boots" option, for boot redundancy (I described the whole process here: https://mazzo.li/posts/hetzner-zfs.html).

In any case, the grub-install invocation looks something like this:

/nix/store/viw4gss1cdqd80kyjz6izsxafvcxadgr-grub-2.06/sbin/grub-install --recheck --root-directory=/tmp/OW6QUrdNk4 /dev/sdb --target=i386-pc 
Installing for i386-pc platform. 
/nix/store/viw4gss1cdqd80kyjz6izsxafvcxadgr-grub-2.06/sbin/grub-install: nvlist_lookup_string ("path"): Cannot allocate memory 

NixOS first has grub-install write the files to a temporary directory and then moves them into the boot partition. As you can see above, I get some internal ZFS error about not being able to allocate some internal data structure (I think, anyway).

I speculated that the fact that resilvering was going on might have been what broke it (since grub-install on the other disks worked), and indeed once the resilvering finished it worked. But it's still weird behavior, so reporting it here.

bitonic avatar Sep 18 '22 14:09 bitonic

Oh just to be clear: my /boot-* partitions are not on ZFS. So I'm not entirely sure how the ZFS filesystem is involved apart from it being used to write the files generated by grub-install.

bitonic avatar Sep 18 '22 14:09 bitonic

A quick search narrows this down to only a few places where this could happen:

lib/libzutil/zutil_import.c:    if (nvlist_lookup_string(nv, ZPOOL_CONFIG_PATH, &path) != 0)
lib/libzutil/zutil_import.c:    error = nvlist_lookup_string(nvroot, ZPOOL_CONFIG_PATH, &val);
lib/libzutil/zutil_import.c:    if (nvlist_lookup_string(nv, ZPOOL_CONFIG_PATH, &path) == 0) {
lib/libzutil/os/linux/zutil_import_os.c:        if (nvlist_lookup_string(nv, ZPOOL_CONFIG_PATH, &path) != 0)
lib/libzutil/os/linux/zutil_import_os.c:        if (nvlist_lookup_string(nv, ZPOOL_CONFIG_PATH, &path) != 0)
lib/libzfs/libzfs_pool.c:           nvlist_lookup_string(tgt, ZPOOL_CONFIG_PATH, &pathname) == 0) {
lib/libzfs/libzfs_pool.c:       } else if (nvlist_lookup_string(nv, ZPOOL_CONFIG_PATH, &tpath) == 0) {
lib/libzfs/os/linux/libzfs_pool_os.c:   if (nvlist_lookup_string(config, ZPOOL_CONFIG_PATH, &path) != 0)
lib/libzpool/util.c:            if (nvlist_lookup_string(cnv, ZPOOL_CONFIG_PATH, &cname) &&

That being said, the kernel rejecting memory allocations when low on memory is a normal thing (although it is rare). If you do echo 3 | sudo tee /proc/sys/vm/drop_caches, it should work.

ryao avatar Sep 19 '22 03:09 ryao

I believe you're barking up the wrong tree, since that error comes from (I believe) grub-core/osdep/unix/getroot.c:

      if (nvlist_lookup_string (children[i], "path", &device) != 0)
        error (1, errno, "nvlist_lookup_string (\"path\")");

So I think you'd need to peer into libnvpair or what's in those objects at that time in GRUB to debug this.

rincebrain avatar Sep 19 '22 04:09 rincebrain

You are right. I had assumed that it was linking against libzfs and triggering this in our code. Instead, it links to both libnvpair and libzfs. It then triggers this in its code.

That said, the grub code is wrong to use errno since the return value is the error. The ENOMEM came from somewhere else in the grub code.

ryao avatar Sep 19 '22 05:09 ryao

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 15 '23 15:10 stale[bot]

Hello,

I am in quite a similar situation, which I solved:

  • NixOS host
  • ZFS pool, with one dead vdev
  • ZFS (unsolicited) resilver forcibly paused, which then triggered an unsolicited (paused) scrub
  • An error nvlist_lookup_string ("path"): Cannot allocate memory when installing GRUB
  • Boot and EFI partitions which are NOT on ZFS (FAT32 partitions)
  • I’m using [boot.loader.grub.mirroredBoots](https://search.nixos.org/options?channel=unstable&from=0&size=50&sort=relevance&type=packages&query=mirroredBoots) on NixOS

The error appeared at the first system boot after I switched to mirroredBoots on NixOS. /boot is no longer a mount point for my boot partition.

Here’s the strace grub-install output I was able to catch:

…
access("/dev/zfs", F_OK)                = 0
openat(AT_FDCWD, "/dev/zfs", O_RDWR|O_EXCL|O_CLOEXEC) = 4
openat(AT_FDCWD, "/dev/zfs", O_RDWR|O_CLOEXEC) = 5
openat(AT_FDCWD, "/sys/module/zfs/properties.dataset", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 6
newfstatat(6, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0
getdents64(6, 0x867dd0 /* 99 entries */, 32768) = 3280
getdents64(6, 0x867dd0 /* 0 entries */, 32768) = 0
close(6)                                = 0
openat(AT_FDCWD, "/sys/module/zfs/properties.pool", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 6
newfstatat(6, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0
getdents64(6, 0x867dd0 /* 38 entries */, 32768) = 1192
getdents64(6, 0x867dd0 /* 0 entries */, 32768) = 0
close(6)                                = 0
openat(AT_FDCWD, "/sys/module/zfs/features.pool", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 6
newfstatat(6, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0
getdents64(6, 0x867dd0 /* 41 entries */, 32768) = 1936
getdents64(6, 0x867dd0 /* 0 entries */, 32768) = 0
close(6)                                = 0
openat(AT_FDCWD, "/sys/module/zfs/properties.vdev", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 6
newfstatat(6, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0
getdents64(6, 0x867dd0 /* 50 entries */, 32768) = 1544
getdents64(6, 0x867dd0 /* 0 entries */, 32768) = 0
close(6)                                = 0
brk(0x898000)                           = 0x898000
ioctl(4, ZFS_IOC_POOL_STATS, 0x7ffd8607a870) = -1 ENOMEM (Cannot allocate memory)
brk(0x888000)                           = 0x888000
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe856cb5000
ioctl(4, ZFS_IOC_POOL_STATS, 0x7ffd8607a870) = 0
brk(0x8a9000)                           = 0x8a9000
munmap(0x7fe856cb5000, 139264)          = 0
newfstatat(AT_FDCWD, "/dev/disk/by-id/ata-ST12…Z5H4-part1", {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x51), ...}, 0) = 0
write(2, "/nix/store/vmvflds3p010s8kx6fgm2"..., 77/nix/store/vmvflds3p010s8kx6fgm2yc7vfip0bmw-grub-2.12-rc1/sbin/grub-install: ) = 77
write(2, "nvlist_lookup_string (\"path\")", 29nvlist_lookup_string ("path")) = 29
write(2, ": Cannot allocate memory", 24: Cannot allocate memory) = 24
write(2, "\n", 1
)                       = 1
close(4)                                = 0
close(5)                                = 0
exit_group(1)                           = ?
+++ exited with 1 +++

It looks like grub-install tries to check which device /boot is on and, as it is now on ZFS (/boot no longer exists as a partition, since I replaced it with /boot1 and /boot2), tries to fetch pool statistics for it using ioctl:

ioctl(4, ZFS_IOC_POOL_STATS, 0x7ffd8607a870) = -1 ENOMEM (Cannot allocate memory)

I don’t understand why grub-install behaves this way, nor why ioctl(…, ZFS_IOC_POOL_STATS, …) fails with ENOMEM, but restoring a valid /boot fixed it for me:

# mv /boot{,.old}
# ln -s /boot1 /boot

I believe the NixOS mirroredBoots option would need a workaround there too.

bLuka avatar May 27 '24 14:05 bLuka