MergerFS mount randomly disappears, only displays ??? when listed
Describe the bug
The MergerFS mount seems to randomly disappear, and trying to ls the filesystem root just gives back "cannot access '/Storage': Input/output error". At this point I need to restart the mergerfs service for it to reappear. However, this means I have to re-export my NFS share, which in turn means I have to remount or restart the services that use it.
I'm having a really hard time narrowing down what could cause it; even now I don't have any idea why it happens, but it has been happening since I set up MergerFS 1-2 months ago. For context, my storage is on a Proxmox box that runs one LXC container with my PostgreSQL server, one VM for Jellyfin, and several VMs that act as K3S nodes. The MergerFS mount is accessed through NFS in both the Jellyfin VM and all the K3S nodes.
I have 4 disks, all with ext4 filesystems, mounted under /mnt as disk1-4. These are then merged and mounted under /Storage.
To Reproduce
As mentioned, it is really random; however, a scheduled backup that runs at midnight in Proxmox seems to be the most reliable trigger. Weirdly, even that fails at random points: sometimes it manages to save the backup completely and the mount only dies after the backup ends and sends my notification email, but I have had instances where it disappeared mid-process.
I also had it disappear while using Radarr or Sonarr to import media, but I have found those are not a reliable way to reproduce it.
Expected behavior
Function as expected. The mount shouldn't disappear and break NFS.
System information:
- OS, kernel version: Linux server 6.5.11-7-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-7 (2023-12-05T09:44Z) x86_64 GNU/Linux
- mergerfs version: mergerfs v2.38.0
- mergerfs settings:
[Unit]
Description=Mergerfs service
[Service]
Type=simple
KillMode=control-group
ExecStart=/usr/bin/mergerfs \
-f \
-o cache.files=partial,moveonenospc=true,category.create=mfs,dropcacheonclose=true,posix_acl=true,noforget,inodecalc=path-hash,fsname=mergerfs \
/mnt/disk* \
/Storage
ExecStop=/bin/fusermount -uz /Storage
Restart=on-failure
[Install]
WantedBy=default.target
- List of drives, filesystems, & sizes:
df -h
Filesystem Size Used Avail Use% Mounted on
udev 16G 0 16G 0% /dev
tmpfs 3.2G 2.9M 3.2G 1% /run
/dev/mapper/pve-root 28G 12G 15G 45% /
tmpfs 16G 34M 16G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
efivarfs 128K 50K 74K 41% /sys/firmware/efi/efivars
/dev/sdf 3.6T 28K 3.4T 1% /mnt/disk4
/dev/sde 3.6T 28K 3.4T 1% /mnt/disk3
/dev/sda 19T 2.2T 16T 13% /mnt/disk1
/dev/sdb 19T 1.3T 16T 8% /mnt/disk2
/dev/fuse 128M 20K 128M 1% /etc/pve
tmpfs 3.2G 0 3.2G 0% /run/user/0
mergerfs 44T 3.4T 38T 9% /Storage
lsblk -f
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
sda ext4 1.0 20T_disk_1 3bea15fe-0c62-42ad-bc73-727c7e6ed147 15.1T 12% /mnt/disk1
sdb ext4 1.0 20T_disk_2 9306d268-2f54-42f3-958b-d8555b470bf0 15.9T 7% /mnt/disk2
sdc
├─sdc1
├─sdc2 vfat FAT32 EDFC-8E51
└─sdc3 LVM2_member LVM2 001 a9c81x-tUS5-CcN9-3w5u-ZF84-ODxd-r21cM9
├─pve-swap swap 1 3a923fc4-0c8f-4ba3-92a8-b3515283e669 [SWAP]
├─pve-root ext4 1.0 ecd841a9-5d7b-4d70-a575-448fb85d8f51 14.2G 43% /
├─pve-data_tmeta
│ └─pve-data-tpool
│ └─pve-data
└─pve-data_tdata
└─pve-data-tpool
└─pve-data
sdd
├─sdd1 ext4 1.0 BigBackup 29c4fd80-9fc5-4d1d-a783-cba4372cffc0
└─sdd2 LVM2_member LVM2 001 pB5Bes-ARIU-XsLl-ryLc-Nw1A-Ofnj-OEzODe
├─vmdata-bigthin_tmeta
│ └─vmdata-bigthin-tpool
│ ├─vmdata-bigthin
│ ├─vmdata-vm--101--disk--0
│ ├─vmdata-vm--102--disk--0
│ ├─vmdata-vm--103--disk--0
│ ├─vmdata-vm--104--disk--0
│ ├─vmdata-vm--105--disk--0
│ ├─vmdata-vm--111--disk--0
│ ├─vmdata-vm--107--disk--0
│ └─vmdata-vm--100--disk--1 ext4 1.0 a2234f63-38da-43fb-877a-a3e836f4004e
└─vmdata-bigthin_tdata
└─vmdata-bigthin-tpool
├─vmdata-bigthin
├─vmdata-vm--101--disk--0
├─vmdata-vm--102--disk--0
├─vmdata-vm--103--disk--0
├─vmdata-vm--104--disk--0
├─vmdata-vm--105--disk--0
├─vmdata-vm--111--disk--0
├─vmdata-vm--107--disk--0
└─vmdata-vm--100--disk--1 ext4 1.0 a2234f63-38da-43fb-877a-a3e836f4004e
sde ext4 1.0 4T_disk_1 06207dd1-fc54-4faf-805d-a880dc432bc4 3.4T 0% /mnt/disk3
sdf ext4 1.0 4T_disk_2 2b06a9fa-901c-4b66-bfdc-8c7e4a09f21f 3.4T 0% /mnt/disk4
- A strace of the application having a problem: Unable to provide due to how the command was run (scheduler)
- strace of mergerfs while app tried to do its thing: (logfile was too large, had to zip it) mergerfs.trace.zip
Additional context
My NFS export:
/Storage *(rw,sync,fsid=0,no_root_squash,no_subtree_check,crossmnt)
All disks have gone through a long selftest using smartctl and report no problems. Example output of the first 20TB disk:
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 726 -
When was the strace of mergerfs taken? While in that broken state? If so... mergerfs looks fine.
Preferably you would trace just before issuing a request to the mergerfs mount and then trace the app you use to generate that error. mergerfs is a proxy. The kernel handles requests between the client app and mergerfs. There are lots of situations where the kernel short-circuits the communication and therefore never sends anything to mergerfs, which is what I suspect is happening (for whatever reason). mergerfs would only return EIO if the underlying filesystem did, and the kernel could return it for numerous reasons. And NFS and FUSE don't always play nice together.
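For example (the paths and exact flags here are just illustrative, not a prescribed procedure), something along these lines would capture both sides at once:

# attach to the running mergerfs process before triggering the failure
strace -f -tt -T -s 256 -o /tmp/mergerfs.trace.txt -p "$(pidof mergerfs)" &
# in another shell, trace the client command hitting the pool so the two traces line up
strace -f -tt -T -s 256 -o /tmp/app.strace.txt ls -l /Storage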
Are you modifying the mergerfs pool on the host? Not through NFS?
I started this trace right before I knew the backup would start, in the hope that I would catch the issue. After checking it a minute or so later I saw that the mount had disappeared, and I stopped the trace.
Not quite sure what you mean by modifying the pool on the host, but I didn't touch the system while the backup was running. In Proxmox you can add the backup target as either a Directory or an NFS mount, but I tried both and it didn't seem to make a difference. During the trace it was set up as a Directory.
I'm pretty sure that I have tried killing NFS completely and even then the backup would randomly fail, but that was a while ago, so I'm not sure anymore.
NFS does not like out-of-band changes, particularly NFSv4. If you have an NFS export and then modify the filesystem directly on the host you export from... you can and will cause problems. It usually leads to stale errors, but it depends on the situation.
Tracing as you did is fine, but since you aren't providing a matching trace of anything trying to interact with the filesystem on the host (or through NFS) I can't pinpoint who is responsible for the error, which is critical. Even tracing mergerfs after the failure starts and then tracing "ls" or "stat" accessing /Storage would answer that question.
So it finally crashed again. I managed to get straces with both 'ls' and 'stat'; hopefully this helps.
ls.strace.txt mergerfs-ls.trace.txt
mergerfs-stat.trace.txt statfolder.strace.txt
On a related note, before I got my 2 big HDDs I was running the 4TB disks on ZFS in a mirror and used the "sharenfs" toggle of ZFS to expose the folders I needed. The caveat there is that at least the top-most folders were all like separate volumes or whatever ZFS calls them, so they were technically separate filesystems. I wonder if that is why it wasn't breaking? That's the only thing I can think of, unless you can find something in the logs above.
Hmm... Well, the "stat" clearly worked, though you didn't stat the file, you statfs'ed it. But the statfs clearly worked and you can see the request in mergerfs. The ls, however, failed with EIO when it tried to stat /Storage, and I don't see any evidence of mergerfs receiving the request. However, that trace shows a file being read (Totally Spies), so something is able to interact with it.
Have you tried disabling all caching? If the kernel caches things it becomes more difficult to debug.
It also fails with cache.files=off, but I can set it to that and do another 'ls' trace if that helps. Is there any other caching that I'm not aware of that I can disable?
See the caching section of the docs. Turn off entry and attr too. And yes, could help to trace that setup.
Got a new trace with the following mergerfs settings:
-o cache.files=off,cache.entry=0,cache.attr=0,moveonenospc=true,category.create=mfs,dropcacheonclose=true,posix_acl=true,noforget,inodecalc=path-hash,fsname=mergerfs
Is there anything more I can provide to help deduce the issue? It is still occurring, sadly. The frequency seems to depend on how many applications are trying to use it. Also, the issue occurred when using Jellyfin, so it also happens after a strictly read-only operation.
If I knew I'd ask for it.
The kernel isn't forwarding the request.
44419 15:35:19.951799 statx(AT_FDCWD, "/Storage", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW|AT_NO_AUTOMOUNT, STATX_MODE|STATX_NLINK|STATX_UID|STATX_GID|STATX_MTIME|STATX_SIZE, 0x7ffcf234e990) = -1 EIO (Input/output error) <0.000006>
vs
33344 15:35:19.359147 writev(4, [{iov_base="`\0\0\0\0\0\0\0\264K\373\0\0\0\0\0", iov_len=16}, {iov_base="0\242R\266\2\0\0\0\252\372\304~\2\0\0\0\26\205\327[\2\0\0\0\0\300}A\0\0\0\0\17\312{A\0\0\0\0\0\20\0\0\377\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=80}], 2) = 96 <0.000007>
33344 15:35:19.359172 read(4, <unfinished ...>
33343 15:35:20.129934 <... read resumed>"8\0\0\0\3\0\0\0\266K\373\0\0\0\0\0\t\0\0\0\0\0\0\0\350\3\0\0\350\3\0\0b\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 1052672) = 56 <1.623557>
The fact that read succeeds means some messages are coming in from the kernel, at the very least statfs requesting info on the mount. But there is no log indicating why the kernel behaves that way.
You have a rather complex setup. All I can suggest is making it less so. It's not feasible for me to recreate what you have. Create a second pool. 1 branch. Keep everything simple. See if you can break it via standard tooling. There are many variables here. We can't keep using the full stack to debug if nothing obvious presents itself.
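For instance (paths and options here are purely illustrative), a minimal single-branch pool could look like:

mkdir -p /mnt/testbranch /mnt/testpool
mergerfs -o cache.files=off,category.create=mfs,fsname=mergerfs-test /mnt/testbranch /mnt/testpool
# export /mnt/testpool over NFS as before, then hammer it from a client with
# standard tools (cp, rsync, find, stat) and see if the EIO state reappears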
Hi,
I'm experiencing a similar issue on the same platform as OP (as far as I can tell from his shared outputs): Proxmox VE.
PVE version: pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.11-7-pve)
I used to have a VM under Promox running MergerFS (with an HBA with PCIe passthrough to the VM).
Last week I changed this so my Proxmox host runs MergerFS with the disks attached directly to the motherboard instead of using the HBA. (Server is way less power hungry that way.)
After a day or 2 I noticed that my services, which rely on the MergerFS mount, stopped working.
An ls of the mount resulted in an Input/output error. Killing MergerFS and running mount -a again (I'm using a MergerFS line in my /etc/fstab) is the only solution.
Today the problem occurred again. I was using MergerFS version 2.33.5 from the Debian repos that are configured in Proxmox.
I just updated to version 2.39.0 (latest release) but don't expect this to be the solution.
In my previous setup (with the VM), the last version I was using was 2.21.0.
Any steps I can take to help us troubleshoot the problem? I love MergerFS, have been using it for years (in the VM setup I explained above). I'd like to keep using it but losing the mount every 2 days is not an option of course.
dmesg reports no issues with the relevant disks.
Thanks in advance for the help!
Any steps I can take to help us troubleshoot the problem?
The same as I describe in this thread and the docs.
Quick update from my side. I tried to reduce the setup to the best of my abilities. First of all, I made a second mount only for Proxmox backups. This mount never had any issues, so we can cross that out. Second, I modified my existing MergerFS mount so it only contained a single disk. In addition, I removed all services except Jellyfin.
So setup was:
Disk 1 -> MergerFS mount -> shared over NFS -> Jellyfin VM (folders mounted using fstab)
This seemed the most stable, but it still failed. When I told Jellyfin to scan my library from zero it got to something like 96%, and suddenly the mount vanished. This seems like the most "reliable" way to repro; I managed to trigger it twice in a row. Still no luck triggering it using any basic Linux command.
Having said all this, after 3-4 months of this issue persisting and with no light at the end of the tunnel, I decided to throw money at the problem: I bought another disk and moved back over to ZFS. I would love to use MergerFS in the setup I described in the original post, but having my main storage disappear from my production systems is not a state to be in long-term.
@Janbong And are you using NFS too?
@gogo199432 Are your mergerfs and NFS settings the same as the original post? Have you tried modifying NFS settings? Do you have other FUSE filesystems mounted? Are you doing any modifying of the export outside NFS?
I'm also relying on NFS to let services running on VMs access the MergerFS mount. But that was also the case when MergerFS was running on one of the VMs, which worked without any issues for years.
Yes, but was it the same kernel? Same NFS config? Same NFS version?
- Different kernel. Now using pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.11-7-pve). The Ubuntu 18.04 VM which worked for years was running kernel 4.15.0-213-generic.
- Exact same NFS config; copied it from the VM to the PVE host when I migrated.
What were the versions? What are the settings?
I was using the same settings as previously, except for reducing to a single disk. Apart from what we talked about with caching, I have not been modifying the NFS settings. I tried removing the posix_acl setting entirely, but that made no difference. If you mean another system apart from MergerFS that uses FUSE, then no, I do not have anything else. As far as I can tell I have been doing no modifications outside of NFS; as I have said, the only out-of-band thing was the backup, but I moved that to a second MergerFS mount. So at the time of the issue with Jellyfin there was nothing running on Proxmox that would access the mount.
What were the versions? What are the settings?
Still looking for a way to check the version. I think both v3 and v4 are enabled, looking at the output of nfsstat -s:
root@pve1:~# nfsstat -s
Server rpc stats:
calls badcalls badfmt badauth badclnt
2625399 0 0 0 0
Server nfs v3:
null getattr setattr lookup access
6 0% 844071 32% 204 0% 797 0% 150711 5%
readlink read write create mkdir
0 0% 1206289 45% 2833 0% 88 0% 4 0%
symlink mknod remove rmdir rename
0 0% 0 0% 0 0% 0 0% 0 0%
link readdir readdirplus fsstat fsinfo
0 0% 0 0% 4324 0% 414536 15% 8 0%
pathconf commit
4 0% 1503 0%
Server nfs v4:
null compound
4 6% 54 93%
Server nfs v4 operations:
op0-unused op1-unused op2-future access close
0 0% 0 0% 0 0% 2 1% 0 0%
commit create delegpurge delegreturn getattr
0 0% 0 0% 0 0% 0 0% 36 24%
getfh link lock lockt locku
4 2% 0 0% 0 0% 0 0% 0 0%
lookup lookup_root nverify open openattr
2 1% 0 0% 0 0% 0 0% 0 0%
open_conf open_dgrd putfh putpubfh putrootfh
0 0% 0 0% 34 23% 0 0% 8 5%
read readdir readlink remove rename
0 0% 0 0% 0 0% 0 0% 0 0%
renew restorefh savefh secinfo setattr
0 0% 0 0% 0 0% 0 0% 0 0%
setcltid setcltidconf verify write rellockowner
0 0% 0 0% 0 0% 0 0% 0 0%
bc_ctl bind_conn exchange_id create_ses destroy_ses
0 0% 0 0% 4 2% 2 1% 2 1%
free_stateid getdirdeleg getdevinfo getdevlist layoutcommit
0 0% 0 0% 0 0% 0 0% 0 0%
layoutget layoutreturn secinfononam sequence set_ssv
0 0% 0 0% 4 2% 44 30% 0 0%
test_stateid want_deleg destroy_clid reclaim_comp allocate
0 0% 0 0% 2 1% 2 1% 0 0%
copy copy_notify deallocate ioadvise layouterror
0 0% 0 0% 0 0% 0 0% 0 0%
layoutstats offloadcancel offloadstatus readplus seek
0 0% 0 0% 0 0% 0 0% 0 0%
write_same
0 0%
Settings:
root@pve1:~# cat /etc/exports
# /etc/exports: the access control list for filesystems which may be exported
# to NFS clients. See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes hostname1(rw,sync,no_subtree_check) hostname2(ro,sync,no_subtree_check)
#
# Example for NFSv4:
# /srv/nfs4 gss/krb5i(rw,sync,fsid=0,crossmnt,no_subtree_check)
# /srv/nfs4/homes gss/krb5i(rw,sync,no_subtree_check)
#
/mnt/storage redacted_ip/24(rw,async,no_subtree_check,fsid=0) redacted_ip/8(rw,async,no_subtree_check,fsid=0) redacted_ip/24(rw,async,no_subtree_check,fsid=0)
/mnt/seed redacted_ip/24(rw,async,no_subtree_check,fsid=1) redacted_ip/8(rw,async,no_subtree_check,fsid=1) redacted_ip/24(rw,async,no_subtree_check,fsid=1)
Client side reporting that it's using NFSv3:
21:02:00 in ~ at k8s ➜ cat /proc/mounts | grep nfs
fs1.bongers.lan:/mnt/storage /mnt/storage nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,na
.4.90,mountvers=3,mountport=51761,mountproto=udp,local_lock=none,addr=redacted_ip
fs1.bongers.lan:/mnt/seed /mnt/seed nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=2
mountvers=3,mountport=51761,mountproto=udp,local_lock=none,addr=redacted_ip
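For reference, a quick way to confirm which NFS protocol versions the server side has enabled (assuming the standard nfsd proc interface is available):

cat /proc/fs/nfsd/versions
# prints something like: +3 +4 +4.1 +4.2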
The pattern I'm seeing is Proxmox + NFS. While I certainly won't rule out a mergerfs bug, the fact that everyone who experiences this seems to have that setup suggests it could be the Proxmox kernel. I'm going to reach out to the FUSE community to see if anyone has any ideas.
Some notes:
- I set up a dedicated machine running the latest Armbian (x86), mounted an SSD formatted with ext4, put a mergerfs mount over it, exported it, and from another machine mounted that export. I used a stress tool I wrote (bbf) to hammer the NFS mount for several hours. No issues.
- I will set up Proxmox and attempt the same. If the stress test fails to reproduce the issue I'll try adding some media and installing Plex, I guess. For some reason Proxmox doesn't like the extra machine I have been testing with, so I'll try with a VM.
- I think what is happening is that NFS is getting into a bad state and causing mergerfs (or the kernel side of the relationship) to get into a bad state due to... something... maybe metadata changing under it in a way it doesn't like, for the root of the mount. And then that "sticks", and no new requests can be made because the lookup of the root is failing. I was under the impression that the kernel does not keep that error state forever, but maybe I'm wrong, or maybe it is triggering something different than normal. Either way I will probably need to fabricate errors to see if it behaves similarly to what you all are seeing.
All that said: NFS and FUSE just do not play nicely together. There are fundamental issues with how the two interact. Even if those were fixed it would still be complicated to get this working flawlessly on my end. I've gotten some feedback from kernel devs on the topic and will write something up about it later once I test some things out, but I think after this I'm going to have to recommend people not use NFS, or at least say "you're on your own". I'll offer suggestions on setup to minimize issues, but at the end of the day 100% support is not likely.
Looking over the kernel code I've narrowed down situations where EIO errors can be returned and are "sticky". I think I can add some debug code to detect the situation and then maybe we can figure out what is happening.
I appreciate you looking into this so thoroughly. Thanks!
In the meantime, the problem occurred two more times on my end.
- Once, the "local" mount itself was working, but the NFS mounts on my VM returned a "stale file" error.
- Just a couple of hours ago, the full error as described above: Input/output error on the NFS server side as well as on the client VMs mounting the exported share.
I have disks spinning down automatically. I found in the logs that some time before the error occurred, all disks started spinning up, which leads me to believe some kind of scan was initiated. Possibly a Plex scan, or a Sonarr/Radarr scan, something like that; in any case, something that reads a lot of files in a short period of time. I'm not sure how long before the actual error this happened, as I am basing this on the timestamp of when someone using one of the services messaged me that it had suddenly stopped working for them.
To be completely transparent on how I set things up: my (3) data disks are XFS, my SnapRAID parity is ext4. Though the parity disk is not really relevant here, I guess, since it is not part of the MergerFS mount.
I get what you are saying about FUSE + NFS not working together nicely. Still, I was running my previous (similar) setup for years without ANY issues at all. My bet is actually that Proxmox is doing something MergerFS can't handle. It's a little too coincidental that @gogo199432 is also running MergerFS on Proxmox and running into the same issue.
My previous setup was MergerFS on an Ubuntu VM, albeit with an HBA in between, as opposed to my current setup on Proxmox where the drives are connected directly to the motherboard via SATA.
What is your mergerfs config? And are you changing things out of band? Those two errors are pretty different things.
I get what you are saying about FUSE + NFS not working together nicely. Still, I was running my previous (similar) setup for years without ANY issues at all.
Yes, but mergerfs, the kernel, and the NFS code have all evolved. The kernel code has gotten more strict about certain security concerns in and after 5.14.
If the kernel marks an inode bad (literally called fuse_make_bad(inode)) there is nothing I can do. That is why I'm trying to understand how NFS is triggering the root to be marked as such, because this is not a unique situation; NFS shouldn't cause an issue any more than normal use. If there is a bug in the kernel that is leading to this I likely can't do much about it until it is addressed by the kernel devs, or Proxmox updates their kernel.
my /etc/fstab (MergerFS settings):
/mnt/disk* /mnt/storage fuse.mergerfs direct_io,defaults,allow_other,noforget,use_ino,minfreespace=50G,fsname=mergerfs 0 0
@Janbong As mentioned in the docs, direct_io, allow_other, and use_ino are deprecated. And for NFS usage you should really be setting inodecalc=path-hash for consistent inodes.
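As a purely illustrative starting point (not your exact final config; adjust the branches and minfreespace to your setup), the fstab line above would become something like:

/mnt/disk* /mnt/storage fuse.mergerfs defaults,minfreespace=50G,noforget,inodecalc=path-hash,fsname=mergerfs 0 0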
Would it be possible to try building the nfsdebug branch? Or tell me what version of Debian Proxmox is based on and I can build packages.
I put in debugging info that will be printed to syslog (journalctl) in cases that the kernel would normally mark things as errored.
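In case it helps, building from a branch is roughly the following (check the repo's install docs for the exact steps and build dependencies; Proxmox VE 8 is based on Debian 12 "bookworm"):

git clone https://github.com/trapexit/mergerfs.git
cd mergerfs
git checkout nfsdebug
make deb                          # or: make && sudo make install
sudo dpkg -i ../mergerfs_*.deb    # exact package path/name may differ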