MergerFS mount randomly disappears, only displays ??? when listed
Describe the bug
The MergerFS mount seems to randomly disappear, and trying to ls the filesystem root just gives back "cannot access '/Storage': Input/output error". At this point I need to restart the mergerfs service for it to reappear. However, this means I have to re-export my NFS share, which in turn means I have to remount or restart the services that use it.
I'm having a really hard time narrowing down what could cause it; even now I don't have any idea why it happens, but it has been happening since I set up MergerFS 1-2 months ago. For context, my storage is on a Proxmox box that runs one LXC container with my PostgreSQL server, one VM for Jellyfin, and several VMs that act as K3S nodes. The MergerFS mount is accessed through NFS in both the Jellyfin VM and all the K3S nodes.
I have 4 disks, all with ext4 filesystems, mounted under /mnt as disk1-4. These are then merged and mounted under /Storage.
To Reproduce
As mentioned, it is really random; however, a scheduled backup that runs at midnight in Proxmox seems to be the most reliable trigger. Weirdly, even that fails at random points: sometimes it manages to save the backup completely and the mount only dies after the backup ends and sends my notification email, but I have had instances where it disappeared mid-process.
I also had it disappear while using Radarr or Sonarr to import media, but I have found those are not a reliable way to reproduce it.
Expected behavior
Function as expected. The mount shouldn't disappear and break NFS.
System information:
- OS, kernel version: Linux server 6.5.11-7-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-7 (2023-12-05T09:44Z) x86_64 GNU/Linux
- mergerfs version: mergerfs v2.38.0
- mergerfs settings:
[Unit]
Description=Mergerfs service
[Service]
Type=simple
KillMode=control-group
ExecStart=/usr/bin/mergerfs \
-f \
-o cache.files=partial,moveonenospc=true,category.create=mfs,dropcacheonclose=true,posix_acl=true,noforget,inodecalc=path-hash,fsname=mergerfs \
/mnt/disk* \
/Storage
ExecStop=/bin/fusermount -uz /Storage
Restart=on-failure
[Install]
WantedBy=default.target
- List of drives, filesystems, & sizes:
df -h
Filesystem Size Used Avail Use% Mounted on
udev 16G 0 16G 0% /dev
tmpfs 3.2G 2.9M 3.2G 1% /run
/dev/mapper/pve-root 28G 12G 15G 45% /
tmpfs 16G 34M 16G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
efivarfs 128K 50K 74K 41% /sys/firmware/efi/efivars
/dev/sdf 3.6T 28K 3.4T 1% /mnt/disk4
/dev/sde 3.6T 28K 3.4T 1% /mnt/disk3
/dev/sda 19T 2.2T 16T 13% /mnt/disk1
/dev/sdb 19T 1.3T 16T 8% /mnt/disk2
/dev/fuse 128M 20K 128M 1% /etc/pve
tmpfs 3.2G 0 3.2G 0% /run/user/0
mergerfs 44T 3.4T 38T 9% /Storage
lsblk -f
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
sda ext4 1.0 20T_disk_1 3bea15fe-0c62-42ad-bc73-727c7e6ed147 15.1T 12% /mnt/disk1
sdb ext4 1.0 20T_disk_2 9306d268-2f54-42f3-958b-d8555b470bf0 15.9T 7% /mnt/disk2
sdc
├─sdc1
├─sdc2 vfat FAT32 EDFC-8E51
└─sdc3 LVM2_member LVM2 001 a9c81x-tUS5-CcN9-3w5u-ZF84-ODxd-r21cM9
├─pve-swap swap 1 3a923fc4-0c8f-4ba3-92a8-b3515283e669 [SWAP]
├─pve-root ext4 1.0 ecd841a9-5d7b-4d70-a575-448fb85d8f51 14.2G 43% /
├─pve-data_tmeta
│ └─pve-data-tpool
│ └─pve-data
└─pve-data_tdata
└─pve-data-tpool
└─pve-data
sdd
├─sdd1 ext4 1.0 BigBackup 29c4fd80-9fc5-4d1d-a783-cba4372cffc0
└─sdd2 LVM2_member LVM2 001 pB5Bes-ARIU-XsLl-ryLc-Nw1A-Ofnj-OEzODe
├─vmdata-bigthin_tmeta
│ └─vmdata-bigthin-tpool
│ ├─vmdata-bigthin
│ ├─vmdata-vm--101--disk--0
│ ├─vmdata-vm--102--disk--0
│ ├─vmdata-vm--103--disk--0
│ ├─vmdata-vm--104--disk--0
│ ├─vmdata-vm--105--disk--0
│ ├─vmdata-vm--111--disk--0
│ ├─vmdata-vm--107--disk--0
│ └─vmdata-vm--100--disk--1 ext4 1.0 a2234f63-38da-43fb-877a-a3e836f4004e
└─vmdata-bigthin_tdata
└─vmdata-bigthin-tpool
├─vmdata-bigthin
├─vmdata-vm--101--disk--0
├─vmdata-vm--102--disk--0
├─vmdata-vm--103--disk--0
├─vmdata-vm--104--disk--0
├─vmdata-vm--105--disk--0
├─vmdata-vm--111--disk--0
├─vmdata-vm--107--disk--0
└─vmdata-vm--100--disk--1 ext4 1.0 a2234f63-38da-43fb-877a-a3e836f4004e
sde ext4 1.0 4T_disk_1 06207dd1-fc54-4faf-805d-a880dc432bc4 3.4T 0% /mnt/disk3
sdf ext4 1.0 4T_disk_2 2b06a9fa-901c-4b66-bfdc-8c7e4a09f21f 3.4T 0% /mnt/disk4
- A strace of the application having a problem: Unable to provide due to how the command was run (scheduler)
- strace of mergerfs while app tried to do its thing: (logfile was too large, had to zip it) mergerfs.trace.zip
Additional context
My NFS export:
/Storage *(rw,sync,fsid=0,no_root_squash,no_subtree_check,crossmnt)
All disks have gone through a long selftest using smartctl and report no problems. Example output of the first 20TB disk:
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 726 -
When was the strace of mergerfs taken? While in that broken state? If so... mergerfs looks fine.
Preferably you would trace just before issuing a request to the mergerfs mount and then trace the app you use to generate that error. mergerfs is a proxy. The kernel handles requests between the client app and mergerfs. There are lots of situations where the kernel short-circuits the communication and therefore never sends anything to mergerfs, which is what I suspect is happening (for whatever reason). mergerfs would only return EIO if the underlying filesystem did, and the kernel could return it for numerous reasons. And NFS and FUSE don't always play nice together.
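For example (the paths and exact flags here are just illustrative, not a prescribed procedure), something along these lines would capture both sides at once:

# attach to the running mergerfs process before triggering the failure
strace -f -tt -T -s 256 -o /tmp/mergerfs.trace.txt -p "$(pidof mergerfs)" &
# in another shell, trace the client command hitting the pool so the two traces line up
strace -f -tt -T -s 256 -o /tmp/app.strace.txt ls -l /Storage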
Are you modifying the mergerfs pool on the host? Not through NFS?
I started this trace right before I knew the backup would start, in the hope that I would catch the issue. After checking it a minute or so later I saw that the mount had disappeared, and I stopped the trace.
Not quite sure what you mean by modifying the pool on the host, but I didn't touch the system while the backup was running. In Proxmox you can add the backup target as either a Directory or an NFS mount, but I tried both and it didn't seem to make a difference. During the trace it was set up as a Directory.
I'm pretty sure that I have tried killing NFS completely and even then the backup would randomly fail, but that was a while ago, so I'm not sure anymore.
NFS does not like out-of-band changes, particularly NFSv4. If you have an NFS export and then modify the filesystem directly on the host you export from... you can and will cause problems. It usually leads to stale errors, but it depends on the situation.
Tracing as you did is fine, but since you aren't providing a matching trace of anything trying to interact with the filesystem on the host (or through NFS) I can't pinpoint who is responsible for the error, which is critical. Even tracing mergerfs after the failure starts and then tracing "ls" or "stat" accessing /Storage would answer that question.
So it finally crashed again. I managed to get straces with both 'ls' and 'stat'; hopefully this helps.
ls.strace.txt mergerfs-ls.trace.txt
mergerfs-stat.trace.txt statfolder.strace.txt
On a related note, before I got my 2 big HDDs I was running the 4TB disks on ZFS in a mirror and used the "sharenfs" toggle of ZFS to expose the folders I needed. The caveat there is that at least the top-most folders were all like separate volumes or whatever ZFS calls them, so they were technically separate filesystems. I wonder if that is why it wasn't breaking? That's the only thing I can think of, unless you can find something in the logs above.
Hmm... Well, the "stat" clearly worked, though you didn't stat the file, you statfs'ed it. But the statfs clearly worked and you can see the request in mergerfs. The ls, however, failed with EIO when it tried to stat /Storage, and I don't see any evidence of mergerfs receiving the request. However, that trace shows a file being read (Totally Spies), so something is able to interact with it.
Have you tried disabling all caching? If the kernel caches things it becomes more difficult to debug.
It also fails with cache.files=off, but I can set it to that and do another 'ls' trace if that helps. Is there any other caching that I'm not aware of that I can disable?
See the caching section of the docs. Turn off entry and attr too. And yes, could help to trace that setup.
Got a new trace with the following mergerfs settings:
-o cache.files=off,cache.entry=0,cache.attr=0,moveonenospc=true,category.create=mfs,dropcacheonclose=true,posix_acl=true,noforget,inodecalc=path-hash,fsname=mergerfs
Is there anything more I can provide to help deduce the issue? It is still occurring, sadly. The frequency seems to depend on how many applications are trying to use it. Also, the issue occurred when using Jellyfin, so it also happens after a strictly read-only operation.
If I knew I'd ask for it.
The kernel isn't forwarding the request.
44419 15:35:19.951799 statx(AT_FDCWD, "/Storage", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW|AT_NO_AUTOMOUNT, STATX_MODE|STATX_NLINK|STATX_UID|STATX_GID|STATX_MTIME|STATX_SIZE, 0x7ffcf234e990) = -1 EIO (Input/output error) <0.000006>
vs
33344 15:35:19.359147 writev(4, [{iov_base="`\0\0\0\0\0\0\0\264K\373\0\0\0\0\0", iov_len=16}, {iov_base="0\242R\266\2\0\0\0\252\372\304~\2\0\0\0\26\205\327[\2\0\0\0\0\300}A\0\0\0\0\17\312{A\0\0\0\0\0\20\0\0\377\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=80}], 2) = 96 <0.000007>
33344 15:35:19.359172 read(4, <unfinished ...>
33343 15:35:20.129934 <... read resumed>"8\0\0\0\3\0\0\0\266K\373\0\0\0\0\0\t\0\0\0\0\0\0\0\350\3\0\0\350\3\0\0b\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 1052672) = 56 <1.623557>
The fact that read succeeds means some messages are coming in from the kernel, at the very least statfs requesting info on the mount. But there is no log indicating why the kernel behaves that way.
You have a rather complex setup. All I can suggest is making it less so. It's not feasible for me to recreate what you have. Create a second pool. 1 branch. Keep everything simple. See if you can break it via standard tooling. There are many variables here. We can't keep using the full stack to debug if nothing obvious presents itself.
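For instance (paths and options here are purely illustrative), a minimal single-branch pool could look like:

mkdir -p /mnt/testbranch /mnt/testpool
mergerfs -o cache.files=off,category.create=mfs,fsname=mergerfs-test /mnt/testbranch /mnt/testpool
# export /mnt/testpool over NFS as before, then hammer it from a client with
# standard tools (cp, rsync, find, stat) and see if the EIO state reappears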
Hi,
I'm experiencing a similar issue on the same platform as OP (as far as I can tell from his shared outputs): Proxmox VE.
PVE version: pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.11-7-pve)
I used to have a VM under Promox running MergerFS (with an HBA with PCIe passthrough to the VM).
Last week I changed this so my Proxmox host runs MergerFS with the disks attached directly to the motherboard instead of using the HBA. (Server is way less power hungry that way.)
After a day or 2 I noticed that my services, which rely on the MergerFS mount, stopped working.
An ls of the mount resulted in an Input/output error. Killing MergerFS and running mount -a again (I'm using a MergerFS line in my /etc/fstab) is the only solution.
Today the problem occurred again. I was using MergerFS version 2.33.5 from the Debian repos that are configured in Proxmox.
I just updated to version 2.39.0 (latest release) but don't expect this to be the solution.
In my previous setup (with the VM), the last version I was using was 2.21.0.
Any steps I can take to help us troubleshoot the problem? I love MergerFS, have been using it for years (in the VM setup I explained above). I'd like to keep using it but losing the mount every 2 days is not an option of course.
dmesg reports no issues with the relevant disks.
Thanks in advance for the help!
Any steps I can take to help us troubleshoot the problem?
The same as I describe in this thread and the docs.
Quick update from my side. I tried to reduce the setup to the best of my abilities. First of all, I made a second mount only for Proxmox backups. This mount never had any issues, so we can cross that out. Second, I modified my existing MergerFS mount so it only contained a single disk. In addition, I removed all services except Jellyfin.
So setup was:
Disk 1 -> MergerFS mount -> shared over NFS -> Jellyfin VM (folders mounted using fstab)
This seemed the most stable, but it still failed. When I told Jellyfin to scan my library from zero it got to something like 96%, and suddenly the mount vanished. This seems like the most "reliable" way to repro; I managed to trigger it twice in a row. Still no luck triggering it using any basic Linux command.
Having said all this, after 3-4 months of this issue persisting and with no light at the end of the tunnel, I decided to throw money at the problem: I bought another disk and moved back over to ZFS. I would love to use MergerFS in the setup I described in the original post, but having my main storage disappear from my production systems is not a state to be in long-term.
@Janbong And are you using NFS too?
@gogo199432 Are your mergerfs and NFS settings the same as the original post? Have you tried modifying NFS settings? Do you have other FUSE filesystems mounted? Are you doing any modifying of the export outside NFS?
I'm also relying on NFS to let services running on VMs access the MergerFS mount. But that was also the case when MergerFS was running on one of the VMs, which worked without any issues for years.
Yes, but was it the same kernel? Same NFS config? Same NFS version?
- Different kernel. Now using pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.11-7-pve). The Ubuntu 18.04 VM which worked for years was running kernel 4.15.0-213-generic.
- Exact same NFS config; copied it from the VM to the PVE host when I migrated.
What were the versions? What are the settings?
I was using the same settings as previously, except for reducing to a single disk. Apart from what we talked about with caching, I have not been modifying the NFS settings. I tried removing the posix_acl setting entirely, but that made no difference. If you mean another system apart from MergerFS that uses FUSE, then no, I do not have anything else. As far as I can tell I have been doing no modifications outside of NFS; as I have said, the only out-of-band thing was the backup, but I moved that to a second MergerFS mount. So at the time of the issue with Jellyfin there was nothing running on Proxmox that would access the mount.
What were the versions? What are the settings?
Still looking for a way to check the version. I think both v3 and v4 are enabled, looking at the output of nfsstat -s:
root@pve1:~# nfsstat -s
Server rpc stats:
calls badcalls badfmt badauth badclnt
2625399 0 0 0 0
Server nfs v3:
null getattr setattr lookup access
6 0% 844071 32% 204 0% 797 0% 150711 5%
readlink read write create mkdir
0 0% 1206289 45% 2833 0% 88 0% 4 0%
symlink mknod remove rmdir rename
0 0% 0 0% 0 0% 0 0% 0 0%
link readdir readdirplus fsstat fsinfo
0 0% 0 0% 4324 0% 414536 15% 8 0%
pathconf commit
4 0% 1503 0%
Server nfs v4:
null compound
4 6% 54 93%
Server nfs v4 operations:
op0-unused op1-unused op2-future access close
0 0% 0 0% 0 0% 2 1% 0 0%
commit create delegpurge delegreturn getattr
0 0% 0 0% 0 0% 0 0% 36 24%
getfh link lock lockt locku
4 2% 0 0% 0 0% 0 0% 0 0%
lookup lookup_root nverify open openattr
2 1% 0 0% 0 0% 0 0% 0 0%
open_conf open_dgrd putfh putpubfh putrootfh
0 0% 0 0% 34 23% 0 0% 8 5%
read readdir readlink remove rename
0 0% 0 0% 0 0% 0 0% 0 0%
renew restorefh savefh secinfo setattr
0 0% 0 0% 0 0% 0 0% 0 0%
setcltid setcltidconf verify write rellockowner
0 0% 0 0% 0 0% 0 0% 0 0%
bc_ctl bind_conn exchange_id create_ses destroy_ses
0 0% 0 0% 4 2% 2 1% 2 1%
free_stateid getdirdeleg getdevinfo getdevlist layoutcommit
0 0% 0 0% 0 0% 0 0% 0 0%
layoutget layoutreturn secinfononam sequence set_ssv
0 0% 0 0% 4 2% 44 30% 0 0%
test_stateid want_deleg destroy_clid reclaim_comp allocate
0 0% 0 0% 2 1% 2 1% 0 0%
copy copy_notify deallocate ioadvise layouterror
0 0% 0 0% 0 0% 0 0% 0 0%
layoutstats offloadcancel offloadstatus readplus seek
0 0% 0 0% 0 0% 0 0% 0 0%
write_same
0 0%
Settings:
root@pve1:~# cat /etc/exports
# /etc/exports: the access control list for filesystems which may be exported
# to NFS clients. See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes hostname1(rw,sync,no_subtree_check) hostname2(ro,sync,no_subtree_check)
#
# Example for NFSv4:
# /srv/nfs4 gss/krb5i(rw,sync,fsid=0,crossmnt,no_subtree_check)
# /srv/nfs4/homes gss/krb5i(rw,sync,no_subtree_check)
#
/mnt/storage redacted_ip/24(rw,async,no_subtree_check,fsid=0) redacted_ip/8(rw,async,no_subtree_check,fsid=0) redacted_ip/24(rw,async,no_subtree_check,fsid=0)
/mnt/seed redacted_ip/24(rw,async,no_subtree_check,fsid=1) redacted_ip/8(rw,async,no_subtree_check,fsid=1) redacted_ip/24(rw,async,no_subtree_check,fsid=1)
Client side reporting that it's using NFSv3:
21:02:00 in ~ at k8s ➜ cat /proc/mounts | grep nfs
fs1.bongers.lan:/mnt/storage /mnt/storage nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,na
.4.90,mountvers=3,mountport=51761,mountproto=udp,local_lock=none,addr=redacted_ip
fs1.bongers.lan:/mnt/seed /mnt/seed nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=2
mountvers=3,mountport=51761,mountproto=udp,local_lock=none,addr=redacted_ip
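For reference, a quick way to confirm which NFS protocol versions the server side has enabled (assuming the standard nfsd proc interface is available):

cat /proc/fs/nfsd/versions
# prints something like: +3 +4 +4.1 +4.2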
The pattern I'm seeing is Proxmox + NFS. While I certainly won't rule out a mergerfs bug, the fact that everyone who experiences this seems to have that setup suggests it could be the Proxmox kernel. I'm going to reach out to the FUSE community to see if anyone has any ideas.
Some notes:
- I set up a dedicated machine running the latest Armbian (x86), mounted an SSD formatted with ext4, put a mergerfs mount over it, exported it, and from another machine mounted that export. I used a stress tool I wrote (bbf) to hammer the NFS mount for several hours. No issues.
- I will set up Proxmox and attempt the same. If the stress test fails to reproduce the issue I'll try adding some media and installing Plex, I guess. For some reason Proxmox doesn't like the extra machine I have been testing with, so I'll try with a VM.
- I think what is happening is that NFS is getting into a bad state and causing mergerfs (or the kernel side of the relationship) to get into a bad state due to... something... maybe metadata changing under it in a way it doesn't like, for the root of the mount. And then that "sticks", and no new requests can be made because the lookup of the root is failing. I was under the impression that the kernel does not keep that error state forever, but maybe I'm wrong, or maybe it is triggering something different than normal. Either way I will probably need to fabricate errors to see if it behaves similarly to what you all are seeing.
All that said: NFS and FUSE just do not play nicely together. There are fundamental issues with how the two interact. Even if those were fixed it would still be complicated to get this working flawlessly on my end. I've gotten some feedback from kernel devs on the topic and will write something up about it later once I test some things out, but I think after this I'm going to have to recommend people not use NFS, or at least say "you're on your own". I'll offer suggestions on setup to minimize issues, but at the end of the day 100% support is not likely.
Looking over the kernel code I've narrowed down situations where EIO errors can be returned and are "sticky". I think I can add some debug code to detect the situation and then maybe we can figure out what is happening.
I appreciate you looking into this so thoroughly. Thanks!
In the meantime, the problem occurred two more times on my end.
- Once, the "local" mount itself was working, but the NFS mounts on my VM returned a "stale file" error.
- Just a couple of hours ago, the full error as described above: Input/output error on the NFS server side as well as on the client VMs mounting the exported share.
I have disks spinning down automatically. I found in the logs that some time before the error occurred, all disks started spinning up, which leads me to believe some kind of scan was initiated. Possibly a Plex scan, or a Sonarr/Radarr scan, something like that; in any case, something that reads a lot of files in a short period of time. I'm not sure how long before the actual error this happened, as I am basing this on the timestamp of when someone using one of the services messaged me that it had suddenly stopped working for them.
To be completely transparent on how I set things up: my (3) data disks are XFS, my SnapRAID parity is ext4. Though the parity disk is not really relevant here, I guess, since it is not part of the MergerFS mount.
I get what you are saying about FUSE + NFS not working together nicely. Still, I was running my previous (similar) setup for years without ANY issues at all. My bet is actually that Proxmox is doing something MergerFS can't handle. It's a little too coincidental that @gogo199432 is also running MergerFS on Proxmox and running into the same issue.
My previous setup was MergerFS on an Ubuntu VM, albeit with an HBA in between, as opposed to my current setup on Proxmox where the drives are connected directly to the motherboard via SATA.
What is your mergerfs config? And are you changing things out of band? Those two errors are pretty different things.
I get what you are saying about FUSE + NFS not working together nicely. Still, I was running my previous (similar) setup for years without ANY issues at all.
Yes, but mergerfs, the kernel, and the NFS code have all evolved. The kernel code has gotten more strict about certain security concerns in and after 5.14.
If the kernel marks an inode bad (literally called fuse_make_bad(inode)) there is nothing I can do. That is why I'm trying to understand how NFS is triggering the root to be marked as such, because this is not a unique situation; NFS shouldn't cause an issue any more than normal use. If there is a bug in the kernel that is leading to this I likely can't do much about it until it is addressed by the kernel devs, or Proxmox updates their kernel.
my /etc/fstab (MergerFS settings):
/mnt/disk* /mnt/storage fuse.mergerfs direct_io,defaults,allow_other,noforget,use_ino,minfreespace=50G,fsname=mergerfs 0 0
@Janbong As mentioned in the docs, direct_io, allow_other, and use_ino are deprecated. And for NFS usage you should really be setting inodecalc=path-hash for consistent inodes.
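As a purely illustrative starting point (not your exact final config; adjust the branches and minfreespace to your setup), the fstab line above would become something like:

/mnt/disk* /mnt/storage fuse.mergerfs defaults,minfreespace=50G,noforget,inodecalc=path-hash,fsname=mergerfs 0 0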
Would it be possible to try building the nfsdebug branch? Or tell me what version of Debian Proxmox is based on and I can build packages.
I put in debugging info that will be printed to syslog (journalctl) in cases that the kernel would normally mark things as errored.
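In case it helps, building from a branch is roughly the following (check the repo's install docs for the exact steps and build dependencies; Proxmox VE 8 is based on Debian 12 "bookworm"):

git clone https://github.com/trapexit/mergerfs.git
cd mergerfs
git checkout nfsdebug
make deb                          # or: make && sudo make install
sudo dpkg -i ../mergerfs_*.deb    # exact package path/name may differ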