mergerfs
mergerfs copied to clipboard
Mergerfs mount not responsive intermittently
issue description
- I am having an intermittent issue where all attempts to read/access the /data/media/extd1 mountpoint hang. This includes running command line tools on the host like ls and df which try to read the mount and hang.
- This server runs several docker containers including plex which read/write to this mount. When this hangs they can't read the mount either.
- Mergerfs is installed natively on the host.
- extd1 contains media files only, no docker or application configuration files or databases.
- While the issue is occuring, the underlying disks (encd1, encd2) can still be read
- /dev/sdc,sdd are in an external usb enclosure
- The issue seemingly happens randomly and have not been able to reproduce it. Nothing obvious in dmesg indicating what the issue is.
- Rebooting the system resolves the issue and waiting an amount of time it seems to correct itself.
- Seems to happen once a day, and there is usually moderate disk r/w activity; file downloads and media playback occurring when it happens
system info
Strace was run while mount was unresponsive, started the mergerfs trace and then a strace on 'df':
- app.strace.txt
- mergerfs.strace.txt (this is quite large was not sure when to trim)
$ sudo mergerfs --version
mergerfs v2.40.2
$ uname -a
Linux nuc 6.5.0-41-generic #41~22.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 3 11:32:55 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
$ df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 3.1G 3.2M 3.1G 1% /run
/dev/nvme0n1p2 91G 29G 58G 34% /
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
efivarfs 192K 118K 70K 63% /sys/firmware/efi/efivars
encd1:encd2:ssd4tb 33T 17T 17T 50% /data/media/extd1
/dev/sdb 916G 2.8M 870G 1% /data/unsorted
/dev/nvme0n1p1 511M 6.1M 505M 2% /boot/efi
/dev/nvme0n1p3 3.6T 397G 3.2T 11% /data/media/ssd4tb
/dev/sdc 15T 15T 474G 97% /data/media/encd1
/dev/sdd 15T 1.8T 13T 13% /data/media/encd2
tmpfs 3.1G 72K 3.1G 1% /run/user/127
tmpfs 3.1G 60K 3.1G 1% /run/user/1000
$ ls -la /data/media/
total 16
drwxrwxr-x 8 nuc nuc 4096 Jun 14 22:42 .
drwxr-xr-x 4 nuc nuc 4096 Jan 4 19:19 ..
drwxrwxr-x 5 nuc nuc 59 Jan 12 15:11 encd1
drwxrwxr-x 5 nuc nuc 46 Jun 25 09:13 encd2
drwxrwxr-x 5 nuc nuc 59 Jan 12 15:11 extd1
drwxrwxr-x 2 nuc nuc 4096 Jan 3 20:30 samples
drwxr-xr-x 6 nuc nuc 79 Jun 18 21:32 ssd4tb
drwxr-xr-x 3 nuc nuc 4096 Jan 4 19:39 unsorted
$ cat /etc/fstab
UUID=d39c17db-0794-4efd-aebf-215dda148e4f / ext4 errors=remount-ro 0 1
UUID=0139-3011 /boot/efi vfat umask=0077 0 1
/swapfile none swap sw 0 0
UUID=d239f2c1-8e04-46f3-987d-41469ad23e47 /data/media/encd1 xfs defaults,nofail 0 0
UUID=dd6ba9fc-b9aa-4136-94d9-9595ea5216e7 /data/unsorted ext4 defaults,nofail 0 0
UUID=42cdb2ca-50a3-49aa-8c70-83b52399af4f /data/media/ssd4tb xfs defaults 0 0
UUID=d673b99a-6efe-48dd-982d-77be75f9e512 /data/media/encd2 xfs defaults,nofail 0 0
/data/media/encd1:/data/media/encd2:/data/media/ssd4tb /data/media/extd1 mergerfs cache.files=partial,dropcacheonclose=true,category.create=mfs,allow_other 0 0
$ lsblk -f
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
loop0 0 100% /snap/bare/5
loop1 0 100% /snap/canonical-livepatch/282
loop2 0 100% /snap/core18/2823
loop3 0 100% /snap/core18/2829
loop4 0 100% /snap/core20/2264
loop5 0 100% /snap/core20/2318
loop6 0 100% /snap/core22/1122
loop7 0 100% /snap/core22/1380
loop8 squashfs 4.0 0 100% /snap/gnome-3-38-2004/140
loop9 squashfs 4.0 0 100% /snap/gnome-3-38-2004/143
loop10 squashfs 4.0 0 100% /snap/gnome-42-2204/172
loop11 squashfs 4.0 0 100% /snap/gnome-42-2204/176
loop12 squashfs 4.0 0 100% /snap/gtk-common-themes/1535
loop13 squashfs 4.0 0 100% /snap/rclone/495
loop14 squashfs 4.0 0 100% /snap/rclone/504
loop15 squashfs 4.0 0 100% /snap/snap-store/1113
loop16 squashfs 4.0 0 100% /snap/snap-store/959
loop17 squashfs 4.0 0 100% /snap/snapd/21465
loop18 squashfs 4.0 0 100% /snap/snapd/21759
loop19 squashfs 4.0 0 100% /snap/snapd-desktop-integration/83
loop20 squashfs 4.0 0 100% /snap/snapd-desktop-integration/157
sda xfs 8f58a793-94d2-4daf-a95a-db407648fef2
sdb ext4 1.0 dd6ba9fc-b9aa-4136-94d9-9595ea5216e7 869.2G 0% /data/unsorted
sdc xfs d239f2c1-8e04-46f3-987d-41469ad23e47 473.8G 97% /data/media/encd1
sdd xfs d673b99a-6efe-48dd-982d-77be75f9e512 12.8T 12% /data/media/encd2
nvme0n1
├─nvme0n1p1 vfat FAT32 0139-3011 504.9M 1% /boot/efi
├─nvme0n1p2 ext4 1.0 d39c17db-0794-4efd-aebf-215dda148e4f 57.4G 31% /
└─nvme0n1p3 xfs 42cdb2ca-50a3-49aa-8c70-83b52399af4f 3.2T 11% /data/media/ssd4tb
$ sudo lshw -class disk -class storage
*-nvme
description: NVMe device
product: CT4000P3SSD8
vendor: Micron/Crucial Technology
physical id: 0
bus info: pci@0000:01:00.0
logical name: /dev/nvme0
version: P9CR30A
serial: 2308E6B1EC7D
width: 64 bits
clock: 33MHz
capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:c0a9c0a92308E6B1EC7D CT4000P3SSD8 state=live
resources: irq:16 memory:6a400000-6a403fff
*-namespace:0
description: NVMe disk
physical id: 0
logical name: hwmon1
*-namespace:1
description: NVMe disk
physical id: 2
logical name: /dev/ng0n1
*-namespace:2
description: NVMe disk
physical id: 1
bus info: nvme@0:1
logical name: /dev/nvme0n1
size: 3726GiB (4TB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: guid=0c0465f1-c24b-4482-83d3-3613120847c6 logicalsectorsize=512 sectorsize=512 wwid=nvme.c0a9-323330384536423145433744-435434303030503353534438-00000001
*-usb
description: Mass storage device
product: RTL9210B-CG
vendor: Realtek
physical id: 2
bus info: usb@2:2
logical name: scsi3
version: 20.01
serial: 012345679039
capabilities: usb-3.20 scsi
configuration: driver=uas maxpower=896mA speed=10000Mbit/s
*-usb
description: Mass storage device
product: USB to ATA/ATAPI Bridge
vendor: JMicron
physical id: 3
bus info: usb@4:3
logical name: scsi2
version: 2.05
serial: 152D00539000
capabilities: usb-3.00 scsi emulated scsi-host
configuration: driver=usb-storage maxpower=8mA speed=5000Mbit/s
*-disk:0
description: SCSI Disk
product: M001G-2KK103
vendor: ST16000N
physical id: 0.0.0
bus info: scsi@2:0.0.0
logical name: /dev/sdc
logical name: /data/media/encd1
version: 0125
size: 14TiB (16TB)
configuration: ansiversion=6 logicalsectorsize=512 mount.fstype=xfs mount.options=rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota sectorsize=4096 state=mounted
*-disk:1
description: SCSI Disk
product: M000J-2TW103
vendor: ST16000N
physical id: 0.0.1
bus info: scsi@2:0.0.1
logical name: /dev/sdd
logical name: /data/media/encd2
version: 0125
size: 14TiB (16TB)
configuration: ansiversion=6 logicalsectorsize=512 mount.fstype=xfs mount.options=rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota sectorsize=4096 state=mounted
*-sata
description: SATA controller
product: Intel Corporation
vendor: Intel Corporation
physical id: 17
bus info: pci@0000:00:17.0
logical name: scsi1
version: 20
width: 32 bits
clock: 66MHz
capabilities: sata msi pm ahci_1.0 bus_master cap_list emulated
configuration: driver=ahci latency=0
resources: irq:150 memory:6a500000-6a501fff memory:6a503000-6a5030ff ioport:3090(size=8) ioport:3080(size=4) ioport:3060(size=32) memory:6a502000-6a5027ff
*-disk
description: ATA Disk
product: NT-2TB 2242
physical id: 0.0.0
bus info: scsi@1:0.0.0
logical name: /dev/sda
version: 109
serial: 0006946000117
size: 1907GiB (2048GB)
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
I need a trace from an app experiencing the issue. Your trace is of df and it looks to exit fine and the query of mergerfs took 0.0096 seconds to run.
And what is going on on the system? Have you run free? iotop? What's the loadavg? Etc.
Thanks for the reply. I think the strace shows the df command not completing for ~58 seconds, no activity from 19:47:00 - 19:47:58
I thought df/ls on the host would be a good test rather than the actual apps (which are dockerised).
I can give the output of those other commands next time it happens.
Using basic tooling is what I suggest using in the docs but a filesystem is a complex beast with many types of functions so getting more than one function testing is useful in debugging. statfs is different from stat is different from file creation, etc. That's why simple tools like "touch", "ls", "rm", "ln", etc. are useful for debugging. They don't do a whole lot. In any case I didn't notice that as later in the log it shows interactions with mergerfs working fine so it isn't entirely held up.
There mergerfs trace shows nothing but a gap in time around then which df had issued a stat request. Without logs upto the event or external information there isn't much there to comment on.
After that point things look pretty normal. stat'ing files mostly.
Do you have any advice on how I can capture traces for this given it’s intermittent?
I was thinking about setting up a rolling trace buffer on the key processes and using ls/df as the test, when it takes >20 seconds to complete to write out the trace buffer to record the trace before the issue starts. Would this work and help to diagnose the issue?
Having traces will only show who is being slow. It won't necessarily diagnose anything. There is very little going on in mergerfs that can cause slowdown. It's primarily just a proxy. Mostly, if there is slowdown it's because the underlying devices are overwhelmed or there is buffer bloat and swapping going on. This is why general system measurements can be important here. Load avg, IO waits, etc.
I'm not aware of any built in sampling behavior in strace. The problem is if you have some external loop checking for delay and it is buffer bloat related or similar then by the time you trace it could be done flushing.
The easiest thing would be to disable all the caches and ensure flush on close is enabled and compare behavior.
Hi @trapexit, I am not sure if my comment is in the right place here, but I'll try anyway. I was experiencing a similar issue just now. In my case I am merging a remote directory on onedrive (via rclone) and a local directory together. Also in my case, mergerfs was not at fault - clearly the remote directory on onedrive was.
But I was thinking if there was a feature to configure timeouts so that mergerfs does not wait for an answer from the underlying file system if a certain timeout was reached? Of course, I will check that for rclone too. But I also had similar issues in the past where the SAN storage I merged was experiencing too high load so that read requests were running in a timeout. That's where I can't rely on rclone either.
tl;dr: is there a way to configure a timeout for reads/writes?
I would prefer moving such discussion to the discussions page rather than overloading an existing ticket.
But to respond: there is no such thing as a timeout in filesystem APIs. They block. They can block forever and that is "valid." You can't tell the kernel "try to write this data but if it takes more than X nanoseconds abort." That's simply not a thing. The best mergerfs could do is, under the assumption that the call will return eventually, time some specific request to the kernel and if it takes a long time put the branch in "timeout" or in a "bad" state which would result it in it being ignored going forward. But it would probably need to be "takes X nanoseconds + returns some error condition" which might catch certain network filesystem issues. But then the question is... then what? Does it stay in the bad state till someone manually changes it? Is the expectation that mergerfs would try to watch for it to "get better"? This is a pretty non-trivial issue.
I would prefer moving such discussion to the discussions page
Alright, thanks for your response though!
This is a pretty non-trivial issue.
Understood, thank you for the explanation. I guess, I'll have to deal with it differently (maybe have some script periodically check the response times and kill the rclone connection, as the two only parameters I was able to find, --contimeout and --timeout, don't seem to fix my issue with rclone).