MergerFS becomes very slow when adding an sshfs mount.

Open garrettdowd opened this issue 4 years ago • 25 comments

Describe the bug

Listing a large mergerfs directory (~3000 combined files) takes >90 seconds, while individually listing the constituent directories takes <1 second. The filesystems involved are:

  • rclone mount (crypt gdrive) (50 files)
  • sshfs (0 files)
  • unraid local user share (3000 files)

The problem only occurs when all three are combined by mergerfs. Merging the rclone mount and the unraid share (no sshfs) returns a listing in <2 seconds. I do not understand this behavior since the sshfs mount does not even have any files in this directory. This problem generalizes to other similar large directories.

To Reproduce

Settings are given below. Remote filesystems are mounted and then the mergerfs mount is started. ls <dir> is attempted.

Expected behavior

Much faster performance when all three filesystems are merged. Currently it is unusable.

System information:

  • OS, kernel version: 5.10.1-Unraid
  • mergerfs version: 2.32.4
  • mergerfs settings
  • List of drives, filesystems, & sizes:
sshfs -o IdentityFile=**** -o allow_other ***@**********/files/ /mnt/remotes/sshfs/

 rclone mount \
   --allow-other \
   --buffer-size=128M \
   --daemon \
   --daemon-timeout=5m \
   --dir-cache-time=72h \
   --drive-chunk-size=128M \
   --log-level=INFO \
   --poll-interval=5m \
   --timeout=30m \
   --vfs-cache-max-age=6h \
   --vfs-cache-max-size=100G \
   --vfs-cache-mode=minimal \
   --vfs-read-chunk-size=128M \
   --vfs-read-chunk-size-limit off \
   media: /mnt/remotes/media/        (crypt google drive)

mergerfs /mnt/remotes/sshfs/:/mnt/remotes/media/:/mnt/user/ /mnt/mergerfs/media/ -o rw,async_read=false,use_ino,allow_other,func.getattr=newest,category.action=all,category.create=ff,cache.files=partial,dropcacheonclose=true
  • A strace of the application having a problem:
    • strace -fvTtt -s 256 -o /tmp/app.strace.txt <cmd>
      strace: error while loading shared libraries: libdw.so.1: cannot open shared object file: No such file or directory

Unfortunately I was unable to run an strace. Not sure how to resolve this error on unraid.

garrettdowd avatar Mar 18 '21 01:03 garrettdowd

I can't speak to the issue with strace on UnRAID. Should talk to the vendor.

As for performance, you have to ensure you're comparing like for like. The kernel will cache attributes, entries, and dirents. You should purge caches between any tests.

As for the performance itself: mergerfs is literally just looping over each path and doing exactly what ls does. The combined time should be in the ballpark of running ls patha pathb pathc.
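
For example, a rough like-for-like test using the paths from this thread (dropping caches requires root):

 sync; echo 3 > /proc/sys/vm/drop_caches                      # purge page/dentry/inode caches
 time ls /mnt/remotes/sshfs/ /mnt/remotes/media/ /mnt/user/   # the branches directly
 sync; echo 3 > /proc/sys/vm/drop_caches
 time ls /mnt/mergerfs/media/                                 # through mergerfs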

You can also use cache.readdir to cache mergerfs' readdir calls.
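
For example, adding something like this to the existing mount options (the values here are only illustrative):

 -o cache.readdir=true,cache.entry=30,cache.attr=30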

There isn't really much that can be done to improve performance outside of doing the readdir requests in parallel (and the perceptible improvement from that would depend greatly on the situation). It sounds like there is something else going on here. You're using 2 network filesystems which have high latency and could be blocking for some reason. An strace is really the only way to tell.

trapexit avatar Mar 18 '21 02:03 trapexit

BTW... I just tested a local + sshfs setup and it seems absolutely fine including after purging caches.

trapexit avatar Mar 18 '21 02:03 trapexit

Noted about the caches!

Yeah, I will have to look more closely into the dependency problem with strace on unraid.

BTW... I just tested a local + sshfs setup and it seems absolutely fine including after purging caches.

This is also true for my setup. It seems the extreme latency only happens when it is rclone + sshfs + local.

I am spinning up an Ubuntu container (on a different host) to see if I have similar problems (and to be able to strace). Although this might cause even more problems because then all three filesystems will be remote (NFS + rclone + sshfs). I guess we will see.

garrettdowd avatar Mar 18 '21 15:03 garrettdowd

That is curious. I'll test later with rclone in the mix.

trapexit avatar Mar 18 '21 18:03 trapexit

My Ubuntu container (on a different host) is showing similar behavior. The setup in the container is:

  • NFS - unraid share (3000 dirs)
  • Rclone - google drive crypt mount (50 dirs)
  • SSHFS - remote server (0 dirs)

Merging just the NFS and Rclone folders does not produce noticeable latency when listing the directory. Merging all three folders has extreme latency.

mergerfs /mnt/remotes/sshfs/****/:/mnt/remotes/nfs/unraid/:/mnt/remotes/rclone/media /mnt/mergerfs/test -o rw,use_ino,allow_other,func.getattr=newest,category.action=all,category.create=ff

Luckily strace does not have any problems on the Ubuntu container. However, interestingly, when running strace -fvTtt -s 256 -o /tmp/app.strace.txt ls DIR the directory listing completes extremely quickly.

I would attach the full strace to this comment; however, strace lists all of the directory names, which I would prefer not to post here. I attached the strace after removing the parts with directory names.

EDIT: removed irrelevant strace

garrettdowd avatar Mar 19 '21 15:03 garrettdowd

EDIT: I posted the original comment below too quickly. It is a lie. My tests were not using the correct sshfs mount location.

Details on the Ubuntu container (LXC on a Debian host):

mergerfs version: 2.21.0 (I didn't check this either and just apt installed mergerfs)
FUSE library version: 2.9.7
fusermount version: 2.9.7
using FUSE kernel interface version 7.19

Furthermore, after proper tests, cache.readdir/attr/entry do not seem to affect performance on either the Ubuntu container or the unraid machine.

Back to square one

All a lie

So based on your earlier comment

You can also use cache.readdir to cache mergerfs' readdir calls.

I tried adding cache.readdir=true,cache.entry=30,cache.attr=30, which resulted in a fuse: option unknown error on my Ubuntu test container.

Because the mergerfs docs say that cache.readdir has to be supported by the kernel, I went to the Ubuntu 18.04 fuse documentation and found that cache.readdir, cache.entry, and cache.attr do not exist (because they appear to be Debian fuse options).

This was surprising since I was unaware that these mergerfs options were passed straight through to fuse.

HOWEVER, in Ubuntu there are the fuse options entry_timeout and attr_timeout. Setting these to reasonable values ~10-30 drastically improves performance.

Instead of 90+ seconds, it cuts the initial ls down to <5 seconds (which is about the time it takes to ls /dir1 /dir2 /dir3), and then obviously it can list the dir in less than a second once cached.
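
For reference, this is roughly what I added to the mergerfs mount options (the values are just what I tried, not a recommendation):

 -o entry_timeout=30,attr_timeout=30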

I don't understand why these options improved performance so much in this case, so hopefully @trapexit could comment on it?

garrettdowd avatar Mar 22 '21 01:03 garrettdowd

You are using an old version of mergerfs then. Those aren't fuse options. They are mergerfs ones as found in the man page and web docs. If they don't work then you're using an old version.

trapexit avatar Mar 22 '21 01:03 trapexit

They improve perf so much on subsequent calls because it's caching the relevant info, as described in the docs. That doesn't explain the initial slowness.

trapexit avatar Mar 22 '21 01:03 trapexit

So yes, I was running an old mergerfs version on the Ubuntu container. I updated the original comment. The initial slowness is still a problem that breaks (times out) software that needs to access the large directories.

garrettdowd avatar Mar 22 '21 01:03 garrettdowd

Interestingly enough, the cache settings have no effect on Unraid. Repeated ls of the same directory takes the same amount of time. No errors are given when mounting, and I am running the latest version there (built using trapexit/mergerfs-static-build).

garrettdowd avatar Mar 22 '21 01:03 garrettdowd

What happens if the sshfs mount has (before merging with mergerfs) at least one file there (let's say exactly 1 file)? Is it still that slow?

dumblob avatar Oct 10 '21 23:10 dumblob

I experience a bit of a slowdown after adding an sshfs mount. Before, it took a few seconds to list any directory only when a drive was idle; the rest of the time it was faster. But after adding the sshfs mount I often experience a slowdown, though nowhere near 90 s, just 1-5 seconds.

$ sudo cat /etc/fstab
user@host_ip:/mnt/hdd  /mnt/hdd fuse.sshfs IdentityFile=/home/user/.ssh/id_rsa,uid=1000,gid=1000,allow_other,default_permissions,_netdev,follow_symlinks,ServerAliveInterval=45,ServerAliveCountMax=2,reconnect,noatime,auto 0
/mnt/hdd*  /mnt/storage  fuse.mergerfs  allow_other,use_ino,cache.files=off,dropcacheonclose=true,ignorepponrename=true,func.mkdir=epall,x-gvfs-show   0       0

ghost avatar Oct 13 '22 16:10 ghost

Network filesystems are inherently higher latency, especially ones like sshfs. When making decisions across branches there is little that can be done to deal with this latency. mergerfs is a userland process... it interacts with these filesystems exactly the same way as any other piece of software. In some cases I could add concurrency to a behavior which may help but introduces other issues: higher memory and CPU usage, more complex code, etc. And literally every single filesystem function has different behaviors, costs, caching, etc. so every single behavior has to be looked at individually.

trapexit avatar Oct 13 '22 16:10 trapexit

In some cases I could add concurrency to a behavior which may help but introduces other issues.

Promise-based concurrency, e.g. for listing, wouldn't disrupt the current code flow (actually it should be literally only a few lines of code) and might actually "solve" the issues this thread seems to be primarily about. A promise for each separate "merge point" could be enough (e.g. 3 promises for /mnt/remotes/sshfs/:/mnt/remotes/media/:/mnt/user/ would mean waiting only for the slowest of the three); see the sketch below.
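
Roughly the effect I have in mind, sketched in plain shell with the paths from this thread (the total time is that of the slowest branch, not the sum):

 time ( ls /mnt/remotes/sshfs/ & ls /mnt/remotes/media/ & ls /mnt/user/ & wait )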

dumblob avatar Oct 13 '22 17:10 dumblob

Whether it's promises or any other orchestration mechanism really isn't the issue. It's the additional overhead. Many systems already have high IO load due to the increased back and forth necessary from userspace and across branches. Having additional threads, possibly parallel IO requests across numerous devices, the overhead of IO bouncing between threads, additional memory needed to store results (not much for most calls, but it could be for readdir), etc... it adds up.

Concurrent versions of every call would need to be written, which is a lot of changes. One would need to identify the particular functions to look at first.

trapexit avatar Oct 13 '22 17:10 trapexit

Ok, I have briefly looked into it and you are right that threading-based promises would be non-negligible work.

Do we actually have some profiling measurements? Is it more about opendir() or readdir() or closedir() or stat() without leveraging additional information provided e.g. by aio_readdirx()?

I found a nice overview of several tricks leading to vastly increased directory listing performance (incl. on networked filesystems): http://blog.schmorp.de/2015-07-12-how-to-scan-directories-fast-the-tricks-of-aio_scandir.html .

Could we measure where the slowness exactly comes from and then maybe employ (all of) the mentioned tricks?

It is worth noting that implementing the tricks should be much simpler than threading-based concurrency.

dumblob avatar Oct 13 '22 18:10 dumblob

I don't have anything with mergerfs directly but I have written things over the years to test perf for readdir in particular. I didn't find a big difference between "opendir, readdir, closedir" in a loop and doing the opendirs upfront and then doing all the readdirs sequentially or concurrently. But honestly it's not the easiest thing to test without a bunch of heterogeneous filesystems/devices. Using getdents on Linux helps for sure but nothing game changing. I already have a POSIX and a Linux version of readdir in mergerfs but found little difference in practice and never made it configurable.

As for that post and its tricks... I can't speak to how Perl implements that but readdir boils down to getdents and there is little else you can do there besides change buffer sizes vs the default glibc 32KB (IIRC). For very large dirs it matters but not so much for the average one. The inode thing is not relevant here unless it is readdirplus. And there is no such thing as a true async filesystem call, let alone lstat, except with io_uring, which could be used for some calls (not all fs calls are supported by it) but the code would have to be reworked to concurrent APIs and then io_uring used. And *at functions... yeah, they can be used and are in some places, but to truly take advantage of them I would need to keep a pool of files/dirs open and for certain workloads (which mergerfs really can't be tailored for generally) it would just be overhead with no benefit.

There is also the ability to use statx instead of lstat and tell it that it is OK to return semi-stale data. I tried to see if that mattered for NFS but the results seemed identical to lstat. It is still on my todo list but it is unclear whether it will matter.

trapexit avatar Oct 13 '22 19:10 trapexit

but readdir boils down to getdents and there is little else you can do there besides change buffer sizes vs the default glibc 32KB (IIRC)

Just to clarify - do you mean "readahead" buffering (i.e. using most of the tricks incl. heuristics, inode tricks etc. - i.e. guesstimating what request/call comes next) or just caching of already read information?

dumblob avatar Oct 13 '22 22:10 dumblob

No

POSIX defines the readdir function. It returns a pointer to a static buffer. You don't control how much data is returned from the kernel.

Linux doesn't have a readdir syscall. It has getdents. https://man7.org/linux/man-pages/man2/getdents.2.html You give it a buffer and a size and the kernel can fill it with up to that many bytes of dentries. glibc uses a buffer of 32KB. If you increase that value you can get better perf by reducing the number of round trips to the kernel for large directories.
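
You can see it for yourself with strace (the path here is just an example of a large directory):

 strace -e trace=getdents64 ls /path/to/large/dir > /dev/null
 # each getdents64 line shows the buffer size passed in (32768 via glibc) and how many bytes of dentries came back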

It has nothing at all to do with readahead or any of those supposed tricks (most of which aren't relevant here).

trapexit avatar Oct 13 '22 23:10 trapexit

There is a bottleneck with FUSE with readdir... it too does not allow for dynamic buffer sizes for client readdir calls. It transfers 4k at a time. Kernel folks have been open to increasing it but I've never pushed the issue.

trapexit avatar Oct 13 '22 23:10 trapexit

Well, 4k at a time might be a bit too low but I do not think it is one of the major sources of the delay. I also have some doubts the kernel<->userspace switching is the major contribution to the slowness in this case. So I would not focus on these that much.

What I had in mind is that once someone calls readdir() on mergerfs, mergerfs could on its own prefetch all of the inodes in the chain and on subsequent readdir() calls only return what is in the cache, until all of the cache has been read or until closedir() gets called. And this prefetching would leverage all of the tricks I linked (the idea is that whenever listing a dir, 99.9% of apps need to distinguish a file from a directory).

dumblob avatar Oct 14 '22 09:10 dumblob

The 4K limit absolutely is a contributor on any larger system. Just do some basic getdents tests with 4K vs 32K. For larger directories it adds quite a bit of overhead. For a dir with 32K worth of dents that is 8x the calls to the kernel. This is the same issue, just in a different spot.

I also have some doubts the kernel<->userspace switching is the major contribution to the slowness in this case.

It is a contributor to 100% of all FUSE interactions. The client app makes a request (context switch and IO wait), the kernel dispatches it to mergerfs (context switch), mergerfs receives the request and has to perform said action across N branches (at least N context switches and IO waits), mergerfs sends the response back to the kernel (context switch), and the kernel dispatches it back to the client (context switch).

What I had in mind is that once someone calls readdir() on mergerfs, mergerfs could on its own prefetch all of the inodes in the chain and on subsequent readdir() calls only return what is in the cache, until all of the cache has been read or until closedir() gets called. And this prefetching would leverage all of the tricks I linked (the idea is that whenever listing a dir, 99.9% of apps need to distinguish a file from a directory).

It does that today. Go read the code. On readdir it reads everything and dishes it out as requested. That's because doing actual streaming would be expensive to keep track of.

Those "tricks" really aren't. They don't apply here. The only one that would have any impact is using *at calls which I've described why that isn't a simple win.

I will go through them:

  • 0 (async in general): There is no such thing as async readdir. Not yet anyway. There was an attempt to get an io_uring getdents op but it doesn't appear to have been upstreamed. Even if it were, few would have a system that supported it so a fallback would be needed.
  • 1a) This is just normal readdir. They are comparing it to Perl's API. And I can't sort anything because that's not how things work. I'm not stat'ing files in readdir. Only readdirplus would be, and that's not being used here.
  • 1b) Again... this is normal readdir. Nothing special here.
  • 1c) I'm not sorting anything. Userland software is. All I do is pull data, dedup entries, and ship them to the kernel.
  • 2 (link count): Again, this has nothing at all to do with what I'm doing. That is talking about recursive scans. I don't do that. I do a straight translation of readdir.
  • 2 (async lstat): The only async lstat is through io_uring. They aren't using that. They are talking about their aio library, which likely uses threads. And unless the call is a readdirplus call there are no stats going on here.
  • 3 (stat entry/.): Not relevant here.
  • 4 (*at calls): Not relevant here. Not without keeping a pool of open FDs for every directory we see, or at least the root of each branch, which I've mentioned before has its own complications and tradeoffs. There is a real cost to the increased path resolving but it is a complicated tradeoff to make in a generic system that can't be designed for specific workflows.

Nothing on that page is relevant here.

trapexit avatar Oct 14 '22 12:10 trapexit

I see what you mean. I just cannot believe this all adds up to the delays observed. Do we have an easy way to profile everything happening in mergerfs (yeah, good old flame graphs) when I run ls on mergerfs in different scenarios:

  1. mergerfs only with one HDD/SSD
  2. mergerfs only with one sshfs
  3. mergerfs over two: one HDD/SSD and one sshfs

dumblob avatar Oct 14 '22 22:10 dumblob

The code is very simple. Pick a function and look for fuse_FUNCTIONNAME.cpp.

An ls is a readdir loop and stats. Almost identical to simply running ls across the paths in question on the underlying branches.

Of course you can use traditional profilers but to what end? The issue isn't compute. It's not going to show you anything interesting. mergerfs isn't compute heavy. It's all IO and latency.

trapexit avatar Oct 14 '22 23:10 trapexit

You can easily see the calls mergerfs makes by using strace as described in the docs.
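
For example, something along these lines while reproducing the slow ls (attaching to the running mergerfs process):

 strace -fvTtt -s 256 -p "$(pidof mergerfs)" -o /tmp/mergerfs.strace.txt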

trapexit avatar Oct 14 '22 23:10 trapexit