Slow vfs for directories with a large number of files
Description
Gvisor (with vfs2) is very slow when accessing directories with a huge number (50000) of files on an external mount. Operations inside such a directory can take hundreds of milliseconds. Even a simple `getdents64` syscall can take a very long time, because gvisor performs a `stat` on every file in the directory. Accessing a non-existent file leads to similar behaviour. The performance difference compared to native access is enormous: gvisor is 100x slower in these cases.
Steps to reproduce
Slow getdents64 performance
- Create an external mount containing a directory with a huge number (50000) of files.
- Do a getdents64 syscall from within gvisor
- See that gvisor performs a `stat` syscall for every file in the directory
A similar issue occurs when trying to access a non-existent file in the directory:
- Create an external mount containing a directory with a huge number (50000) of files.
- Try to access a non-existent file in the directory
  `time cat DOES_NOT_EXIST`
- See that a simple cat takes hundreds of milliseconds before failing with ENOENT
runsc version
release-20210906.0
docker version (if using docker)
20.10.5
uname
5.10.0-8 (debian bullseye)
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
No response
This is a pretty difficult problem to solve. gVisor virtualizes `dev` and `ino` to provide a consistent view of the filesystem and to hide the different mappings that may be happening under the covers by the gofer. `dev` is generated per filesystem (see `VirtualFilesystem.GetAnonBlockDevMinor()`), while `ino` is generated per dentry based on the file/dir's original `{dev, ino}` (see `gofer.filesystem.inoFromQIDPath()`). `getdents(2)` only returns `ino`, requiring the extra `stat(2)` call to fetch the `dev` number for the file so that the virtualized `ino` can be generated.
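For illustration, here is a minimal sketch of this kind of per-dentry virtualization (made-up code, not gVisor's; `inoKey` and its fields are hypothetical). The key must include the device numbers because a host `ino` is only unique within a single device, which is exactly why each dirent currently costs an extra `stat(2)`:

```go
package main

import (
	"fmt"
	"sync"
)

// inoKey identifies a backing file by its host identity.
type inoKey struct {
	devMajor uint32
	devMinor uint32
	ino      uint64
}

type filesystem struct {
	mu       sync.Mutex
	lastIno  uint64
	inoByKey map[inoKey]uint64
}

// inoFromKey returns a stable virtualized inode number for key,
// allocating a fresh one on first use.
func (fs *filesystem) inoFromKey(key inoKey) uint64 {
	fs.mu.Lock()
	defer fs.mu.Unlock()
	if ino, ok := fs.inoByKey[key]; ok {
		return ino
	}
	fs.lastIno++
	fs.inoByKey[key] = fs.lastIno
	return fs.lastIno
}

func main() {
	fs := &filesystem{inoByKey: make(map[inoKey]uint64)}
	// The same host ino on two different devices must map to two
	// distinct virtual inos (cf. the bind-mount collision discussed
	// later in this thread).
	a := fs.inoFromKey(inoKey{devMajor: 8, devMinor: 1, ino: 21})
	b := fs.inoFromKey(inoKey{devMajor: 8, devMinor: 2, ino: 21})
	fmt.Println(a, b) // 1 2
}
```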
Having said that, I wonder if it would be possible to expose the original file `dev` and `ino` without virtualizing them in the sentry. They would not match the filesystem's `dev`, but I'm not sure anyone relies on that. So here's an idea to explore:
cc @ayushr2
In LISAFS, `Getdents64()` returns a slice that contains `Ino`, `DevMinor`, and `DevMajor` so that the Sentry can virtualize `ino` (see `inoFromKey()`). However, we could push this responsibility to the gofer. In that case, `Getdents64()` doesn't need to send `DevMinor` and `DevMajor` to the sentry anymore, and fsgofer can skip the `stat(2)` call altogether. Note that fsgofer doesn't need to virtualize `ino` since it can pass the `ino` straight from the host*. Gofers that are not using the local host as a backing filesystem can be responsible for virtualizing `ino` and `dev` based on where the file is coming from (IOW, move the `ino` map from the sentry to the gofer). Another advantage is that in the `fsgofer` case, we don't need to keep the `gofer.filesystem.inoByKey` mappings anymore.
* That assumes it won't break anyone if `dev` for a given file doesn't match the filesystem where the file came from.
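For context on why the gofer could skip `stat(2)` entirely: the host's `getdents64(2)` already returns `d_ino` for every entry. Below is a minimal standalone sketch (not fsgofer code, and simplified to a single `Getdents` call rather than a full read loop) that pulls each inode number straight out of the raw dirent buffer:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.Open("/tmp") // any large directory
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := make([]byte, 1<<16)
	n, err := unix.Getdents(int(f.Fd()), buf)
	if err != nil {
		panic(err)
	}
	// Each record is a struct linux_dirent64:
	//   d_ino (8 bytes) | d_off (8) | d_reclen (2) | d_type (1) | d_name...
	for off := 0; off < n; {
		ino := binary.LittleEndian.Uint64(buf[off:])
		reclen := int(binary.LittleEndian.Uint16(buf[off+16:]))
		name := buf[off+19 : off+reclen]
		if i := bytes.IndexByte(name, 0); i >= 0 {
			name = name[:i] // d_name is NUL-terminated and padded
		}
		fmt.Printf("%d\t%s\n", ino, name) // d_ino with zero stat(2) calls
		off += reclen
	}
}
```

A real implementation would loop until `Getdents` returns 0; the point is only that every `d_ino` arrives for free with the directory listing.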
Your analysis is correct there. I did some benchmarking in https://github.com/google/gvisor/issues/6578#issuecomment-919736649 and lisafs does improve things. But getting rid of the stat calls entirely would surely improve things further.
> Note that fsgofer doesn't need to virtualize `ino` since it can pass the `ino` straight from the host
^ This can create inode number conflicts: what if the gofer is serving a bind mount which has a mount point at `foo/bar`, and the application runs `ls foo`? `foo/baz` can have inode number 21, and the mount point (since it is backed by a different device) can also have inode number 21.
I had thought about this before, but it quickly gets complicated. Let's say we try to scan the bind mount for mount points on startup (the common case is no mount points), but that can change because bind mounts have a remote-revalidation cache policy, so the underlying system state can change under our feet at any time.
It is annoying that we have to take such a massive hit for such an uncommon corner case. Any ideas?
> This can create inode number conflicts: what if the gofer is serving a bind mount which has a mount point at `foo/bar`, and the application runs `ls foo`? `foo/baz` can have inode number 21, and the mount point (since it is backed by a different device) can also have inode number 21.
The key is to also expose the `dev` number from the host. `ino` from different devices can be duplicated; the requirement is that `{dev, ino}` pairs are unique. The caveat is that the `dev` from the file will not match the `dev` of the filesystem (in the Sentry) that is serving the file.
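To make the `{dev, ino}` argument concrete, here is a small sketch (the `foo/baz` and `foo/bar` paths are hypothetical, matching the example above) showing that two entries which collide on `ino` across a bind-mount boundary still differ once the host `dev` is included:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// id returns the host {dev, ino} pair for path.
func id(path string) (uint64, uint64) {
	var st unix.Stat_t
	if err := unix.Stat(path, &st); err != nil {
		panic(err)
	}
	return st.Dev, st.Ino
}

func main() {
	// foo/baz and the mount point foo/bar may share ino 21, but
	// their device numbers differ, so the pairs remain unique.
	d1, i1 := id("foo/baz")
	d2, i2 := id("foo/bar")
	fmt.Printf("{%d, %d} vs {%d, %d}\n", d1, i1, d2, i2)
}
```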
> The key is to also expose the `dev` number from the host.
IIUC you are suggesting that, instead of returning our sentry-only virtualized device ID on stat (code pointer), we instead expose the actual device ID from the host. How do we deal with conflicts with sentry-virtualized device IDs?
Ah, also, I recall another issue I had thought of: we cannot use the host inode number for files in the sentry, because then we have no way of generating unique inode numbers for synthetic files in the same gofer mount. This was the dead end I hit.
> we have no way of generating unique inode numbers for synthetic files in the same gofer mount.
Surely we can address this by using a different device ID for these special files, but then it gives the impression that these special files are mount points.
The whole premise of the idea is that we can freely return different device IDs for files within the same filesystem without breaking applications. If that holds true (I suspect that it won't, but I can't think of a case where it breaks), then we can play with device IDs and `ino` to reduce the chances of a conflict.
Regarding "mapping host device numbers to sentry-synthetic device numbers", note that we already do something similar in overlayfs, and this is based on Linux's overlayfs behavior.
According to https://lwn.net/Articles/866582/, there are issues related to this:
- Some applications may assume that each filesystem is associated with only a single device number, because historically this has been true; for example, `/proc/[pid]/mountinfo` reports a single device number per filesystem, which is "the value of st_dev for files on this filesystem" (proc(5)). My impression is that the fact that Linux overlayfs doesn't have this property (without xino, which is not enabled by default), in conjunction with the popularity of Docker, has been forcing applications away from this assumption anyway.
- The Linux kernel's NFS daemon also makes the same assumption. In NFS (both v3 and v4), file attributes contain a filesystem ID that affects protocol behavior, but no device number. Accordingly, Linux's NFS daemon discards per-file device numbers, resulting in inode number collisions for files on filesystems with multiple device numbers accessed over NFS. (This is the primary topic of the LWN article linked above, although it's focused on btrfs rather than overlayfs.) We don't have an NFS daemon, and if we did I think we'd require that it perform the mapping of (device number, inode number) to NFS file handle, constraining the performance impact of this limitation to the system that imposes it.
In summary, I think it's workable for the gofer client to maintain mappings of remote device numbers to sentry-synthetic device numbers, as well as a separate anonymous device number for synthetic mountpoints, and use remote inode numbers directly.
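A minimal sketch of what that could look like, with hypothetical names (`devMapper`, `sentryDev`): remote inode numbers pass through unchanged, remote device numbers are lazily assigned sentry-synthetic ones, and one extra anonymous device number is reserved for synthetic files such as mountpoints:

```go
package main

import (
	"fmt"
	"sync"
)

type devMapper struct {
	mu           sync.Mutex
	next         uint32            // last allocated synthetic device number
	byRemoteDev  map[uint64]uint32 // remote st_dev -> synthetic dev
	syntheticDev uint32            // reserved for synthetic files
}

func newDevMapper() *devMapper {
	m := &devMapper{byRemoteDev: make(map[uint64]uint32)}
	m.next++
	m.syntheticDev = m.next // reserve one up front for synthetic files
	return m
}

// sentryDev maps a remote st_dev to a stable sentry device number.
func (m *devMapper) sentryDev(remoteDev uint64) uint32 {
	m.mu.Lock()
	defer m.mu.Unlock()
	if d, ok := m.byRemoteDev[remoteDev]; ok {
		return d
	}
	m.next++
	m.byRemoteDev[remoteDev] = m.next
	return m.next
}

func main() {
	m := newDevMapper()
	// Remote inos would be used directly; only devs get remapped.
	fmt.Println(m.sentryDev(2049), m.sentryDev(2050), m.sentryDev(2049)) // 2 3 2
	fmt.Println("synthetic files use dev", m.syntheticDev)               // 1
}
```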
A friendly reminder that this issue had no activity for 120 days.
I recently implemented @nixprime's suggestion from above (use remote inode numbers directly & map remote device IDs) on my local machine and didn't see much performance gain on broader filesystem benchmarks. I believe this is because the following features are enabled by default now:
- rootfs overlay: this helps with newly created large directories in the container rootfs. The directory lives in tmpfs, which is really fast to query.
- directfs: we don't have to make multiple round trips to the gofer to fetch dirents.
But @nixprime's suggestion also somewhat complicates the S/R use case. As of right now, S/R takes responsibility for presenting the same inode/dev numbers before and after S/R for the same file. (Note, however, that even in the current form this is partially broken, because only inode numbers in the dentry cache are remembered and restored; inode numbers of dentries that are evicted before the "save" operation are not restored.)
On restore, the filesystem may have been migrated, and hence the underlying inode numbers may have changed. Apart from remapping the new device numbers, we'd have to add a bunch of complexity for the restore case to remap the new inode numbers to the old ones, and make sure `getdents()` queries this map before returning. This also assumes that the inodes did not move across devices during migration.
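Roughly, the restore-time remapping being described might look like this sketch (hypothetical names, not gVisor code): a map from post-restore host inode numbers back to the saved ones, consulted whenever `getdents()` builds its results:

```go
package main

import "fmt"

// inoRemap maps post-restore host inode numbers back to the inode
// numbers the application saw before save.
type inoRemap struct {
	oldByNew map[uint64]uint64
}

// fixup rewrites a dirent's ino to the saved value, if one exists.
// This assumes the file did not move across devices during migration.
func (r *inoRemap) fixup(hostIno uint64) uint64 {
	if old, ok := r.oldByNew[hostIno]; ok {
		return old
	}
	return hostIno
}

func main() {
	r := &inoRemap{oldByNew: map[uint64]uint64{42: 21}}
	fmt.Println(r.fixup(42), r.fixup(7)) // 21 7
}
```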
@crappycrypto I am curious to know if slow directory operations for large directories is still an issue for you.
A friendly reminder that this issue had no activity for 120 days.
This issue has been closed due to lack of activity.
There are TODOs still referencing this issue:
- runsc/fsgofer/lisafs.go:1035: Get rid of per-dirent stat.
- pkg/sentry/fsimpl/gofer/directfs_dentry.go:584: Get rid of per-dirent stat.