Slow vfs for directories with a large number of files
Description
Gvisor (with vfs2) is very slow when accessing directories with a huge number (50000) of files on an external mount. Operations inside such a directory can take hundreds of milliseconds. Even a simple `getdents64` syscall can take a very long time, because gvisor performs a `stat` on every file in the directory. Accessing a non-existent file leads to similar behaviour. The performance difference compared to native access is enormous: gvisor is 100x slower in these cases.
Steps to reproduce
Slow getdents64 performance
- Create an external mount containing a directory with a huge number (50000) of files.
- Do a getdents64 syscall from within gvisor
- See that gvisor performs a `stat` syscall for every file in the directory
A similar issue occurs when trying to access a non-existent file in the directory:
- Create an external mount containing a directory with a huge number (50000) of files.
- Try to access a non-existent file in the directory
  `time cat DOES_NOT_EXIST`
- See that a simple cat takes hundreds of milliseconds before failing with ENOENT
runsc version
release-20210906.0
docker version (if using docker)
20.10.5
uname
5.10.0-8 (debian bullseye)
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
No response
This is a pretty difficult problem to solve. gVisor virtualizes `dev` and `ino` to provide a consistent view of the filesystem and to hide the different mappings that may be happening under the covers by the gofer. `dev` is generated per filesystem (see `VirtualFilesystem.GetAnonBlockDevMinor()`), while `ino` is generated per dentry based on the file/dir's original `{dev, ino}` (see `gofer.filesystem.inoFromQIDPath()`). `getdents(2)` only returns `ino`, requiring the extra `stat(2)` call to fetch the `dev` number for the file so that the virtualized `ino` can be generated.
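For illustration, here is a minimal sketch of this kind of per-dentry virtualization (made-up code, not gVisor's; `inoKey` and its fields are hypothetical). The key must include the device numbers because a host `ino` is only unique within a single device, which is exactly why each dirent currently costs an extra `stat(2)`:

```go
package main

import (
	"fmt"
	"sync"
)

// inoKey identifies a backing file by its host identity.
type inoKey struct {
	devMajor uint32
	devMinor uint32
	ino      uint64
}

type filesystem struct {
	mu       sync.Mutex
	lastIno  uint64
	inoByKey map[inoKey]uint64
}

// inoFromKey returns a stable virtualized inode number for key,
// allocating a fresh one on first use.
func (fs *filesystem) inoFromKey(key inoKey) uint64 {
	fs.mu.Lock()
	defer fs.mu.Unlock()
	if ino, ok := fs.inoByKey[key]; ok {
		return ino
	}
	fs.lastIno++
	fs.inoByKey[key] = fs.lastIno
	return fs.lastIno
}

func main() {
	fs := &filesystem{inoByKey: make(map[inoKey]uint64)}
	// The same host ino on two different devices must map to two
	// distinct virtual inos (cf. the bind-mount collision discussed
	// later in this thread).
	a := fs.inoFromKey(inoKey{devMajor: 8, devMinor: 1, ino: 21})
	b := fs.inoFromKey(inoKey{devMajor: 8, devMinor: 2, ino: 21})
	fmt.Println(a, b) // 1 2
}
```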
Having said that, I wonder if it would be possible to expose the original file `dev` and `ino` without virtualizing them in the sentry. They would not match the filesystem's `dev`, but I'm not sure anyone relies on that. So here's an idea to explore:
cc @ayushr2
In LISAFS, `Getdents64()` returns a slice that contains `Ino`, `DevMinor`, and `DevMajor` so that the Sentry can virtualize `ino` (see `inoFromKey()`). However, we could push this responsibility to the gofer. In that case, `Getdents64()` doesn't need to send `DevMinor` and `DevMajor` to the sentry anymore, and fsgofer can skip the `stat(2)` call altogether. Note that fsgofer doesn't need to virtualize `ino` since it can pass the `ino` straight from the host*. Gofers that are not using the local host as a backing filesystem can be responsible for virtualizing `ino` and `dev` based on where the file is coming from (IOW, move the `ino` map from the sentry to the gofer). Another advantage is that in the `fsgofer` case, we don't need to keep the `gofer.filesystem.inoByKey` mappings anymore.
* That assumes it won't break anyone if `dev` for a given file doesn't match the filesystem where the file came from.
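For context on why the gofer could skip `stat(2)` entirely: the host's `getdents64(2)` already returns `d_ino` for every entry. Below is a minimal standalone sketch (not fsgofer code, and simplified to a single `Getdents` call rather than a full read loop) that pulls each inode number straight out of the raw dirent buffer:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.Open("/tmp") // any large directory
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := make([]byte, 1<<16)
	n, err := unix.Getdents(int(f.Fd()), buf)
	if err != nil {
		panic(err)
	}
	// Each record is a struct linux_dirent64:
	//   d_ino (8 bytes) | d_off (8) | d_reclen (2) | d_type (1) | d_name...
	for off := 0; off < n; {
		ino := binary.LittleEndian.Uint64(buf[off:])
		reclen := int(binary.LittleEndian.Uint16(buf[off+16:]))
		name := buf[off+19 : off+reclen]
		if i := bytes.IndexByte(name, 0); i >= 0 {
			name = name[:i] // d_name is NUL-terminated and padded
		}
		fmt.Printf("%d\t%s\n", ino, name) // d_ino with zero stat(2) calls
		off += reclen
	}
}
```

A real implementation would loop until `Getdents` returns 0; the point is only that every `d_ino` arrives for free with the directory listing.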
Your analysis is correct there. I did some benchmarking in https://github.com/google/gvisor/issues/6578#issuecomment-919736649 and lisafs does improve things. But getting rid of the stat calls entirely would surely improve things further.
> Note that fsgofer doesn't need to virtualize `ino` since it can pass the `ino` straight from the host
^ This can create inode number conflicts: what if the gofer is serving a bind mount which has a mount point at `foo/bar`, and the application runs `ls foo`? `foo/baz` can have inode number 21, and the mount point (since it is backed by a different device) can also have inode number 21.
I had thought about this before, but it quickly gets complicated. Let's say we try to scan the bind mount for mount points on startup (the common case is no mount points), but that can change because bind mounts have a remote-revalidation cache policy, so the underlying system state can change under our feet at any time.
It is annoying that we have to take such a massive hit for such an uncommon corner case. Any ideas?
> This can create inode number conflicts: what if the gofer is serving a bind mount which has a mount point at `foo/bar`, and the application runs `ls foo`? `foo/baz` can have inode number 21, and the mount point (since it is backed by a different device) can also have inode number 21.
The key is to also expose the `dev` number from the host. `ino` from different devices can be duplicated; the requirement is that `{dev, ino}` pairs are unique. The caveat is that the `dev` from the file will not match the `dev` of the filesystem (in the Sentry) that is serving the file.
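To make the `{dev, ino}` argument concrete, here is a small sketch (the `foo/baz` and `foo/bar` paths are hypothetical, matching the example above) showing that two entries which collide on `ino` across a bind-mount boundary still differ once the host `dev` is included:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// id returns the host {dev, ino} pair for path.
func id(path string) (uint64, uint64) {
	var st unix.Stat_t
	if err := unix.Stat(path, &st); err != nil {
		panic(err)
	}
	return st.Dev, st.Ino
}

func main() {
	// foo/baz and the mount point foo/bar may share ino 21, but
	// their device numbers differ, so the pairs remain unique.
	d1, i1 := id("foo/baz")
	d2, i2 := id("foo/bar")
	fmt.Printf("{%d, %d} vs {%d, %d}\n", d1, i1, d2, i2)
}
```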
> The key is to also expose the `dev` number from the host.
IIUC you are suggesting that, instead of returning our sentry-only virtualized device ID on stat (code pointer), we instead expose the actual device ID from the host. How do we deal with conflicts with sentry-virtualized device IDs?
Ah, also, I recall another issue I had thought of: we cannot use the host inode number for files in the sentry, because then we have no way of generating unique inode numbers for synthetic files in the same gofer mount. This was the dead end I hit.
> we have no way of generating unique inode numbers for synthetic files in the same gofer mount.
Surely we can address this by using a different device ID for these special files, but then it gives the impression that these special files are mount points.
The whole premise of the idea is that we can freely return different device IDs for files within the same filesystem without breaking applications. If that holds true (I suspect that it won't, but I can't think of a case where it breaks), then we can play with device IDs and `ino` to reduce the chances of a conflict.
Regarding "mapping host device numbers to sentry-synthetic device numbers", note that we already do something similar in overlayfs, and this is based on Linux's overlayfs behavior.
According to https://lwn.net/Articles/866582/, there are issues related to this:
- Some applications may assume that each filesystem is associated with only a single device number, because historically this has been true; for example, `/proc/[pid]/mountinfo` reports a single device number per filesystem, which is "the value of st_dev for files on this filesystem" (proc(5)). My impression is that the fact that Linux overlayfs doesn't have this property (without xino, which is not enabled by default), in conjunction with the popularity of Docker, has been forcing applications away from this assumption anyway.
- The Linux kernel's NFS daemon also makes the same assumption. In NFS (both v3 and v4), file attributes contain a filesystem ID that affects protocol behavior, but no device number. Accordingly, Linux's NFS daemon discards per-file device numbers, resulting in inode number collisions for files on filesystems with multiple device numbers accessed over NFS. (This is the primary topic of the LWN article linked above, although it's focused on btrfs rather than overlayfs.) We don't have an NFS daemon, and if we did I think we'd require that it perform the mapping of (device number, inode number) to NFS file handle, constraining the performance impact of this limitation to the system that imposes it.
In summary, I think it's workable for the gofer client to maintain mappings of remote device numbers to sentry-synthetic device numbers, as well as a separate anonymous device number for synthetic mountpoints, and use remote inode numbers directly.
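A minimal sketch of what that could look like, with hypothetical names (`devMapper`, `sentryDev`): remote inode numbers pass through unchanged, remote device numbers are lazily assigned sentry-synthetic ones, and one extra anonymous device number is reserved for synthetic files such as mountpoints:

```go
package main

import (
	"fmt"
	"sync"
)

type devMapper struct {
	mu           sync.Mutex
	next         uint32            // last allocated synthetic device number
	byRemoteDev  map[uint64]uint32 // remote st_dev -> synthetic dev
	syntheticDev uint32            // reserved for synthetic files
}

func newDevMapper() *devMapper {
	m := &devMapper{byRemoteDev: make(map[uint64]uint32)}
	m.next++
	m.syntheticDev = m.next // reserve one up front for synthetic files
	return m
}

// sentryDev maps a remote st_dev to a stable sentry device number.
func (m *devMapper) sentryDev(remoteDev uint64) uint32 {
	m.mu.Lock()
	defer m.mu.Unlock()
	if d, ok := m.byRemoteDev[remoteDev]; ok {
		return d
	}
	m.next++
	m.byRemoteDev[remoteDev] = m.next
	return m.next
}

func main() {
	m := newDevMapper()
	// Remote inos would be used directly; only devs get remapped.
	fmt.Println(m.sentryDev(2049), m.sentryDev(2050), m.sentryDev(2049)) // 2 3 2
	fmt.Println("synthetic files use dev", m.syntheticDev)               // 1
}
```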
A friendly reminder that this issue had no activity for 120 days.
I recently implemented @nixprime's suggestion from above (use remote inode numbers directly & map remote device IDs) on my local machine and didn't see much performance gain on broader filesystem benchmarks. I believe this is because the following features are enabled by default now:
- rootfs overlay: this helps with newly created large directories in the container rootfs. The directory lives in tmpfs, which is really fast to query.
- directfs: we don't have to make multiple round trips to the gofer to fetch dirents.
But @nixprime's suggestion also somewhat complicates the S/R use case. As of right now, S/R takes responsibility for presenting the same inode/dev numbers before and after S/R for the same file. (Note, however, that even in the current form this is partially broken, because only inode numbers in the dentry cache are remembered and restored; inode numbers of dentries that are evicted before the "save" operation are not restored.)
On restore, the filesystem may have been migrated, and hence the underlying inode numbers may have changed. Apart from remapping the new device numbers, we'd have to add a bunch of complexity for the restore case to remap the new inode numbers to the old ones, and make sure `getdents()` queries this map before returning. This also assumes that the inodes did not move across devices during migration.
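Roughly, the restore-time remapping being described might look like this sketch (hypothetical names, not gVisor code): a map from post-restore host inode numbers back to the saved ones, consulted whenever `getdents()` builds its results:

```go
package main

import "fmt"

// inoRemap maps post-restore host inode numbers back to the inode
// numbers the application saw before save.
type inoRemap struct {
	oldByNew map[uint64]uint64
}

// fixup rewrites a dirent's ino to the saved value, if one exists.
// This assumes the file did not move across devices during migration.
func (r *inoRemap) fixup(hostIno uint64) uint64 {
	if old, ok := r.oldByNew[hostIno]; ok {
		return old
	}
	return hostIno
}

func main() {
	r := &inoRemap{oldByNew: map[uint64]uint64{42: 21}}
	fmt.Println(r.fixup(42), r.fixup(7)) // 21 7
}
```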
@crappycrypto I am curious to know if slow directory operations for large directories is still an issue for you.
A friendly reminder that this issue had no activity for 120 days.
This issue has been closed due to lack of activity.
There are TODOs still referencing this issue:
- runsc/fsgofer/lisafs.go:1035: Get rid of per-dirent stat.
- pkg/sentry/fsimpl/gofer/directfs_dentry.go:584: Get rid of per-dirent stat.