
syzbot: fs coverage dropped in v6.9-rc1

Open • a-nogikh opened this issue 1 year ago • 14 comments

After the first v6.9 commits reached the torvalds/linux tree, we saw a big drop in the number of covered trace points in fs/. This is best seen on our ci2-upstream-fs instance: we went from 178k to ~100k.

Sample coverage report before: https://storage.googleapis.com/syzbot-assets/5e2921d5b3d3/ci2-upstream-fs-09e5c48f.html

After: https://storage.googleapis.com/syzbot-assets/be3e453a28d3/ci2-upstream-fs-fe46a7dd.html

As can be seen, it's not related to any specific filesystem -- no individual fs/X's coverage has dropped to 0; everything just decreased 1.5-2x.

I've looked at the git logs, and so far the only kernel commit I have identified that could have led to this change is:

    fs,block: get holder during claim
    
    Now that we open block devices as files we need to deal with the
    realities that closing is a deferred operation. An operation on the
    block device such as e.g., freeze, thaw, or removal that runs
    concurrently with umount, tries to acquire a stable reference on the
    holder. The holder might already be gone though. Make that reliable by
    grabbing a passive reference to the holder during bdev_open() and
    releasing it during bdev_release().
    
    Fixes: f3a608827d1f ("bdev: open block device as files") # mainline only
    Reported-by: Christoph Hellwig <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Reviewed-by: Jan Kara <[email protected]>
    Reviewed-by: Christoph Hellwig <[email protected]>
    Tested-by: Yi Zhang <[email protected]>
    Reported-by: https://lore.kernel.org/r/CAHj4cs8tbDwKRwfS1=DmooP73ysM__xAb2PQc6XsAmWR+VuYmg@mail.gmail.com
    Link: https://lore.kernel.org/r/20240315-freibad-annehmbar-ca68c375af91@brauner
    Signed-off-by: Christian Brauner <[email protected]>

It may be related to how we operate with /dev/loop, but that needs further investigation to confirm.

a-nogikh • Mar 28 '24

I've reverted "fs,block: get holder during claim" and ran two syzkaller instances side by side. Their performance seems to be similar, so for now it does not look like this commit is the culprit (or it is only one of several, and there is something else on top).

a-nogikh • Apr 03 '24

But the problem is definitely there and is well reproducible locally:

With two syzkaller instances running side by side for 1 hour, fs/ coverage on 6.8 is 9%, while on 6.9 it's just 5%.

a-nogikh • Apr 08 '24

Manual bisection pointed to this series:

https://lore.kernel.org/all/[email protected]/

a-nogikh • Apr 09 '24

And here is the commit that actually changed the situation for us:

321de651fa565dcf76c017b257bdf15ec7fff45d is the first bad commit
commit 321de651fa565dcf76c017b257bdf15ec7fff45d
Author: Christian Brauner <[email protected]>
Date:   Tue Jan 23 14:26:48 2024 +0100

    block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access
    
    Make it possible to detect a block device that was opened with
    restricted write access based only on BLK_OPEN_WRITE and
    bdev->bd_writers < 0 so we won't have to claim another FMODE_* flag.
    
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Christian Brauner <[email protected]>

a-nogikh • Apr 09 '24

Looks like we can no longer reuse the same loop device while some filesystem is still holding a reference to it. As I understand it, we do try to unmount mounted filesystems here:

https://github.com/google/syzkaller/blob/2b6d9d2d4f7c9240e676ed043867fa382fbc7146/executor/common_linux.h#L4531

But I don't know if there's any way to explicitly wait until the unmount process is complete. Or maybe we somehow miss some mounts.

Maybe it's now easier to move to loopfs? @dvyukov do you remember why we don't use it?

Cc @jankara

a-nogikh • Apr 09 '24

    But I don't know if there's any way to explicitly wait until the unmount process is complete.

remove_dir() should return before the next test is started. If the fs is still mounted when umount returns, it looks like a kernel bug. Any script that unmounts and then mounts something may fail, if that's the case.

    Maybe it's now easier to move to loopfs?

IIRC it did not exist at the time.

A cleaner way may be to create a mount namespace per test; then we don't need remove_dir at all. Say, we mount a tmpfs as the working dir, and then it's all dropped together with the process.
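
A rough sketch of what that could look like (my illustration, not the actual executor code; /syz-work is a made-up path, and unshare(CLONE_NEWNS) requires CAP_SYS_ADMIN):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/mount.h>
    #include <sys/stat.h>

    // Run each test in its own mount namespace: everything mounted
    // inside it (including the tmpfs working dir) is torn down by the
    // kernel when the last process in the namespace exits.
    static int setup_test_mounts(void)
    {
        if (unshare(CLONE_NEWNS))
            return -1;
        // Keep mount events from propagating back to the parent namespace.
        if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL))
            return -1;
        mkdir("/syz-work", 0777); // ignore EEXIST
        return mount("none", "/syz-work", "tmpfs", 0, NULL);
    }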

dvyukov • Apr 09 '24

So remove_dir() is actually using umount2(filename, MNT_DETACH | UMOUNT_NOFOLLOW), which means it explicitly asks the kernel to just detach the existing mount from the directory hierarchy but leave the fs mounted until all references to it are dropped. So the device is very likely still mounted when this call returns, and that is exactly what the kernel has been asked to do. Sadly, I don't think there's an easy generic way to wait for a detached filesystem to actually get cleaned up; about the best I can think of is to mount it again (the mount will reuse the same superblock if it still exists) and then unmount it without MNT_DETACH.
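
If we went that route, a minimal sketch of the suggestion could look like this (untested; dev, dir, and fstype are placeholders):

    #include <sys/mount.h>

    // Wait for a lazily-detached filesystem to actually be cleaned up:
    // mounting the same device again reuses the old superblock if it
    // still exists, and the subsequent unmount *without* MNT_DETACH is
    // synchronous, so the superblock is really gone when it succeeds.
    static int wait_for_cleanup(const char *dev, const char *dir,
                                const char *fstype)
    {
        if (mount(dev, dir, fstype, 0, NULL))
            return -1; // superblock may already be gone, or dev is busy
        return umount2(dir, UMOUNT_NOFOLLOW);
    }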

jankara • Apr 09 '24

You are right, I forgot about MNT_DETACH. This is very old code that goes back almost to the first days of syzkaller: https://github.com/google/syzkaller/commit/1e06d2bafc A number of issues with removing various constructs after the fuzzer are mentioned there, but unfortunately there is no explanation for the detach. I guess the fuzzer created some cases where it hangs (e.g. fuse and other network filesystems).

    Sadly I don't think there's an easy generic way to wait for a detached filesystem to actually get cleaned up

If we want to wait synchronously, then we can just remove MNT_DETACH, right? But I'm afraid it will create its own set of problems/hangs.

I guess mounting a loopfs per test process would be good regardless; it would also eliminate more interference between parallel test processes. This may leak some loop devices, but we leak lots of things anyway.

dvyukov • Apr 09 '24

    Maybe it's now easier to move to loopfs? @dvyukov do you remember why we don't use it?

I see why -- this patch series was never merged into mainline.

/dev/loop-control has LOOP_CTL_ADD and LOOP_CTL_REMOVE, which seem to be globally visible, so they're not really suitable for our model of execution. There's also

LOOP_CTL_GET_FREE
              Allocate or find a free loop device for use.  On success,
              the device number is returned as the result of the call.
              This operation takes no argument.

though it's not 100% clear what "free" means. If "free" means that no mounted filesystem is currently using the block device, we can try to use LOOP_CTL_GET_FREE in syzkaller -- MNT_DETACH should eventually release all previously allocated loop devices and make them available for reuse.
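
For reference, querying the control device would look roughly like this (a minimal sketch built on the documented /dev/loop-control ioctl):

    #include <fcntl.h>
    #include <linux/loop.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int ctrl = open("/dev/loop-control", O_RDWR);
        if (ctrl < 0)
            return 1;
        // Returns the index N of a free /dev/loopN, creating one if needed.
        int n = ioctl(ctrl, LOOP_CTL_GET_FREE);
        if (n >= 0)
            printf("free loop device: /dev/loop%d\n", n);
        close(ctrl);
        return n < 0;
    }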

UPD:

If a loop device only stops being free once a mount is done, LOOP_CTL_GET_FREE won't work well for us, because several syz-executor processes might get the same loop device id concurrently.

a-nogikh • Apr 10 '24

Maybe we could just use MNT_FORCE instead of MNT_DETACH?

I'll try it and see what it results in.

a-nogikh • Apr 10 '24

My bet is that MNT_FORCE will hang due to reference leaks. We will also probably not detect it well: e.g. the executor will be killed on timeout, but the device will still not be free, so fuzzing coverage will drop again. Reference leaks may also be the reason why MNT_DETACH works so poorly now: it's not just async, it may be hanging forever (but asynchronously).

dvyukov • Apr 10 '24

Re LOOP_CTL_GET_FREE: IIRC it has its own accounting of free/busy devices, which has nothing to do with the actual use of the devices. And also, any code is free to just use /dev/loopN explicitly, which will conflict with LOOP_CTL_GET_FREE (it can hand out the same N).

dvyukov • Apr 10 '24

LOOP_CTL_GET_FREE will simply return a loop device that is currently not attached to any backing file. There can be time-of-check-to-time-of-use races, but it's a good hint for which loop device to try. Also, binding a loop device to a file requires an O_EXCL open these days (the ioctl grabs one temporarily if the open was without O_EXCL), so that is the moment you can rely on: if that succeeds, you're guaranteed nobody else has the device mounted. If that fails, you need to go back to LOOP_CTL_GET_FREE and try again.
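
A rough sketch of that retry loop (my illustration, untested; LOOP_SET_FD is the classic attach ioctl, newer kernels also offer LOOP_CONFIGURE):

    #include <fcntl.h>
    #include <linux/loop.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    // Claim a loop device for backing_fd. LOOP_CTL_GET_FREE is only a
    // hint, so retry when the exclusive open or the attach loses a race.
    static int claim_loop(int ctrl, int backing_fd)
    {
        for (;;) {
            int n = ioctl(ctrl, LOOP_CTL_GET_FREE);
            if (n < 0)
                return -1;
            char path[32];
            snprintf(path, sizeof(path), "/dev/loop%d", n);
            // O_EXCL: succeeds only if nobody else holds the device.
            int fd = open(path, O_RDWR | O_EXCL);
            if (fd < 0)
                continue; // raced with another user, ask again
            if (ioctl(fd, LOOP_SET_FD, backing_fd) == 0)
                return fd; // device is now attached to backing_fd
            close(fd); // the attach raced too, try the next free device
        }
    }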

jankara • Apr 10 '24

For the record: after #4668 was deployed and some of the noisy v6.9 bugs were fixed, fs coverage has gone up somewhat: from ~85k to ~135k coverage callbacks.

https://syzkaller.appspot.com/upstream/graph/fuzzing?Instances=ci2-upstream-fs&Metrics=MaxCover&Months=3

Not the 176k that we had before March 2024, but already much better.

a-nogikh • Apr 30 '24