syzkaller icon indicating copy to clipboard operation
syzkaller copied to clipboard

sys/linux: generating ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) may break VM

Open ramosian-glider opened this issue 1 year ago • 22 comments

The following program:

r0 = openat$kvm(0x0, &(0x7f0000000040), 0x0, 0x0)
r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
r3 = mmap$KVM_VCPU(&(0x7f0000009000/0x1000)=nil, 0x930, 0x280000f, 0x11, r2, 0x0)
syz_memcpy_off$KVM_EXIT_HYPERCALL(r3, 0x20, &(0x7f00000001c0)="fb4149dd033be3ac2cc4a22332a77b23b08986814d7bb14c94a6ab8031d1dfd92f00000000010000005a9610fbff67521ce16f8f1f449a7a835673312b54ebb2aa7fc869d22627e7", 0x0, 0x48)
mmap$KVM_VCPU(&(0x7f0000000000/0xa000)=nil, 0x930, 0x1000001, 0x11, r2, 0x0)
openat$kvm(0x0, &(0x7f0000000040), 0x0, 0x0)
openat$kvm(0xffffff9c, &(0x7f0000000040), 0x0, 0x0)
mmap$KVM_VCPU(&(0x7f0000000000/0x14000)=nil, 0x930, 0x0, 0x5c1fd1b656592f1, 0xffffffffffffffff, 0x0)
mmap$KVM_VCPU(&(0x7f0000001000/0x2000)=nil, 0x930, 0x2000003, 0x4120932, 0xffffffffffffffff, 0x0)
openat$kvm(0xffffffffffffff9c, &(0x7f0000000080), 0x0, 0x0)
ioctl$KVM_IRQFD(0xffffffffffffffff, 0x4020ae76, &(0x7f0000000140)={0xffffffffffffffff, 0x6})
r4 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000080), 0x0, 0x0)
r5 = ioctl$KVM_CREATE_VM(r4, 0xae01, 0x0)
ioctl$KVM_CREATE_DEVICE(r5, 0xc00caee0, &(0x7f0000000140)={0x4, <r6=>0xffffffffffffffff, 0x1})
ioctl$KVM_SET_DEVICE_ATTR(r6, 0x40086602, &(0x7f0000000040)={0x5cfe, 0x0, 0x0, 0x0})

breaks fuzzing in a subtle way, setting a number of root filesystems attributes that prevent syz-executor from creating and deleting files:

# lsattr -d /
-uS-iadAc-j---------

This happens because the last call, ioctl$KVM_SET_DEVICE_ATTR(r6, 0x40086602, &(0x7f0000000040)={0x5cfe, 0x0, 0x0, 0x0}), has an incorrect ioctl number corresponding to FS_IOC_GETFLAGS, and because it happens to be called on a file descriptor somehow pointing at the root filesystem.

The latter part is not clear to me yet. The fd in question has a value of 6, and is supposed to be returned by KVM_CREATE_DEVICE, but could as well be some fd created by the executor.

I'll dig into this, but anyway I believe we need to disallow mutating ioctl numbers to value 0x40086602, unless that's FS_IOC_GETFLAGS.

ramosian-glider avatar Jun 26 '24 16:06 ramosian-glider

0x40086602 is FS_IOC_SETFLAGS:

sys/linux/fs_ioctl.txt.const:FS_IOC_SETFLAGS = 1074292226, 386:arm:1074030082, mips64le:ppc64le:2148034050

Related to #477 and https://github.com/dvyukov/syzkaller/commit/cd4daf14452ee65caaf268f92e3242e338647533

dvyukov avatar Jun 27 '24 06:06 dvyukov

0x5cfe maps to these attributes:

#define	FS_UNRM_FL			0x00000002 /* Undelete */
#define	FS_COMPR_FL			0x00000004 /* Compress file */
#define FS_SYNC_FL			0x00000008 /* Synchronous updates */
#define FS_IMMUTABLE_FL			0x00000010 /* Immutable file */
#define FS_APPEND_FL			0x00000020 /* writes to file may only append */
#define FS_NODUMP_FL			0x00000040 /* do not dump file */
#define FS_NOATIME_FL			0x00000080 /* do not update atime */
#define FS_NOCOMP_FL			0x00000400 /* Don't compress */
#define FS_ENCRYPT_FL			0x00000800 /* Encrypted file */
#define FS_BTREE_FL			0x00001000 /* btree format dir */
#define FS_INDEX_FL			0x00001000 /* hash-indexed directory */
#define FS_JOURNAL_DATA_FL		0x00004000 /* Reserved for ext3 */

dvyukov avatar Jun 27 '24 06:06 dvyukov

Yeah, sorry, I mistyped the name. It is FS_IOC_SETFLAGS that is harmful.

ramosian-glider avatar Jun 27 '24 06:06 ramosian-glider

As far as I understand, it's still possible for the fuzzer to accidentally random ioctl constants that don't match the discrimination.

Shall I handle it in Neutralize() in sys/targets/common.go?

ramosian-glider avatar Jun 27 '24 06:06 ramosian-glider

Neutralizing it is the last resort, since then we won't ever test it on any filesystem. And it seems to be doing lots of complex things that are fs specific.

We can try to attack this from different angles:

  • prevent it from opening / (which is shouldn't)
  • always mount tmpfs and hide root fs completly

We already prohibit a bunch of ioctls for similar reasons (FIFREEZE, EXT4_IOC_SHUTDOWN, EXT4_IOC_RESIZE_FS), so if we hide root fs, we could re-enable these as well.

I am trying to understand how it gets access to /, but I can't reproduce it, I always get:

#0 [1304ms] -> ioctl$KVM_SET_DEVICE_ATTR(0xffffffffffffffff, 0x40086602, 0x20000040)
#0 [1304ms] <- ioctl$KVM_SET_DEVICE_ATTR=0xffffffffffffffff errno=9

Is it reliably reproducible for you? If so, please run with -debug flag to see what fd it gets.

dvyukov avatar Jun 27 '24 07:06 dvyukov

Neutralizing it is the last resort, since then we won't ever test it on any filesystem. And it seems to be doing lots of complex things that are fs specific.

I was going to do something along the lines of:

+               if c.Args[1].(*prog.ConstArg).Val == arch.FS_IOC_SETFLAGS &&
+                       c.Meta.Name != "ioctl$FS_IOC_SETFLAGS" {
+                       c.Args[1].(*prog.ConstArg).Val = arch.FS_IOC_GETFLAGS
+               }

, i.e. only neutralize if the ioctl number doesn't match its discrimination.

This won't solve the problem of generating a perfectly valid program doing ioctl$FS_IOC_SETFLAGS() though.

Is it reliably reproducible for you? If so, please run with -debug flag to see what fd it gets.

Yeah, I'm on it.

FWIW there's also a similar problem with FS_IOC_FSSETXATTR

ramosian-glider avatar Jun 27 '24 07:06 ramosian-glider

This won't solve the problem of generating a perfectly valid program doing ioctl$FS_IOC_SETFLAGS() though.

Exactly, so it won't solve the problem. It's even easier for the fuzzer to do the same using ioctl$FS_IOC_SETFLAGS.

dvyukov avatar Jun 27 '24 07:06 dvyukov

log.txt

Here's the log from my VM. I am trying to reduce the repro now, but it is extremely fragile. I suppose it also depends on one of the FDs opened by the executor.

ramosian-glider avatar Jun 27 '24 08:06 ramosian-glider

Fun. This indeed works:

#include <fcntl.h>
#include <unistd.h>

int main() {
	int fds[2];
	pipe(fds);
	openat(fds[1], "/dev/kvm", 0, 0);
	sleep(1000);
}

What is this fd?

$ cat /proc/890477/fdinfo/5
pos:	0
flags:	0100000
mnt_id:	25
ino:	724

dvyukov avatar Jun 27 '24 08:06 dvyukov

It looks like it's just "/dev/kvm".

But in today's episode of Fun With Fuzzers:

0x5cfe returns EINVAL for this ioctl, but it looks like it passed memory mmaped from kvm vcpu to open openat and the ioctl :) So we don't know the actual flags value, nor the opened file name. If the mmaped vcpu happened to have /\x00 at that location, then it could open root.

#0 [60126ms] -> mmap$KVM_VCPU(0x20000000, 0x930, 0x1000001, 0x11, 0x5, 0x0)
#0 [60137ms] <- mmap$KVM_VCPU=0x20000000
SIGSEGV on 0x20000040, skipping
#0 [60149ms] -> openat$kvm(0x0, 0x20000040, 0x0, 0x0)
#0 [60157ms] <- openat$kvm=0x6
...
#0 [60269ms] -> ioctl$KVM_SET_DEVICE_ATTR(0x6, 0x40086602, 0x20000040)

dvyukov avatar Jun 27 '24 08:06 dvyukov

The mmaped memory most likely has 0s (at least among values that can possibly succeed when being using as a file name in openat), but opening an empty string from pipe dir fails, so that wasn't that :)

pipe2([3, 4], 0)                        = 0
openat(4, "", O_RDONLY)                 = -1 ENOENT (No such file or directory)

dvyukov avatar Jun 27 '24 08:06 dvyukov

Note that mmap() calls are requesting 0x930 bytes, which the executor probably rounds up, but prog2c happily leaves as is (trying a C repro now)

ramosian-glider avatar Jun 27 '24 08:06 ramosian-glider

Running a slightly modified C repro under strace:

...
mmap(0x1ffff000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1ffff000
mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
mmap(0x21000000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x21000000
write(1, "executing program\n", 18executing program
)     = 18
openat(0, "/dev/kvm", O_RDONLY)         = 3
ioctl(3, KVM_CREATE_VM, 0)              = 4
ioctl(4, KVM_CREATE_VCPU, 0)            = 5
mmap(0x20009000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC|PROT_SEM|PROT_GROWSUP|0x800000, MAP_SHARED|MAP_FIXED, 5, 0) = 0x20009000
mmap(0x20000000, 4096, PROT_READ|PROT_GROWSDOWN, MAP_SHARED|MAP_FIXED, 5, 0) = 0x20000000
newfstatat(1, "", {st_mode=S_IFCHR|0600, st_rdev=makedev(0xcc, 0x40), ...}, AT_EMPTY_PATH) = 0
ioctl(1, TCGETS, {c_iflag=BRKINT|ICRNL|IXON|IMAXBEL, c_oflag=NL0|CR0|TAB0|BS0|VT0|FF0|OPOST|ONLCR, c_cflag=B38400|CS8|CREAD|HUPCL|CLOCAL, c_lflag=ISIG|ICANON|ECHO|ECHOE|ECHOK|IEXTEN|ECHOCTL|ECHOKE, ...}) = 0
write(1, "path: '/' (0x1000000002f)\n", 26path: '/' (0x1000000002f)
) = 26
openat(AT_FDCWD, "/", O_RDONLY)         = 6
mmap(0x20000000, 4096, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS|MAP_POPULATE|MAP_NONBLOCK|MAP_EXECUTABLE|MAP_HUGETLB|0x16002c0|25<<MAP_HUGE_SHIFT, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(0x20001000, 4096, PROT_READ|PROT_WRITE|PROT_GROWSUP, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_GROWSDOWN|MAP_DENYWRITE|MAP_STACK|MAP_FIXED_NOREPLACE|1<<MAP_HUGE_SHIFT, -1, 0) = 0x20001000
ioctl(-1, KVM_IRQFD, 0x20000140)        = -1 EBADF (Bad file descriptor)
openat(AT_FDCWD, "/dev/kvm", O_RDONLY)  = 7
ioctl(7, KVM_CREATE_VM, 0)              = 8
ioctl(8, KVM_CREATE_DEVICE, 0x20000140) = 0
ioctl(6, FS_IOC_SETFLAGS, [FS_UNRM_FL|FS_COMPR_FL|FS_SYNC_FL|FS_IMMUTABLE_FL|FS_APPEND_FL|FS_NODUMP_FL|FS_NOATIME_FL|FS_NOCOMP_FL|FS_ENCRYPT_FL|FS_INDEX_FL|FS_JOURNAL_DATA_FL]) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

so the mapped CPU data contains 0x1000000002f, which is indeed treated as "/\0". Essentially we're doing the following:

  1. Set up a VM with a virtual CPU that does nothing, but provides some static data in its struct kvm_run.
  2. Map that CPU's struct kvm_run at address 0x20000000 with PROT_READ, so that it overlaps with syzlang's heap.
  3. Attempt to write "/dev/kvm" at address 0x20000040 and fail, because the memory is protected.
  4. Call openat(AT_FDCWD, "/", O_RDONLY) instead of opening /dev/kvm, because struct kvm_run has 0x1000000002f at address 0x20000040.
  5. Remap address 0x20000000 with PROT_WRITE, so that program arguments can be stored in it again.
  6. Prepare the arguments for KVM_IRQFD at address 0x20000140, placing the totally random value 6 at address 0x20000144. This value coincidentally happens to be the value returned by openat() at step 4.
  7. Call ioctl(-1, KVM_IRQFD, 0x20000140), which does nothing interesting (we only wanted to write 6 to address 0x20000144).
  8. Create a new VM and a virtual device, initializing a struct kvm_device at address 0x20000140, so that the output kvm_device.fd parameter is located at address 0x20000144.
  9. Take fd (which is 6) from the output parameter of ioctl(KVM_CREATE_DEVICE) and pass it to ioctl(FS_IOC_SETFLAGS), breaking the file system.

ramosian-glider avatar Jun 27 '24 09:06 ramosian-glider

so the mapped CPU data contains 0x1000000002f, which is indeed treated as "/\0".

This is hilarious.

I don't think we can prevent all such cases, fuzzer came with other creative ways to break attempts to restrict it. It's more reliable to prevent these things on sandbox level. I think we should mount tmpfs as rootfs for sandbox=none and then re-enable all prohibited syscalls.

dvyukov avatar Jun 27 '24 09:06 dvyukov

We could also do a couple more things to reduce the risk of such cases.

First, it makes little sense to call discriminated versions of openat() with random string parameters, so maybe we need to allocate them in a region that can't be mprotect()ed or mmap()ed by the program.

Second, we'd better not reuse addresses for structs allocated for different syscalls. Given the typical program size, we can afford placing them at a new address every time. In return we'll have more reproducible results that are easier to minimize (calls do not implicitly depend on each other). An alternative would be to explicitly zero out a struct before initializing its fields.

ramosian-glider avatar Jun 27 '24 10:06 ramosian-glider

All of this will also reduce fuzzer's ability to find bugs (passing device mmaped memory as a file name looks like a good corner case to test), so it's a tough choice.

dvyukov avatar Jun 27 '24 10:06 dvyukov

Does having overlapping structs really help much? To me it looks more like a loophole that allows the fuzzer to work around resource type checks by writing something into one struct as an int and then reading it back as an fd.

ramosian-glider avatar Jun 27 '24 10:06 ramosian-glider

I think we try to avoid overlapping structs as much as possible, where it's reasonably easy to do.

dvyukov avatar Jun 27 '24 12:06 dvyukov

My current intent is to factor out the ./syz-tmp bits from namespace_sandbox_proc() and make sure every sandbox uses them.

ramosian-glider avatar Jun 27 '24 14:06 ramosian-glider

My current intent is to factor out the ./syz-tmp bits from namespace_sandbox_proc() and make sure every sandbox uses them.

This would be good for sandbox=none, but I am not sure about sandbox=setuid. sandbox=none can mount own filesystems of all types, so it can test them. sandbox=setuid can't so it will test just this tmpfs, and won't test the actual rootfs at all. But this looks important -- the reachable surface is already pretty small under sandbox=setuid, and executing open/read/write/ftruncate on the actual rootfs (e.g. ext4) is a significant part of that surface. If we do tmpfs in this case, we won't test what an unpriv user can do with the root fs. But sandbox=setuid also totally shouldn't be able to change rootfs attrs (if it manages to get same failures woth broken footfs, we already found a very serious bug).

dvyukov avatar Jun 27 '24 15:06 dvyukov

Hmm, some tests start failing when I am trying to mount tmpfs in do_sandbox_none(). I suppose this is because we don't run the tests under root.

I could do unshare(CLONE_NEWUSER) to pretend I am root, but I doubt I can elevate the capabilities for mounting tmpfs in the tests.

I could as well skip the tmpfs part for non-root users, but it will remain untested then, which is undesirable.

ramosian-glider avatar Jun 28 '24 09:06 ramosian-glider

Oh wait, there's CAP_LINUX_IMMUTABLE that we can drop in order to prevent setting FS_IMMUTABLE_FL and FS_APPEND_FL. This won't stop the fuzzer from messing up with other attributes, but at least it won't break the basic functionality like file/directory creation.

ramosian-glider avatar Jun 28 '24 10:06 ramosian-glider

Yet another angle to tackle this (e.g. to make less harmful attributes do not persist after an executor run) would be to do chattr -R = / on executor startup. I still think we need to drop CAP_LINUX_IMMUTABLE to make sure temporary files are deleted.

ramosian-glider avatar Jul 01 '24 07:07 ramosian-glider

Yet another angle to tackle this (e.g. to make less harmful attributes do not persist after an executor run) would be to do chattr -R = / on executor startup.

What executor process do you mean? There are usually 4 hierarchical processes now.

I still think we need to drop CAP_LINUX_IMMUTABLE to make sure temporary files are deleted.

You mean entirely and don't test it on any filesystems?

dvyukov avatar Jul 01 '24 07:07 dvyukov

What executor process do you mean? There are usually 4 hierarchical processes now.

I was going to suggest doing that around makeCommand() in pkg/ipc/ipc.go, but apparently it is gone now :)

We need to reset the attributes before preparing to every program execution, what would be the best place to do so?

You mean entirely and don't test it on any filesystems?

That was my intent, right. Given that we cannot reliably mount tmpfs (see above), this might be the lesser evil. It will still be possible to set other attributes, except for FS_IMMUTABLE_FL and FS_APPEND_FL.

ramosian-glider avatar Jul 01 '24 08:07 ramosian-glider

We need to reset the attributes before preparing to every program execution, what would be the best place to do so?

This looks expensive. We would need to read every dir on disk and read attributes of every file? While a test can take just 40 ms.

dvyukov avatar Jul 01 '24 08:07 dvyukov

Given that we cannot reliably mount tmpfs (see above)

Do you mean this one?

Hmm, some tests start failing when I am trying to mount tmpfs in do_sandbox_none(). I suppose this is because we don't run the tests under root.

What tests have failed? Were they for linux, or test OS? For which OSes have you changed sandbox setup?

dvyukov avatar Jul 01 '24 08:07 dvyukov

What tests have failed? Were they for linux, or test OS? For which OSes have you changed sandbox setup?

I don't see where we run test programs for Linux OS in tests. We build executor for Linux in pkg/runtest/executor_test.go, but there we only run executor unit-tests "syz-executor test", it shouldn't do sandbox setup.

dvyukov avatar Jul 01 '24 08:07 dvyukov

I think there were failing tests in pkg/ipc on Linux, but those are gone now. Let me update my setup, I am pretty sure even the original bug is going to manifest in a different way now (if at all)

ramosian-glider avatar Jul 01 '24 08:07 ramosian-glider

If mount was failing, changing attribute of every file on the disk will fail too, right. Both things are not something that tests should be doing.

But I had memories that mounting tmpfs in unpriv operation nowadays.

dvyukov avatar Jul 01 '24 09:07 dvyukov