syzkaller syz-fuzzer: add NULL check in supported features

Kernel supported features are detected using debugfs. However, if the filesystem is not mounted, syz-fuzzer panics without providing any clues as to why.

2024/05/04 10:12:49 connecting to manager...
2024/05/04 10:12:49 fuzzer vm-1 connected
2024/05/04 10:12:49 checking machine...
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x6019a8]

goroutine 1 [running]:
main.main()
        /home/alessandro/go/src/syzkaller/syz-fuzzer/fuzzer.go:169 +0x958
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: channel 0: free: client-session, nchannels 2
debug1: channel 1: free: 127.0.0.1, nchannels 1
Transferred: sent 5180, received 6532 bytes, in 2.5 seconds
Bytes per second: sent 2097.7, received 2645.2
debug1: Exit status 2

This simple patch prevents syz-fuzzer from crashing allowing it to terminate cleanly, while provides a possible cause why this issue is occurring.

2024/05/04 10:15:14 connecting to manager...
2024/05/04 10:15:14 fuzzer vm-1 connected
2024/05/04 10:15:14 checking machine...
2024/05/04 10:15:14 SYZFATAL: The currently running kernel image seems not to support any required feature, have you forgotten to mount debugfs?
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: channel 0: free: client-session, nchannels 2
debug1: channel 1: free: 127.0.0.1, nchannels 1
Transferred: sent 5160, received 5016 bytes, in 2.4 seconds
Bytes per second: sent 2106.7, received 2047.9
debug1: Exit status 1

May 04 '24 10:05 alessandrocarminati

Hi Alessandro,

Features are not returned as nil when debugfs is not mounted, the only case I see where we return nil features is this: https://github.com/google/syzkaller/blob/610f2a54d02f8cf4f2454c03bf679b602e6e59b6/pkg/host/features.go#L89 Are you trying to enable coverage for OS that does not use shmem for executor communication? We should disable coverage feature in this case with the explanation, or detect this setting earlier then syz-manager starts.

May 06 '24 06:05 dvyukov

I hope you will forgive my unconventional debugging approach. Additionally, please bear in mind my limited knowledge of the code, and the tool in general. I recently encountered a syzkaller kernel bug in my backlog, and to reproduce it, I decided to delve into syzkaller from scratch.

My initial step was to set up a Linux environment and a filesystem for learning purposes. Here are the specifications of my testing environment:

Architecture: host: x86_64, target: aarch64
Default buildroot-2024.02.1 rootfs
Linux-6.8.9 built for aarch64, using the default configuration along with suggestions. I also introduced a simple bug to ensure timely results.

My initial syzkaller run resulted in a panic.

2024/05/04 10:12:49 connecting to manager...
2024/05/04 10:12:49 fuzzer vm-1 connected
2024/05/04 10:12:49 checking machine...
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x6019a8]

goroutine 1 [running]:
main.main()
        /home/alessandro/go/src/syzkaller/syz-fuzzer/fuzzer.go:169 +0x958

By inserting print debugs in various locations, similar to what is demonstrated here:

diff --git a/syz-fuzzer/fuzzer.go b/syz-fuzzer/fuzzer.go
index 15b27bb86..e7dd2640a 100644
--- a/syz-fuzzer/fuzzer.go
+++ b/syz-fuzzer/fuzzer.go
@@ -147,16 +147,23 @@ func main() {
        }
        var checkReq *rpctype.CheckArgs
        if r.Features == nil {
+               fmt.Println("r.Features == nil")
                checkArgs.featureFlags = featureFlags
                checkReq, err = checkMachine(checkArgs)
                if err != nil {
+                       fmt.Println("    r.Features == nil && err != nil")
+                       fmt.Printf("   checkReq=%v, errstr=\"%s\"\n", checkReq,err.Error())
                        if checkReq == nil {
+                               fmt.Println("        r.Features == nil && err != nil && checkReq == nil")
                                checkReq = new(rpctype.CheckArgs)
+                               fmt.Printf("        checkReq =%v\n", checkReq)
                        }
                        checkReq.Error = err.Error()
+                       fmt.Printf("   checkReq.Erro=\"%s\"\n", err.Error())
                }
                r.Features = checkReq.Features
        } else {
+               fmt.Println("r.Features != nil")
                if err = host.Setup(target, r.Features, featureFlags, executor); err != nil {
                        log.SyzFatal(err)
                }

I observed that:

2024/05/06 07:06:10 connecting to manager...
r.Features == nil
2024/05/06 07:06:10 checking machine...
    r.Features == nil && err != nil
   checkReq=<nil>, errstr="coverage is not supported (CONFIG_KCOV is not enabled)"
        r.Features == nil && err != nil && checkReq == nil
        checkReq =&{  <nil> map[] []}
   checkReq.Erro="coverage is not supported (CONFIG_KCOV is not enabled)"

Since I was confident that my features were enabled int the kernel, I investigated how syzkaller detected these features. My conclusion was that syzkaller couldn't identify the enabled features because debugfs wasn't mounted.

Creating a new buildroot image with debugfs mounted prevented the panic. Consequently, I inferred that syzkaller expected debugfs to be mounted.

Have I overlooked anything significant in my analysis?

May 06 '24 07:05 alessandrocarminati

Yes, use of KCOV requires DEBUGFS, but syzkaller shouldn't crash if DEBUGFS is not mounted.

I see, we get nil r.Features if we get here: https://github.com/google/syzkaller/blob/610f2a54d02f8cf4f2454c03bf679b602e6e59b6/syz-fuzzer/fuzzer.go#L154

and it crashes when printing here: https://github.com/google/syzkaller/blob/610f2a54d02f8cf4f2454c03bf679b602e6e59b6/syz-fuzzer/fuzzer.go#L166

We probably could remove printing entirely, but I guess it will still crash here: https://github.com/google/syzkaller/blob/610f2a54d02f8cf4f2454c03bf679b602e6e59b6/syz-fuzzer/fuzzer.go#L180

This part of the code is currently in flux and will need to change more significantly, but a reasonable local fix may be to do this:

checkReq = &rpctype.CheckArgs{
	Features: new(host.Features),
}

here: https://github.com/google/syzkaller/blob/610f2a54d02f8cf4f2454c03bf679b602e6e59b6/syz-fuzzer/fuzzer.go#L154C5-L154C38

then we shouldn't crash and pass the error to the manager to print to user.

May 06 '24 07:05 dvyukov

I understand. Since debugfs mounting is not a requirement, my initial assumption may not be accurate. Furthermore, I've confirmed that simply disabling the feature print does not resolve the panic issue. I've tested it myself, and the crash still occurs shortly after.

Given this, should I amend my PR with a more suitable solution, such as testing the one you proposed, or would it be more appropriate to close this PR altogether?

What are your thoughts?

May 06 '24 08:05 alessandrocarminati

Whatever you prefer. A fix that prevents r.Features from being nil in the first place looks reasonable to merge.

May 06 '24 08:05 dvyukov

Please don't misunderstand me; I'm only asking because I see this as an opportunity for me to delve deeper into this code.

I've tested the suggestion from your previous message, and here are the results:

2024/05/06 08:23:11 connecting to manager...
2024/05/06 08:23:11 checking machine...
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 : 
2024/05/06 08:23:11 starting 1 executor processes
2024/05/06 08:23:11 machine check failed: coverage is not supported (CONFIG_KCOV is not enabled)
2024/05/06 08:23:11 SYZFATAL: Manager.Check call failed: machine check failed: coverage is not supported (CONFIG_KCOV is not enabled)

Indeed, it doesn't panic, which is a positive outcome.

However, it provides a message that might confuse the user:

Coverage is not supported (CONFIG_KCOV is not enabled), which is not accurate; In this scenario, CONFIG_KCOV is enabled, and the user is simply using an image that doesn't mount debugfs.

Moreover, the final message the user receives indicates that syzkaller exited because the KCOV feature isn't enabled in the kernel, suggesting it's a requirement.

2024/05/06 08:23:11 SYZFATAL: Manager.Check call failed: machine check failed: coverage is not supported (CONFIG_KCOV is not enabled)

However, based on our interaction, I understood KCOV isn't mandatory for syzkaller to function. Looking at syz-manager/rpc.go, where the error is generated, it appears that it considers any rpctype.CheckArgs reported error as fatal, is this the intended behavior?

May 06 '24 08:05 alessandrocarminati

If you want better error message, then it should be done here: https://github.com/google/syzkaller/blob/69f2eab004cdc5bce339d5359dcf234698153dc7/pkg/host/features_linux.go#L45-L56

I see it already checks for debugfs, so perhaps this function needs improvement: https://github.com/google/syzkaller/blob/69f2eab004cdc5bce339d5359dcf234698153dc7/pkg/host/features_linux.go#L246-L251

Using if suppFeatures == nil { to report this error is wrong, it may be nil for other reasons, and it may be not nil when debugfs is not mounted.

May 06 '24 11:05 dvyukov

Hello again, I've taken a bit of time to reflect on the issue that I still have doubts about. In the fuzzer, there's this nice check that appears to be a placeholder or preparation for something that isn't there yet. In fact, if you look at the checkMachine() function, you'll notice that if it succeeds, it returns a non-null string and a null error; if it fails, it returns a null string and a non-null error. Currently, there's no case where both are non-null. Another aspect that gives me pause for thought is the design of checkMachine(). It's structured in a way that if just one of the features is missing, it gives up without checking the others, and KCOV is the first check. This suggests to me that the designer might have considered the features mandatory... Probably I'm overthinking.

In any case, if you manually provide an empty feature object as I interpret your suggestion in this thread, the fuzzer doesn't crash but eventually terminates, stating that KCOV is not enabled. So, assuming syzkaller, as you said, should work without KCOV, I attempted to reverse engineer the piece of code responsible for the premature termination. It seems that checkReq plays a crucial role, as the manager appears to inspect the error string field in there and terminate execution if anything other than an empty string is found.

In an effort to force execution without KCOV detection, I removed the code that populates the error field in checkReq. However, subsequent code still attempted to utilize KCOV and terminated VM execution upon failure.

syscalls                : 12/4407
                        :
                        :
                        :
                        :
                        :
                        :
                        :
                        :
                        :
                        :
                        :
                        :
                        :
                        :
                        :
                        :
                        :
                        :
                        :

2024/05/07 16:03:58 corpus                  : 44 (0 broken, 0 seeds)
SYZFAIL: open of /sys/kernel/debug/kcov failed
 (errno 2: No such file or directory)
2024/05/07 16:03:57 fuzzer detected executor failure='executor 0: EOF
', retrying #1
SYZFAIL: open of /sys/kernel/debug/kcov failed
 (errno 2: No such file or directory)
2024/05/07 16:03:57 fuzzer detected executor failure='executor 0: EOF
', retrying #2
SYZFAIL: open of /sys/kernel/debug/kcov failed
 (errno 2: No such file or directory)
2024/05/07 16:03:58 fuzzer detected executor failure='executor 0: EOF
', retrying #3
SYZFAIL: open of /sys/kernel/debug/kcov failed
 (errno 2: No such file or directory)
2024/05/07 16:03:58 fuzzer detected executor failure='executor 0: EOF
', retrying #4
SYZFAIL: open of /sys/kernel/debug/kcov failed
 (errno 2: No such file or directory)
2024/05/07 16:03:58 fuzzer detected executor failure='executor 0: EOF
', retrying #5
SYZFAIL: open of /sys/kernel/debug/kcov failed
 (errno 2: No such file or directory)
2024/05/07 16:03:58 fuzzer detected executor failure='executor 0: EOF
', retrying #6
SYZFAIL: open of /sys/kernel/debug/kcov failed
 (errno 2: No such file or directory)
2024/05/07 16:03:58 fuzzer detected executor failure='executor 0: EOF
', retrying #7
SYZFAIL: open of /sys/kernel/debug/kcov failed
 (errno 2: No such file or directory)
2024/05/07 16:03:58 fuzzer detected executor failure='executor 0: EOF
', retrying #8
SYZFAIL: open of /sys/kernel/debug/kcov failed
 (errno 2: No such file or directory)
2024/05/07 16:03:59 fuzzer detected executor failure='executor 0: EOF
', retrying #9
SYZFAIL: open of /sys/kernel/debug/kcov failed
 (errno 2: No such file or directory)
2024/05/07 16:03:59 fuzzer detected executor failure='executor 0: EOF
', retrying #10
SYZFAIL: open of /sys/kernel/debug/kcov failed
 (errno 2: No such file or directory)
2024/05/07 16:03:59 fuzzer detected executor failure='executor 0: EOF
', retrying #11
SYZFAIL: open of /sys/kernel/debug/kcov failed
 (errno 2: No such file or directory)
2024/05/07 16:03:59 fuzzer detected executor failure='executor 0: EOF
', retrying #12
2024/05/07 16:03:59 SYZFATAL: executor 0 failed 11 times: executor 0: EOF

debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: channel 0: free: client-session, nchannels 2
debug1: channel 1: free: 127.0.0.1, nchannels 1
Transferred: sent 6892, received 26440 bytes, in 17.6 seconds
Bytes per second: sent 392.6, received 1506.3
debug1: Exit status 1

I might be about to say something very wrong, so please forgive me for that, but seems like this code is written assuming KCOV.

In the meantime, I've rewritten the checkDebugFS() function to make it more robust, if you consider this worthy, I can open another PR for this.

index 393617e45..ffd7d66d3 100644
--- a/pkg/host/features_linux.go
+++ b/pkg/host/features_linux.go
@@ -244,10 +244,14 @@ func checkVhciInjection() string {
 }
 
 func checkDebugFS() string {
-       if err := osutil.IsAccessible("/sys/kernel/debug"); err != nil {
-               return "debugfs is not enabled or not mounted"
-       }
-       return ""
+        var stat syscall.Statfs_t
+        if err := syscall.Statfs("/sys/kernel/debug", &stat); err != nil {
+                return err.Error()
+        }
+        if stat.Type != 0x64626720 { //DEBUGFS_MAGIC
+                return "debugfs not mounted"
+        }
+        return ""
 }
 
 func checkKCSAN() string {

Now is able to detect if debugfs is mounted more accurately, but for the overall issue it makes no difference, since having debugfs mounted or not wasn't apparently a blocking issue. Any thoughts?

May 07 '24 16:05 alessandrocarminati

In the meantime, I've rewritten the checkDebugFS() function to make it more robust, if you consider this worthy, I can open another PR for this.

As I mentioned this part is being changed almost completely. I finally got it to PR stage: #4790. Overall better diagnostics for end users are good, but the change will need to fit into the new architecture.

May 15 '24 09:05 dvyukov

Hello,

I took a quick look at the changes you made to the codebase, and it has indeed changed significantly since the last time I reviewed it. Given these updates, my previous proposals are no longer relevant, so please feel free to close this PR as it is no longer applicable.

Allow me to make one final remark: In my test rootfs, where debugfs is not mounted, the code no longer crashes and terminates cleanly with the following statement:

2024/05/17 07:41:39 [FATAL] check failed: coverage is not supported: only 0 calls are executed out of 3

Although it is possible to have kcov missing while debugfs is mounted, accessing it from userspace requires debugfs to be mounted. Since kcov is necessary, it is the user's responsibility to provide it. I agree that it is not strictly necessary to check whether debugfs is mounted or not. Does it make sense to you to introduce code that tries to mount debugfs if not mounted, or we can just conclude a smart user can figure out what the problem is, and help himself?

May 17 '24 08:05 alessandrocarminati

I think we could try to mount debugfs during setup procedure in executor in setup_features. But we need to mount it before any features are checked, not just coverage, because debugfs is used by a number of features. We shouldn't fail if the mount fails and instead rely on the check for each feature. Additionally we could now also improve error message in coverage feature check.

May 21 '24 05:05 dvyukov

syzkaller syzkaller copied to clipboard

syz-fuzzer: add NULL check in supported features

syzkaller
syzkaller copied to clipboard