syzbot: lots of `SYZFAIL: ebtable checkpoint: socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)` crashes

a-nogikh opened this issue 1 year ago • 19 comments

Context

In the upstream Linux namespace, the SYZFAIL: ebtable checkpoint: socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) crash has been responsible for up to 10% of daily syzbot crashes.

Its frequency recently dropped by 10x, but then it went up again (though it's still less frequent than it used to be).

The drop was likely because of the other frequent net fuzzing crasher, unregister_netdevice: waiting for DEV to become free, which has since got fixed in the net tree. Since that fix, there's been a surge of SYZFAIL: ebtable checkpoint: socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) crashes on ci-upstream-net-this-kasan-gce again.

In any case, we should figure out whether it's a manifestation of a real kernel bug or whether our checkpoint code in syz-executor must be changed.

The reproducers look similar:

#{"repeat":true,"procs":1,"slowdown":1,"sandbox":"none","tun":true,"netdev":true,"resetnet":true,"cgroups":true,"binfmt_misc":true,"close_fds":true,"usb":true,"vhci":true,"wifi":true,"ieee802154":true,"sysctl":true,"tmpdir":true,"segv":true}
bpf$BPF_TASK_FD_QUERY(0x14, &(0x7f0000000080)={0x0, 0xffffffffffffffff, 0x0, 0x7, &(0x7f0000000000)='cgroup\x00'}, 0x30)
r0 = bpf$PROG_LOAD(0x5, &(0x7f0000000080)={0x9, 0x4, &(0x7f0000000700)=ANY=[@ANYBLOB="18000000000000000000000000000000611200000000000095000000000000001383096e16281fd43e588cf7a1e65f316e5e5600f1fb642cb352b9d4c50ae8366e5cadf97f4e52fdb37bdab01f9f6cc297b10500c98ea973fbaf38f9d47c5702c2bd9ebf0134b54dbee7458404277462d8ac80053e629d28aa5b25e324fd54d237d7921ff7b52f78ad9692619113594630a9eb6490c61332499f4861a57120ea351e61ca79b452a2bffd133c9ce1b4049b537a6310d0ee13db80ad6553ed19a04679d0d66bf61277501f370105113bd565ae2e766f9a79e314ecbc4000b4702ecfcaed9cb384edf20b1d3e7011bd384577a5a78efdd8687e0574465e490aa62e217fa49e4167d7edcd030c20937155d065ee7bb686bffcf28ec73d58a1d795c358c5aee99cae4c959ba2b9a78b4e231c46f8030523faf5b79ef84c5201a69d776df2041ae3d19a3d03fb1f2913fdd3fef24c94f1e224f872c1bebc0a7622231b2be88508a13a5b74e417cdad2076dd0ccdf44daf7404337f84783856b8582065669a46c1d570cdf4d6ce259d39fdc6bc4f066eb27ba18fc0110ebf3eb081d09b8587c911260c2ca2f49825e10b20733735ec2f4a80308c92dac2cac1608cbd739d385703e2933fda0dde43f3270d7170a7f5ce1dad0a2ae4691cc8487e113b89df89fd1d3c51723d79966e8c2eae12cf2dcfa7c09b15de3f494c5bfc35a8ac8124fb66066b2b3c7db6585b2fe802e86d2794d885c779de4ab1a0999fcedaea0b0497927b536e120212681673509f2aa7ad0875d5be6fa5f5812dd7966978f435924026737b78156906c3faf9e84f0cfb70a8d326262ce7ceadf4f95a7afb2bdf8250af753f32"], &(0x7f0000000100)='GPL\x00', 0x0, 0x0, 0x0, 0x0, 0x0, '\x00', 0x0, 0x0, 0xffffffffffffffff, 0x8, 0x0, 0x0, 0x10, 0x0}, 0x70)
r1 = openat$cgroup_root(0xffffffffffffff9c, &(0x7f0000000000), 0x200002, 0x0)
bpf$BPF_PROG_DETACH(0x8, &(0x7f0000000040)={@cgroup=r1, r0, 0x2}, 0x10)

I was unable to crash a v6.15-rc2 kernel running on qemu using the reproducer above.

Progress

In another related discussion, @FlorentRevest has concluded that the bpf program included in the reproducer is just returning 0.

a-nogikh avatar Apr 17 '25 19:04 a-nogikh

A local patched qemu-based instance is able to trigger the bug.

After some debugging, it seems that the socket() call is rejected after calling BPF_CGROUP_RUN_PROG_INET_SOCK(sk) here: https://elixir.bootlin.com/linux/v6.15-rc1/source/net/ipv4/af_inet.c#L391

So it does indeed look as if it's the fuzzer itself that prohibits the socket creation.
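
For illustration, here is a hedged sketch (not the literal reproducer, which encodes the same logic as raw instruction bytes) of loading a BPF_PROG_TYPE_CGROUP_SOCK program that unconditionally returns 0. When such a program is attached with BPF_CGROUP_INET_SOCK_CREATE, BPF_CGROUP_RUN_PROG_INET_SOCK treats the 0 return as a deny and socket() fails with EPERM:

/* Hedged sketch: a cgroup/sock BPF program that denies every socket
 * creation in the cgroup it is attached to. The two instructions are
 * simply "r0 = 0; return r0", matching the conclusion above that the
 * reproducer's program just returns 0. */
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int load_deny_sock_prog(void)
{
	const struct bpf_insn insns[] = {
	    {.code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0}, /* r0 = 0: deny */
	    {.code = BPF_JMP | BPF_EXIT},                                           /* return r0 */
	};
	union bpf_attr attr;
	memset(&attr, 0, sizeof(attr));
	attr.prog_type = BPF_PROG_TYPE_CGROUP_SOCK;
	attr.insn_cnt = 2;
	attr.insns = (uintptr_t)insns;
	attr.license = (uintptr_t)"GPL";
	return syscall(SYS_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}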

a-nogikh avatar Apr 17 '25 20:04 a-nogikh

Should we replace the socket() call by syz_init_net_socket() or something similar?

Creating a cgroup namespace per test seems like overkill. On the other hand, if we are able to prohibit such basic syscalls for the rest of the proc's lifetime, there's little value in letting it run further.

Cc @dvyukov

a-nogikh avatar Apr 17 '25 20:04 a-nogikh

Should we replace the socket() call by syz_init_net_socket() or something similar?

The socket should be in the test net namespace to reset the right state.

Creating a cgroup namespace per test seems like overkill.

Perhaps we could re-create it as part of the sandbox creation. Before we produce a SYZFAIL, we should try to re-create the sandbox several times.

Another possible option is to mark some set of syscalls that have dangerous global effects as "snapshot-mode only", and test them only in snapshot mode. We already have a bunch of issues with perf, and we create separate instances just for these to achieve a similar effect. There is also a bunch of syscalls that we simply disable entirely, or don't describe, for similar reasons (e.g. the only-in-program sanitization). See #5308.

dvyukov avatar Apr 22 '25 06:04 dvyukov

Perhaps we could re-create it as part of the sandbox creation.

You mean calling setup_cgroups() not in the runner executor process before creating the procs, but as a part of the sandboxing setup of each proc? Or do you mean something else here?

There is also a bunch of syscalls that we simply disable entirely, or don't describe, for similar reasons (e.g. the only-in-program sanitization).

Then it looks like we'd need to ban all of the BPF_PROG_TYPE_CGROUP_* bpf program types: https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_CGROUP_SOCK/

or sanitize BPF_PROG_ATTACH to never attach to an EXP_CGROUP.

a-nogikh avatar Apr 22 '25 09:04 a-nogikh

You mean calling setup_cgroups() not in the runner executor process before creating the procs

Probably not as simple as moving the call, but, yes, something like that.

or sanitize BPF_PROG_ATTACH to never attach to an EXP_CGROUP.

Is it the worst type of BPF hooks? It feels like any global hooks should be much worse than anything attached to a single cgroup.

dvyukov avatar Apr 22 '25 12:04 dvyukov

Is it the worst type of BPF hooks?

I don't know. In the documentation, they also mention LSM hooks; these also sound like they may have global consequences.

a-nogikh avatar Apr 22 '25 12:04 a-nogikh

There are also the CONFIG_CGROUP_BPF and CONFIG_BPF_LSM kernel options, but I am not sure we want to add this dimension to our kernel config generation. That would, however, be the most reliable way to prevent these crashes from happening.
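
For reference, a hypothetical config fragment for a non-snapshot fuzzing instance would look like this (standard Kconfig syntax for disabled options):

# Hypothetical fragment: prevent cgroup/LSM BPF side effects entirely
# CONFIG_CGROUP_BPF is not set
# CONFIG_BPF_LSM is not set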

a-nogikh avatar Apr 22 '25 13:04 a-nogikh

As discussed offline: to resolve this specific case, we could do CLONE_NEWCGROUP and set up cgroups for each proc independently. Hopefully the total diff will be small.
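
A minimal sketch of the namespace part, assuming it runs during per-proc sandboxing after the proc has been placed into its per-proc cgroup (as noted further down in the thread, this only changes the proc's view of the hierarchy, nothing more):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Sketch: enter a fresh cgroup namespace so that the proc's current
 * cgroup becomes the root of its view of the hierarchy. This does not
 * recreate the cgroup itself or detach anything from it. */
static void sandbox_cgroup_ns(void)
{
	if (unshare(CLONE_NEWCGROUP))
		perror("unshare(CLONE_NEWCGROUP)");
}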

a-nogikh avatar Apr 22 '25 16:04 a-nogikh

FTR

How cgroups are currently configured:

https://github.com/google/syzkaller/blob/9c80ffa0c732875f03df82e97c8294cfdf76c8a5/executor/executor.cc#L575
https://github.com/google/syzkaller/blob/9c80ffa0c732875f03df82e97c8294cfdf76c8a5/executor/executor_runner.h#L678-L683

This is where we first set them up:
https://github.com/google/syzkaller/blob/9c80ffa0c732875f03df82e97c8294cfdf76c8a5/executor/common_linux.h#L3798-L3800
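
In outline, that initial setup mounts three hierarchies under /syzcgroup. A simplified sketch follows; the exact controller lists live in the linked setup_cgroups() code, and the ones below are illustrative:

#include <sys/mount.h>
#include <sys/stat.h>

/* Simplified sketch of the initial setup: one cgroup v2 hierarchy plus
 * two v1 hierarchies, matching the /syzcgroup/{unified,cpu,net} paths
 * used throughout this thread. Controller lists are illustrative. */
static void setup_cgroups_sketch(void)
{
	mkdir("/syzcgroup", 0777);
	mkdir("/syzcgroup/unified", 0777);
	mount("none", "/syzcgroup/unified", "cgroup2", 0, NULL);
	mkdir("/syzcgroup/cpu", 0777);
	mount("none", "/syzcgroup/cpu", "cgroup", 0, "cpuset,cpuacct,hugetlb");
	mkdir("/syzcgroup/net", 0777);
	mount("none", "/syzcgroup/net", "cgroup", 0, "net_cls,net_prio,devices,freezer");
}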


During per-proc sandboxing

We make /syzcgroup available within the chroot'ed fs for each proc:
https://github.com/google/syzkaller/blob/9c80ffa0c732875f03df82e97c8294cfdf76c8a5/executor/common_linux.h#L3966
https://github.com/google/syzkaller/blob/9c80ffa0c732875f03df82e97c8294cfdf76c8a5/executor/common_linux.h#L3894


Before the execution loop

After sandboxing, right before starting the execution loop, we also do some more cgroup configuration:
https://github.com/google/syzkaller/blob/9c80ffa0c732875f03df82e97c8294cfdf76c8a5/executor/common.h#L612
https://github.com/google/syzkaller/blob/9c80ffa0c732875f03df82e97c8294cfdf76c8a5/executor/common_linux.h#L4835
https://github.com/google/syzkaller/blob/9c80ffa0c732875f03df82e97c8294cfdf76c8a5/executor/common_linux.h#L3806

That apparently configures some per-proc sub-cgroups:
https://github.com/google/syzkaller/blob/9c80ffa0c732875f03df82e97c8294cfdf76c8a5/executor/common_linux.h#L3815-L3816


And we also do some work for each program execution:

https://github.com/google/syzkaller/blob/9c80ffa0c732875f03df82e97c8294cfdf76c8a5/executor/common.h#L648
https://github.com/google/syzkaller/blob/9c80ffa0c732875f03df82e97c8294cfdf76c8a5/executor/common_linux.h#L4874

It configures symlinks: https://github.com/google/syzkaller/blob/9c80ffa0c732875f03df82e97c8294cfdf76c8a5/executor/common_linux.h#L3863-L3874
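
In essence, the per-execution step points a ./cgroup symlink in the test's working directory at the per-proc cgroup, which is exactly what openat$cgroup_root opens in the reproducers. A simplified, self-contained sketch of the linked code (link_proc_cgroup is a hypothetical name; the real code also handles ./cgroup.cpu and ./cgroup.net):

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Simplified from the linked code: expose the per-proc cgroup to the
 * test program as ./cgroup inside its working directory. */
static void link_proc_cgroup(unsigned long long procid)
{
	char cgroupdir[64];
	snprintf(cgroupdir, sizeof(cgroupdir), "/syzcgroup/unified/syz%llu", procid);
	if (symlink(cgroupdir, "./cgroup"))
		fprintf(stderr, "symlink(%s, ./cgroup) failed: %d\n", cgroupdir, errno);
}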

a-nogikh avatar Apr 24 '25 11:04 a-nogikh

The original SYZFAIL: ebtable checkpoint: socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) seems to be quite difficult to reliably reproduce, even though it happens quite frequently during fuzzing.

But SYZFAIL: ebtable: socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) (*) is easy to reproduce:

bpf$BPF_TASK_FD_QUERY(0x14, &(0x7f0000000240)={0x0, 0xffffffffffffffff, 0x0, 0x7, &(0x7f0000000000)='cgroup\x00'}, 0x30)
r0 = openat$cgroup_root(0xffffffffffffff9c, &(0x7f0000000000), 0x200002, 0x0)
r1 = bpf$PROG_LOAD(0x5, &(0x7f0000000080)={0x9, 0x4, &(0x7f00000008c0)=ANY=[@ANYBLOB="180000000000000000000000000000006112000000000000950000000000000051fa7824c74186dc02ec0696c37b64e3b24da3180100000005165c0f63cdc2e82818254950ee03568b8809a1ff4c7c4750eabfafcb9531b31e6a86827d1010c5a909ab98e00e19644a88e95ba26d1c9eecddb2d11c541418ceeb29b9b6829c6e433822bdb3cc85244aab60c1aae1314d7381fcfeb970bea672cf1e926f6a51479343144648a07a975bd89dc398712376610f6254f12495b4658319684387f6f3543205d4bc4ce05b8b961103673dff7f158052e62b20f05fd24108d8363d44fcd0f8f3647899762a17282a1914452d11f557c28f396eebdc858558db0276d14f9035f2b5f703e5be7e4acf8b78c2834ae5805fffee38a9a0033d520bcf6b08ede50899d4b9bdf85c71c5de2503dab358f42a2624c7daa9ed44039aab46419496362e54cfad05a0004ac71a003d7b85d07191bed4e5a890826300214146f7ed569985439baa355c2766dd056f5d79e454f3d873095e7a237bc06d035a8d601f21746d886419f38b34a495040000000071c2f0cce8c93cc17e9afa314fcb2ba15d646c66b0f65021829f87d988b4e2d71753b1549fa734f0b2e56dbd21ed2e09d0cddad721971637f384eed3034597c93e1c52f42cad0ed09c395dc6e9703660fefa1c80f467367c006f25caf0cbcefd13d68839893e39c588eb032905f91cafa4996dbf0c9be9654db05fb918086cc8228d02a3092c0830b8f587a5624515298b2d4eb2bde6f9a2eb83d53f717f13fa7552d92c51dbd32ea50c490ecd085d2811a7555c538cffffff7f00000000dd872244bfa64779e0f43a9c277e2910b7ccdc3d6726d34ad2101033a623ca2a49ad344884289130bc71cee2b7de62bf48129ae1af052a2d46a61625735a9eea7f793946b3229e861d8ea49806b3f7d4295f6b000000000000f337b1ceb2d8a65dcdcd895d7ba37098d2593fdaaef445af5bee02019c00000099b13ecda2a5b37de0519e974cba92ebaf0f701611a9b027ce04340bda4594cc9049c3f101629ab028145e004209ebe71a6fe84af50804000000000000004a27213354964e250a98fe357676f94b6947383e320fbb1118f586d5b9b1b977e1e1a4490ff67703a9b5900f8a6f8a805879dd91ec5ff435b219c53680c0ae04dcc4ef69b98fcb0d6b6a03a8b71a66b4e2876dc4b610444bf10000000000b046b6ae5d68156bcbd6d8793ade9a22ac8fc7857e5bbc14adc4e12b08f350c6789283b9990c72e64372a1f79769a8bdc632fc1a0b3417855d8b7d25ca4d404c23631ad3d2f55dcd385371c86170a4bca58c2b2b4eabc365f45bd10bb45b0c5bc354456a52be18d9b44014d20a3c51c8f013dade83562e73278662829e4f5a9ac00fd91178468c737f0872d97d38d11a176be5a0d7294c51eb161eddcfefa8837c7430721851ec2a107af0df6d43e732bbc01e76c66895eb85d36798d61622773591ee21ad9f6a1b73fa9cf3ffeb8a00b63af800a81d0fb8aa29df8b8ad6fbafefb5802a23cbdeeabceda5bfc5ff2fa5c1d61d04a1324794c6ed000696d9f04010c35474e690545c3d9bd836d4cef2585ba616e01c3d000000000000000000470ebc6f3453ecbf3047e4547d7632d3ad21798e730cb5d1da059b5bdb8107815dff995c0788906790406dfb4f8ee9f24ff94233e2e6e581e6e5de33a5f254c9a8b612547473c3001df3928dac9203b744619082421a8da7c00000000000000000000000000000018a73ef40cca690fb7595c6962984f8276677be6f66cbdbccf1896433808c9c84d74ac4a7c186a04a2250972f7acb156b21f9826b6acb7db32c4e3b3ec8b59fd972975edb1da872d81a35e4fda2f5cbde6b40bea20418c6e9dad30b791eea58f53e80fee4dd7fe08373ea2784fcd3a65261de71eb866458d2c22a"], &(0x7f0000000100)='GPL\x00', 0x0, 0x0, 0x0, 0x0, 0x0, '\x00', 0x0, @cgroup_sock, 0xffffffffffffffff, 0x8, 0x0, 0x0, 0x10, 0x0, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x0, 0x0, 0x10, 0x0, @void, @value}, 0x70)
bpf$BPF_PROG_DETACH(0x8, &(0x7f0000000040)={@cgroup=r0, r1, 0x2, 0x2, 0x0, @void, @value}, 0x10)

After running it under strace we get:

bpf(BPF_TASK_FD_QUERY, {task_fd_query={pid=0, fd=-1, flags=0, buf_len=7, buf="cgroup", prog_id=0, fd_type=BPF_FD_TYPE_RAW_TRACEPOINT, probe_offset=0, probe_addr=0}}, 48)
openat(AT_FDCWD, "cgroup", O_RDWR|O_PATH)
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_CGROUP_SOCK, insn_cnt=4, insns=0x2000000008c0, license="", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(0, 0, 0), prog_flags=0, prog_name="", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=-1, func_info_rec_size=8, func_info=NULL, func_info_cnt=0, line_info_rec_size=16, line_info=NULL, line_info_cnt=0, attach_btf_id=0}, 112 ...)
bpf(BPF_PROG_ATTACH, {target_fd=3, attach_bpf_fd=4, attach_type=BPF_CGROUP_INET_SOCK_CREATE, attach_flags=BPF_F_ALLOW_MULTI}, 16)

(so bpf$BPF_PROG_DETACH was misleading and it's in fact BPF_PROG_ATTACH)
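
The attach step from the strace output corresponds to roughly the following sketch, where cgroup_fd and prog_fd are assumed to be the fds returned by the openat() and BPF_PROG_LOAD calls above:

#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch of the BPF_PROG_ATTACH call seen in the strace output. Once it
 * succeeds, the cgroup/sock program runs for every socket() call made
 * from within the target cgroup. */
static int attach_to_cgroup(int cgroup_fd, int prog_fd)
{
	union bpf_attr attr;
	memset(&attr, 0, sizeof(attr));
	attr.target_fd = cgroup_fd;     /* fd of the ./cgroup symlink target */
	attr.attach_bpf_fd = prog_fd;   /* the BPF_PROG_TYPE_CGROUP_SOCK program */
	attr.attach_type = BPF_CGROUP_INET_SOCK_CREATE;
	attr.attach_flags = BPF_F_ALLOW_MULTI;
	return syscall(SYS_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}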

And ./cgroup is a symlink that we mount here:

https://github.com/google/syzkaller/blob/c6b4fb399236b655a39701fd51c33522caa06811/executor/common_linux.h#L3863-L3866

So, judging by how cgroups are currently configured, we already have a separate cgroup for each proc. As was mentioned, we could use a cgroup namespace to make sure the cgroups are really recreated each time we restart the proc, but that doesn't change the fact that it's perfectly legal to configure bpf/cgroups to deny the socket() syscall.

If we recreate the proc on SYZFAIL (which afaik we already do), we are going to hit the same problem, as the failing program will just configure it all again - even if we create a separate cgroup namespace for each proc and recreate the cgroup each time.

Are there any viable options other than disabling CONFIG_CGROUP_BPF / only allowing the operation on the snapshot instance?

@dvyukov @FlorentRevest


(*) This one is triggered here:

https://github.com/google/syzkaller/blob/c6b4fb399236b655a39701fd51c33522caa06811/executor/common_linux.h#L4845
https://github.com/google/syzkaller/blob/c6b4fb399236b655a39701fd51c33522caa06811/executor/common_linux.h#L3690
https://github.com/google/syzkaller/blob/c6b4fb399236b655a39701fd51c33522caa06811/executor/common_linux.h#L3626

Given that the underlying scenario is the same, could it be that we get SYZFAIL: ebtable checkpoint: socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) only if reset_loop() for some reason was not called and the proc was restarted?

a-nogikh avatar Apr 28 '25 10:04 a-nogikh

If we recreate the proc on SYZFAIL (which afaik we already do), we are going to hit the same problem, as the failing program will just configure it all again - even if we create a separate cgroup namespace for each proc and recreate the cgroup each time.

Will it SYZFAIL again? As far as I can see, we call reset_loop after finishing the previous program, so it should succeed after the cgroup is recreated. When you test the program with syz-execprog, does it fail on the first run or on the second?
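
(For reference, one way to check, assuming the reproducer program is saved as repro.txt; -procs and -repeat are standard syz-execprog flags:)

./syz-execprog -procs=1 -repeat=2 repro.txt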

dvyukov avatar Apr 29 '25 10:04 dvyukov

Is it that well synchronized? If reply_execute is not blocking until we have processed the results in the parent process, we will just immediately execute reset_loop() and print the SYZFAIL message/exit with a non-zero code.

https://github.com/google/syzkaller/blob/9e70464425eebd2fa590c49a04f09e407d4bc3d4/executor/common.h#L740

a-nogikh avatar Apr 29 '25 13:04 a-nogikh

That's the child process that will exit. I don't immediately see why it should lead to any failures in the parent processes.

dvyukov avatar Apr 29 '25 13:04 dvyukov

It just came to me that this problem may also affect a bug's reproducibility when we start bisecting the execution log. Especially given that we don't assign progs to the same procs, but rather loop the whole log over all procs - eventually all of them will be configured to deny basic calls like socket().

a-nogikh avatar Jul 16 '25 12:07 a-nogikh

Tried deleting per-process cgroups before mounting them:

diff --git a/executor/common_linux.h b/executor/common_linux.h
index f9d23ad29..96dae761a 100644
--- a/executor/common_linux.h
+++ b/executor/common_linux.h
@@ -3806,2 +3806,13 @@ static void setup_cgroups()
 #if (SYZ_EXECUTOR || SYZ_REPEAT) && SYZ_EXECUTOR_USES_FORK_SERVER
+
+static void recreate_cgroup(char* cgroupdir)
+{
+       if (rmdir(cgroupdir) != 0 && errno != ENOENT) {
+               failmsg("failed to rmdir a proc cgroup", "%s", cgroupdir);
+       }
+       if (mkdir(cgroupdir, 0777)) {
+               debug("mkdir(%s) failed: %d\n", cgroupdir, errno);
+       }
+}
+
 static void setup_cgroups_loop()
@@ -3816,5 +3827,3 @@ static void setup_cgroups_loop()
        snprintf(cgroupdir, sizeof(cgroupdir), "/syzcgroup/unified/syz%llu", procid);
-       if (mkdir(cgroupdir, 0777)) {
-               debug("mkdir(%s) failed: %d\n", cgroupdir, errno);
-       }
+       recreate_cgroup(cgroupdir);
        // Restrict number of pids per test process to prevent fork bombs.
@@ -3828,5 +3837,3 @@ static void setup_cgroups_loop()
        snprintf(cgroupdir, sizeof(cgroupdir), "/syzcgroup/cpu/syz%llu", procid);
-       if (mkdir(cgroupdir, 0777)) {
-               debug("mkdir(%s) failed: %d\n", cgroupdir, errno);
-       }
+       recreate_cgroup(cgroupdir);
        snprintf(file, sizeof(file), "%s/cgroup.procs", cgroupdir);
@@ -3849,5 +3856,3 @@ static void setup_cgroups_loop()
        snprintf(cgroupdir, sizeof(cgroupdir), "/syzcgroup/net/syz%llu", procid);
-       if (mkdir(cgroupdir, 0777)) {
-               debug("mkdir(%s) failed: %d\n", cgroupdir, errno);
-       }
+       recreate_cgroup(cgroupdir);
        snprintf(file, sizeof(file), "%s/cgroup.procs", cgroupdir);

Yet it fails with

SYZFAIL: failed to rmdir a proc cgroup
/syzcgroup/net/syz2 (errno 16: Device or resource busy)
loop exited with status 67

So actually we cannot delete a cgroup until we have deleted all bpf links associated with it?

CLONE_NEWCGROUP only controls the process's view of the cgroup tree; it does not isolate any resources attached to the cgroups or any new sub-cgroups the process might create.

a-nogikh avatar Jul 16 '25 15:07 a-nogikh

Okay, after some more digging, it looks like there may be no straightforward way to clean up these bpf links. The kernel may expose them via bpffs, but for that they need to be explicitly pinned, which is hard to ensure/control in fuzzer-generated programs. We could create fresh cgroups for each proc, but that essentially means leaking them and the associated resources.

So our realistic options are:

  • Disable CONFIG_CGROUP_BPF. Optionally, we could generate a separate config for the snapshot-based instance and keep it enabled there.
  • Sanitize out the BPF_CGROUP_INET_SOCK_CREATE bpf attach type. But that would be tricky, and it essentially also prevents normal fuzzing of the feature.

So my vote goes to the first option.

a-nogikh avatar Jul 16 '25 15:07 a-nogikh

Is it the worst type of BPF hooks? It feels like any global hooks should be much worse than anything attached to a single cgroup.

I assume you mean the worst in terms of potential negative impact. I think there's "worse". That being said, as Aleksandr mentioned, there's also a bunch of program types (and therefore kernel functions & BPF maps) behind that config. Apart from the snapshot-based instances, we'd lose coverage for all of that. These are definitely types of BPF programs that are used in production and for which we're quite happy to have fuzzing coverage :)

What would be needed from the kernel to be able to clean up the side effects? Is it just about those cgroup links?

pchaigno avatar Jul 16 '25 18:07 pchaigno

Is it just about those cgroup links?

At least the consequences of the cgroup links are what we mostly observe now.

Are there any other link types that may persist after a program has exited? Sockets should be closed on exit; everything perf-related we fuzz on a separate instance anyway. LSM hooks, probably?

What would be needed from the kernel to be able to clean up the side effects?

We need some way to undo all the bpf links that were created by a process. It would be just perfect if there were some way to tell the kernel to do it automatically on process exit (we fork before executing each program anyway).

Alternatively, if we focus specifically on cgroups, it would be great to have some kernel interface to list all bpf programs attached to a specific cgroup and some interface to remove the links. Then we could do that operation in the proc's loop right before forking.


@dvyukov do you have some more ideas?

a-nogikh avatar Jul 17 '25 08:07 a-nogikh

@dvyukov do you have some more ideas?

Nothing concrete.

dvyukov avatar Jul 17 '25 09:07 dvyukov