systemd
systemd copied to clipboard
cgroup-util: make sure cg_read_pid() can deal sanely with unmapped PIDs (as in foreign pidns)
In some environments, namely WSL, the cgroup.procs PID list for some reason contain a ton of zeros everywhere. My suspicion is that those are from other instances under the same WSL Kernel, which at least always hosts the system instance with the X/Wayland/PA/Pipe server, so there is a bunch of zeros to be had.
Without this patch, whenever cg_read_pid encounters such a zero, it throws an error. This makes systemd near unusable inside of WSL. Just skipping over any zeros in those lists makes systemd run without any issues for me.
On normal systems, where the list does not contain any zeros to begin with, this has no averse effects.
See also: https://github.com/microsoft/WSL/issues/8879
I'm not sure how WSL1 is involved in this, I haven't used that in ages. This is an issue on the latest WSL.
Just checked, latest WSL boots with cgroup_no_v1=all, so it's full cgroup v2. The layout in /sys/fs/cgroup also looks correct for that to me.
where do these zeros come from? What precisely is WSL doing there?
Are these simply PIDs that live in a separate pidns that we cannot map?
before we add any code around this I'd really prefer to understand what's going on here.
I am not sure cg_read_pid() just skipping over these processes is really the right approach. For various purposes (i.e. "is this cgroup empty?") such a logic would simply be wrong
I'm not sure where they come from. I can only speculate that they are processes from the WSL system distribution, that hosts the X server and friends.
Looks something like this:
$ cat /sys/fs/cgroup/cgroup.procs
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
5
8
9
10
0
0
24
0
49
I'm not sure if there is any other sane way to deal with those zeros than to ignore them. A bit more involved of an approach would be to have cg_read_pid() return those zeros without error, and then have the code in various places further down the line deal with it.
Another approach that I originally implemented and been using for a while specifically modifies append_cgroup() in src/core/dbus-unit.c to continue on EIO from cg_read_pid(ref). This also seems to resolve most issues. At the very least it gets systemctl to work again and not bail with EIO every time. But it's quite ugly to just ignore and press on on EIO.
Comments in https://github.com/microsoft/WSL/issues/8879 imply that WSL is the problem:
The WSL handling of cgroups: it's possible that systemd shouldn't observe zeros in its root cgroup. The systemd container interface suggests that systemd should be running in a new cgroup, but I don't know if that is required. [...]
[...] the same root cause @nullpo-head figured out for the problem when it was reported under distrod (https://github.com/nullpo-head/wsl-distrod/issues/31#issuecomment-1079726209), with the same fix (put systemd in a new cgroup). [...]
That would have to be fixed from Microsofts side then, wouldn't it? Given its WSL itself that's launching systemd here. The issue was reported to them in 2022, and so far no sign of any interest in fixing it from their side.
That would have to be fixed from Microsofts side then, wouldn't it?
If the suspicions prove correct, yes
I tried replacing /sbin/init with a small shellscript that runs exec /usr/bin/unshare -C /lib/systemd/systemd, in an attempt to spawn systemd inside of a new cgroup namespace.
However, that seems to have no effect whatsoever. There is still a bunch of zeros in the top level cgroup.procs.
so i am pretty sure systemd should be fixed to be fine with processes with unmappable pids (which is what those zero PIDs are).
But skipping over them generically as in the proposed patch is problematic. As mentioned for code that checks if a cgroup is empty it is very much relevant to know that there is a process even if its pid cannot be mapped locally. I do have the suspicion that most of the time we want to skip those processes, but just not always.
hence, I figure cg_read_pid() should be changed to take a flags param or so, with a flag CG_PID_SKIP_UNMAPPED or so, which must be explicitly specified to skip the unmapped processes as if they didn't exist, and then every single call site has to be looked at to decide if we should skip or not.
messy, and involved, but I think that's the only right fix.
There's not thaaat many callers from a quick glance, I'll have a look at it.
I've turned the flag around, since for the majority of callers, ignoring the zero-pids is the desired mode of operation. I only found one single place where they are needed, and that's indeed in the empty-check function.
Not sure what's up with the failing/stuck autopkgtests, I've trouble even identifying what the error is, and it looks unrelated.
looks good, but i'd like to see a minor change how we return the zero pids.
I'm leaning towards this approach, but one comment.
On 09.05.2024 12:38, Mike Yuan wrote:
In src/cgtop/cgtop.c https://github.com/systemd/systemd/pull/32534#discussion_r1595280123:
@@ -207,7 +207,7 @@ static int process( return r;
g->n_tasks = 0;
while (cg_read_pid(f, &pid) > 0) {
while (cg_read_pid(f, &pid, /* flags = */ 0) > 0) {Hmm, this is changed back? It should still set |CGROUP_INCLUDE_UNMAPPABLE_PID|?
I don't think it was ever set in this version of the set, fixed now.
Is this a known issue @mrc0mmand ?
16:02:05 Core was generated by `/usr/lib/systemd/tests/unit-tests/test-bus-watch-bind'.
16:02:05 Program terminated with signal SIGABRT, Aborted.
16:02:05 #0 0x000074930b0a8e44 in ?? () from /usr/lib/libc.so.6
16:02:05 [Current thread is 1 (Thread 0x749309e006c0 (LWP 1874))]
16:02:05 (gdb) Load new symbol table from "/systemd-meson-build/test-bus-watch-bind"? (y or n) [answered Y; input not from terminal]
16:02:05 Reading symbols from /systemd-meson-build/test-bus-watch-bind...
16:02:05 (gdb)
16:02:05 Thread 4 (Thread 0x74930b64b940 (LWP 1868)):
16:02:05 #0 0x000074930b0a34e9 in ?? () from /usr/lib/libc.so.6
16:02:05 #1 0x000074930b0a8bf3 in ?? () from /usr/lib/libc.so.6
16:02:05 #2 0x0000568edb5dea61 in main (argc=<optimized out>, argv=<optimized out>) at ../build/src/libsystemd/sd-bus/test-bus-watch-bind.c:220
16:02:05
16:02:05 Thread 3 (Thread 0x7493094006c0 (LWP 1875)):
16:02:05 #0 0x000074930b12c47b in sendmsg () from /usr/lib/libc.so.6
16:02:05 #1 0x000074930b4240dd in write_to_journal (level=level@entry=31, error=error@entry=-2, file=file@entry=0x74930b537199 "src/libsystemd/sd-bus/sd-bus.c", line=line@entry=3675, func=func@entry=0x74930b57fc00 <__func__.37> "io_callback", object_field=object_field@entry=0x0, object=<optimized out>, extra_field=<optimized out>, extra=<optimized out>, buffer=<optimized out>) at ../build/src/basic/log.c:767
16:02:05 #2 0x000074930b423555 in log_dispatch_internal (level=31, level@entry=7, error=error@entry=-2, file=file@entry=0x74930b537199 "src/libsystemd/sd-bus/sd-bus.c", line=line@entry=3675, func=func@entry=0x74930b57fc00 <__func__.37> "io_callback", object_field=object_field@entry=0x0, object=0x0, extra_field=0x0, extra=0x0, buffer=0x7493093ff1b0 "Processing of bus failed, closing down: No such file or directory") at ../build/src/basic/log.c:813
16:02:05 #3 0x000074930b4238a9 in log_internalv (level=7, error=-2, file=0x74930b537199 "src/libsystemd/sd-bus/sd-bus.c", line=3675, func=0x74930b57fc00 <__func__.37> "io_callback", format=0x74930b537828 "Processing of bus failed, closing down: %m", ap=0x7493093ffa10) at ../build/src/basic/log.c:890
16:02:05 #4 0x000074930b423944 in log_internal (level=level@entry=7, error=error@entry=-2, file=file@entry=0x74930b537199 "src/libsystemd/sd-bus/sd-bus.c", line=line@entry=3675, func=func@entry=0x74930b57fc00 <__func__.37> "io_callback", format=format@entry=0x74930b537828 "Processing of bus failed, closing down: %m") at ../build/src/basic/log.c:905
16:02:05 #5 0x000074930b49bbee in io_callback (s=<optimized out>, fd=<optimized out>, revents=<optimized out>, userdata=0x749304000e60) at ../build/src/libsystemd/sd-bus/sd-bus.c:3675
16:02:05 #6 0x000074930b4fb58e in source_dispatch (s=s@entry=0x749304001ec0) at ../build/src/libsystemd/sd-event/sd-event.c:4222
16:02:05 #7 0x000074930b4fbb3c in sd_event_dispatch (e=e@entry=0x749304000b70) at ../build/src/libsystemd/sd-event/sd-event.c:4843
16:02:05 #8 0x000074930b4fbd73 in sd_event_run (e=e@entry=0x749304000b70, timeout=timeout@entry=18446744073709551615) at ../build/src/libsystemd/sd-event/sd-event.c:4904
16:02:05 #9 0x000074930b4fbdf0 in sd_event_loop (e=0x749304000b70) at ../build/src/libsystemd/sd-event/sd-event.c:4926
16:02:05 #10 0x0000568edb5dd761 in thread_client2 (p=<optimized out>) at ../build/src/libsystemd/sd-bus/test-bus-watch-bind.c:181
16:02:05 #11 0x000074930b0a6ded in ?? () from /usr/lib/libc.so.6
16:02:05 #12 0x000074930b12a0dc in ?? () from /usr/lib/libc.so.6
16:02:05
16:02:05 Thread 2 (Thread 0x74930a8006c0 (LWP 1873)):
16:02:05 #0 0x000074930b0f2f43 in clock_nanosleep () from /usr/lib/libc.so.6
16:02:05 #1 0x0000568edb5ddfb3 in usleep_safe (usec=usec@entry=100000) at ../build/src/basic/time-util.h:228
16:02:05 #2 0x0000568edb5de0e8 in thread_server (p=0x7ffeef516a40) at ../build/src/libsystemd/sd-bus/test-bus-watch-bind.c:63
16:02:05 #3 0x000074930b0a6ded in ?? () from /usr/lib/libc.so.6
16:02:05 #4 0x000074930b12a0dc in ?? () from /usr/lib/libc.so.6
16:02:05
16:02:05 Thread 1 (Thread 0x749309e006c0 (LWP 1874)):
16:02:05 #0 0x000074930b0a8e44 in ?? () from /usr/lib/libc.so.6
16:02:05 #1 0x000074930b050a30 in raise () from /usr/lib/libc.so.6
16:02:05 #2 0x000074930b0384c3 in abort () from /usr/lib/libc.so.6
16:02:05 #3 0x000074930b423a9e in log_assert_failed (text=text@entry=0x568edb5df7a6 "r >= 0", file=file@entry=0x568edb5df011 "src/libsystemd/sd-bus/test-bus-watch-bind.c", line=line@entry=149, func=func@entry=0x568edb5df910 <__func__.4> "thread_client1") at ../build/src/basic/log.c:992
16:02:05 #4 0x0000568edb5dddba in thread_client1 (p=<optimized out>) at ../build/src/libsystemd/sd-bus/test-bus-watch-bind.c:149
16:02:05 #5 0x000074930b0a6ded in ?? () from /usr/lib/libc.so.6
16:02:05 #6 0x000074930b12a0dc in ?? () from /usr/lib/libc.so.6
16:02:05 (gdb) (gdb) #0 0x000074930b0a8e44 in ?? () from /usr/lib/libc.so.6
16:02:05 No symbol table info available.
16:02:05 #1 0x000074930b050a30 in raise () from /usr/lib/libc.so.6
16:02:05 No symbol table info available.
16:02:05 #2 0x000074930b0384c3 in abort () from /usr/lib/libc.so.6
16:02:05 No symbol table info available.
16:02:05 #3 0x000074930b423a9e in log_assert_failed (
16:02:05 text=text@entry=0x568edb5df7a6 "r >= 0",
16:02:05 file=file@entry=0x568edb5df011 "src/libsystemd/sd-bus/test-bus-watch-bind.c", line=line@entry=149,
16:02:05 func=func@entry=0x568edb5df910 <__func__.4> "thread_client1")
16:02:05 at ../build/src/basic/log.c:992
16:02:05 No locals.
16:02:05 #4 0x0000568edb5dddba in thread_client1 (p=<optimized out>)
16:02:05 at ../build/src/libsystemd/sd-bus/test-bus-watch-bind.c:149
16:02:05 error = {
16:02:05 name = 0x74930b535488 "org.freedesktop.DBus.Error.FileNotFound",
16:02:05 message = 0x74930b1bf173 "No such file or directory",
16:02:05 _need_free = 0
16:02:05 }
16:02:05 bus = 0x7492fc000b70
16:02:05 path = <optimized out>
16:02:05 t = 0x749309dffc10 "unix:path=/dev/shm/systemd-watch-bind-vMfsUe/this/is/a/socket"
16:02:05 r = <optimized out>
16:02:05 __func__ = "thread_client1"
16:02:05 #5 0x000074930b0a6ded in ?? () from /usr/lib/libc.so.6
16:02:05 No symbol table info available.
16:02:05 #6 0x000074930b12a0dc in ?? () from /usr/lib/libc.so.6
16:02:05 No symbol table info available.
lgtm.
testing-farm:fedora-rawhide-x86_64 seems to have failed due to some network problems. mkosi / ci (debian, testing): systemd:integration-tests / TEST-70-TPM2
Both look unrelated.
Suggestion 1:
#define PID_UNMAPPED 0 // use this macro in situation when checking for external pidns PIDs
Usage of the macro in conditions will be self-documenting without need to comment each of the guards.
Suggestion 2:
- CGROUP_DONT_SKIP_UNMAPPED = 1 << 3,
+ CGROUP_INCLUDE_UNMAPPED = 1 << 3,
The double negative is confusing.
Suggestion 3: s/UNMAPPED/NONMAPPED/ -- when I first saw the title of the issue I thought about pids that were mapped but aren't anymore -- these are simply not mapped/visible into the target pinds.
(Believe it or not, when I started typing this comment, the PR was still unmerged ;-)
Suggestion 2:
- CGROUP_DONT_SKIP_UNMAPPED = 1 << 3, + CGROUP_INCLUDE_UNMAPPED = 1 << 3,
This is already discussed in https://github.com/systemd/systemd/pull/32534#discussion_r1595241541.
Could I PR this patch to the 255 branch of the stable repo? So that Ubuntu 24.04, which ships on WSL with systemd enabled by default, could hopefully pick it up at some point? Or is that process automatic?
This is already discussed in https://github.com/systemd/systemd/pull/32534#discussion_r1595241541.
I didn't notice that -- I believe that proves the strength of confusion it causes. The error is IMO fine since you can't make pidref to a not mapped PID.
(Triple negative :exploding_head: )
BTW Can this situation be achieved (processes from inaccessible pidns in managed cgroups) anyhow but by the violation of the single writer rule?
Could I PR this patch to the 255 branch of the stable repo? So that Ubuntu 24.04, which ships on WSL with systemd enabled by default, could hopefully pick it up at some point? Or is that process automatic?
Feel free to open a PR. Please use git cherry-pick -x.