Runsc creates unreasonably large number of golang worker threads, eventually hitting any pids cgroup limit
Description
Runsc/gvisor leaves old threads in a stuck state. This leads to exceeding any sane pids cgroup quota, and is generally bad.
Steps to reproduce
- Gvisor version
$ strings /usr/local/bin/runsc|grep release-
release-20200907.0-44-g7e0b019b84b9
- Run it in a docker container
$ docker run --runtime=runsc --rm -it ubuntu /bin/bash
root@329dded2d7e7:/#
- Inspect the cgroup
$ cat /sys/fs/cgroup/pids/docker/329dded2d7e7*/tasks|wc -l
70
- Launch a command inside docker
root@329dded2d7e7:/# ls
- Check the cgroup again
$ cat /sys/fs/cgroup/pids/docker/329dded2d7e7*/tasks|wc -l
72
The number of threads should not grow; old threads should be reaped. Specifically, you will notice that the number of threads in "tracing stop" state keeps growing:
$ for i in `cat ./tasks`; do cat /proc/$i/status; echo; done|grep State|sort|uniq -c
46 State: S (sleeping)
41 State: t (tracing stop)
$ for i in `cat ./tasks`; do cat /proc/$i/status; echo; done|grep State|sort|uniq -c
46 State: S (sleeping)
44 State: t (tracing stop)
This is an example kernel stack of one of the stuck threads:
[<0>] ptrace_stop+0x13f/0x2d0
[<0>] get_signal+0x6b5/0xd50
[<0>] do_signal+0x34/0x900
[<0>] __prepare_exit_to_usermode+0x125/0x200
[<0>] prepare_exit_to_usermode+0x9/0x30
[<0>] idtentry_exit_user+0x40/0x60
[<0>] exc_int3+0x74/0x220
[<0>] asm_exc_int3+0x31/0x40
Environment
Please include the following details of your environment:
- runsc -version
marek@mrnew:~$ /usr/local/bin/runsc -version
runsc version release-20200907.0-44-g7e0b019b84b9
spec: 1.0.1-dev
FWIW, it shouldn't be runsc -v, it should be runsc -version. I'll update the issue template.
You use the ptrace platform. This is a known issue. Each time the Sentry resumes a guest process, it checks whether the current system thread of the sentry process is tracing the guest process; if it isn't, it creates a new guest thread and starts tracing it: https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/ptrace/subprocess.go;l=88?q=lookupOrCreate&ss=gvisor%2Fgvisor
So the maximum number of guest pids is the product of the number of sentry threads and the number of guest processes.
https://www.infoq.com/presentations/gvisor-os-go/
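For illustration, here is a minimal sketch of the lookup pattern described above (a simplification written for this thread, not the actual gVisor code): each subprocess keeps a pool of traced stub threads keyed by the host (Sentry) thread ID, and forks a new stub the first time a given host thread needs one.

package stubpool

import "sync"

// thread stands in for a traced stub thread in the guest process.
type thread struct {
	stubTID int32
}

// threadPool maps a host (Sentry) thread ID to the stub thread it traces.
type threadPool struct {
	mu      sync.Mutex
	threads map[int32]*thread
}

// lookupOrCreate returns the stub thread already owned by the calling host
// thread, or creates one (fork + PTRACE_ATTACH in the real implementation)
// if this host thread has never run code for this guest process before.
func (p *threadPool) lookupOrCreate(currentTID int32, newThread func() *thread) *thread {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.threads == nil {
		p.threads = make(map[int32]*thread)
	}
	t, ok := p.threads[currentTID]
	if !ok {
		t = newThread()
		p.threads[currentTID] = t
	}
	return t
}

Because the pool is per guest process, its size grows with the number of distinct Sentry threads that ever run code for that process, which is where the product above comes from.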
@avagin said:
So the maximum number of guest pids is the product of the number of sentry threads and the number of guest processes.
I don't follow. If I start 2000 processes in the container, one after another, not concurrently, and if all of them exit, why do I end up with > 2000 threads? If I start another 2000 processes should I expect 4000 threads?
I don't mind the number of threads being proportional to the number of currently running guest threads. I do mind the number of threads being proportional to the number of processes ever created during the lifetime of the container. Is that the case?
I don't follow. If I start 2000 processes in the container, one after another, not concurrently, and if all of them exit, why do I end up with > 2000 threads? If I start another 2000 processes should I expect 4000 threads?
This is strange. I tried to reproduce your case, but I failed. I have only 119 pids in a container cgroup.
T1: $ docker run --device /dev/fuse --runtime runsc --privileged -it --rm b4c85473e5d0 bash
T2: $ cat /sys/fs/cgroup/pids//docker/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775/tasks | wc -l
48
T1: root@8da2bd97657b:/fuse-test# for i in `seq 2000`; do sh -c true; done
T2: $ cat /sys/fs/cgroup/pids//docker/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775/tasks | wc -l
119
T2: $ ps axf
3517 ? Ssl 4:12 /usr/bin/containerd
3463297 ? Sl 0:00 \_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtim
3463312 ? Sl 0:00 \_ runsc-gofer --root=/var/run/docker/runtime-runsc/moby --debug=true --log=/run/containerd/io.containerd.runtime.v1.linux/moby/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775/log.json --log-format=json --debug-log=/tmp/runsc/logs/
3463316 pts/0 Ssl+ 0:37 \_ runsc-sandbox --root=/var/run/docker/runtime-runsc/moby --debug=true --log=/run/containerd/io.containerd.runtime.v1.linux/moby/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775/log.json --log-format=json --debug-log=/tmp/runsc/log
3463332 ? tsl 0:00 \_ [exe]
3463392 ? tl 0:05 \_ [exe]
3463611 ? tl 0:06 \_ [exe]
3463617 ? tl 0:06 \_ [exe]
avagin@avagin:~$ ls -l /proc/3463617/task/ | wc -l
22
avagin@avagin:~$ ls -l /proc/3463611/task/ | wc -l
22
avagin@avagin:~$ ls -l /proc/3463392/task/ | wc -l
24
avagin@avagin:~$ ls -l /proc/3463332/task/ | wc -l
6
avagin@avagin:~$ ls -l /proc/3463316/task/ | wc -l
27
avagin@avagin:~$ ls -l /proc/3463312/task/ | wc -l
24
avagin@avagin:~$ ls -l /proc/3463297/task/ | wc -l
11
Yeah, so you are saying the number of gvisor threads is capped. My claim in this ticket is that it is not capped.
What happens in my case is: I set a pids.max limit in the cgroup and of course gvisor eventually hits it. I'm trying to trace now just what the limit is, and under which gvisor version it's reached the fastest. I think we had cases where even a modest limit (say 10k pids) was reached with a guest program operating just a few dozen threads. I wonder if the cpu pinning (as removed a few days ago in f3172c3a11dda8aa21bbaf51b7f63196592cf1da; I often work on 96 logical cpu machines, so maybe that has some side effect) or reaching cfs_cpu_quota could have implications on internal gvisor thread reuse.
The problem I want to solve: I want the guest to have a limited number of pids. I want to make sure the OS is not going to see the effects of a fork bomb inside the container. I'm trying to solve it with the cgroup pids.max. Somewhat related: https://github.com/google/gvisor/issues/2489
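For context, this is roughly what that operational approach looks like; a minimal sketch assuming a hypothetical container cgroup path (in practice one would just echo the value into the file):

package main

import (
	"log"
	"os"
)

func main() {
	// Hypothetical cgroup path; substitute the real docker container ID.
	cgroup := "/sys/fs/cgroup/pids/docker/<container-id>"

	// Cap the number of tasks the whole container (including runsc
	// internals) may create. With the ptrace platform this budget is
	// consumed mostly by runsc's own stub threads, which is the problem
	// described in this issue.
	if err := os.WriteFile(cgroup+"/pids.max", []byte("512\n"), 0o644); err != nil {
		log.Fatalf("setting pids.max: %v", err)
	}
}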
Ok, I think I have some example. Here is the scenario:
$ /usr/local/bin/runsc -version
runsc version release-20200907.0-44-g7e0b019b84b9
$ docker run --runtime=runsc --rm -it ubuntu /bin/bash
root@46a0985cd38c:/# while [ 1 ]; do ls > /dev/null; done
So here you have it. Only a couple of processes running concurrently (bash, ls, that's it?). If you count the tasks, it eventually stabilizes at 265. Well, a lot, but I won't despair over it:
$ cat /sys/fs/cgroup/pids/docker/46a0*/tasks|wc -l
265
But now, when I repin them all to cpu 0, the number jumps to 616.
$ sudo bash -c '(for i in `cat /sys/fs/cgroup/pids/docker/46a0*/tasks`; do taskset -c -p 0 $i; done)'
$ cat /sys/fs/cgroup/pids/docker/46a0*/tasks|wc -l
616
This goes into insane territory. Do we really need 562 idling OS threads to handle one running ls process? The machine has 48 logical cpus and I think the cpu count makes a difference. On another box with 96 logical cpus the number jumped from 241 to 496 on repin.
Another example:
$ docker run --runtime=runsc --rm -it ubuntu /bin/bash
root@0d3cf50df413:/# apt update; apt -y install psmisc
root@0d3cf50df413:/# for j in `seq 20`; do for i in `seq 20` ; do (cat /dev/urandom > /dev/null & ); done; sleep 5; killall cat; done
...
$ cat /sys/fs/cgroup/pids/docker/0d3*/tasks|wc -l
2250
Citing @prattmic from chat:
The tracing stop threads are unused stub threads that are eligible to run user code (stub threads within the same process are eligible only to run user code for user threads in the same process, or a completely unused stub process is eligible to be used for a new user process). We don't currently explicitly try to hard cap the number of stubs we may create (this is something on the roadmap). However, we do try to reuse aggressively, so a well-behaved application (like ls in a loop in your example) shouldn't be creating such a huge number of threads. I believe that is a bug.
If I read it right, it is quadratic: number-of-guest-pids * number-of-host-golang-threads.
The host Go runtime can have a pretty much unlimited number of threads (I think 10k is the default max). Then when we context switch to the guest, we take the host thread id like this: https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/ptrace/subprocess.go;l=503?q=lookupOrCreate&ss=gvisor%2Fgvisor
func (s *subprocess) switchToApp(c *context, ac arch.Context) bool {
currentTID := int32(procid.Current())
t := s.sysemuThreads.lookupOrCreate(currentTID, s.newThread)
Notice, AFAIU procid.Current is the tid of the host-side thread. Then we look for the guest thread counterpart, and this would have been fine, but s.sysemuThreads.lookupOrCreate is scoped per subprocess, i.e. the guest thread.
So in the worst case each host thread will have at least one (at most two) threads per guest thread. Did I read it right?
So in the worst case each host thread will have at least one (at most two) threads per guest thread. Did I read it right?
This isn't quite right. First, threadPool is per-process, not per-thread, so in the worst case each host thread will have one thread per guest PROCESS.
I don't understand when each host thread can have TWO threads per guest thread.
(1) You are right, let me revise what I said:
So in the worst case each host thread will have at least one (at most two) threads per guest process.
So, for a guest with say 100 processes and a host with say 100 golang threads, we could see up to 20k worker threads on the gvisor side.
(2) One thread in s.sysemu threadPool and another thread in s.syscallThreads pool. Or did I read it wrong?
(2) One thread in s.sysemu threadPool and another thread in s.syscallThreads pool. Or did I read it wrong?
You are right. I forgot that we have a separate pool for host syscalls.
I see you changed the subject. I think "unreasonably" isn't a good word in this context. You need to understand why it works this way. The kernel ptrace API works on the thread level, and one thread can be traced by only one other thread. The Go runtime creates a number of system threads, and each goroutine can run in the context of any of these system threads. This means that if subprocess.Syscall is called from system thread A, creates a stub thread Z, starts tracing it and executes a system call in its context, then if subprocess.Syscall is later called from system thread B, it will not be able to reuse the stub thread Z, because it is already traced by system thread A.
This is the reason why the Sentry needs to create many threads. You can try to fix this issue by pinning goroutines to system threads, but I am not sure that this will not affect performance. Or you can try to use the kvm platform, which doesn't have this problem.
Here is an example of how we can reduce the number of subprocess.syscallThreads: https://github.com/avagin/gvisor/commit/5254cf371696682c4eae46fec119f394bdc31400
The same approach can be implemented for the sysemuThreads pool, but this will be a bit more complicated.
I will push back. Gvisor is a security isolation layer. A fork bomb is a common problem. Right now the only way to handle a fork bomb is to put gvisor into a pids cgroup and set pids.max to a reasonable value that will not overwhelm the system. This property of gvisor, requiring a quadratic number of OS threads for the runsc process, makes that approach infeasible.
There are two direct questions:
- how to correctly solve the fork bomb problem? How to ensure the untrusted user doesn't start 200k processes?
- does having plenty of runsc threads affect performance? In my old-school systems-programming mind, having many threads is a waste of resources.
- how to correctly solve the fork bomb problem? How to ensure the untrusted user doesn't start 200k processes?
First of all, you should not use the ptrace platform in this case. It was not designed to cover this case. You can use the kvm platform and limit the number of Go runtime threads with the cgroup pids.max.
- does having plenty of runsc threads affect performance? In my old-school systems-programming mind, having many threads is a waste of resources.
The world isn't just black or white. I already explained that the ptrace kernel interface has limitations: any thread can be traced by only one other thread. For the ptrace platform, this means that we have two choices:
- We can pin task goroutines to system threads, but this will prevent effective scheduling of these goroutines.
- We can fork a new stub thread for each Go runtime system thread. This is how it works now.
ptrace was designed for testing and quick experiments. It works fine for workloads with a small number of tasks. For testing needs, the behavior should be close to the other gvisor platforms; this is probably the reason the variant with pinned task goroutines has not been implemented.
@majek do you have a real use case where you need to use the ptrace platform? Maybe you can describe it and we will better understand your problem and the ways to solve it. I implemented a POC version of the first variant with pinned goroutines https://github.com/google/gvisor/pull/4487/commits/fd14898a87b9bd2395147201ed89bfb1fdd5ea7f; you can try to play with it.
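For readers unfamiliar with the pinning idea, here is a minimal sketch of the mechanism (my own illustration, not the linked POC code): runtime.LockOSThread keeps a task goroutine on a single OS thread, so all ptrace requests for that task's stub come from the same host TID.

package main

import (
	"fmt"
	"runtime"
	"sync"
)

// runPinnedTask runs a task goroutine that is wired to exactly one OS
// thread for its whole lifetime, so the stub thread it traces never has
// to be re-created for a different host thread.
func runPinnedTask(wg *sync.WaitGroup, task func()) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		task()
	}()
}

func main() {
	var wg sync.WaitGroup
	runPinnedTask(&wg, func() {
		fmt.Println("this goroutine owns its OS thread until it exits")
	})
	wg.Wait()
}

The trade-off, as noted above, is that the Go scheduler can no longer freely multiplex these goroutines across threads, which is where the performance concern comes from.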
I must be missing something larger here. KVM was experimental until quite recently. People built products on ptrace. ARM kvm support is not finished (https://github.com/google/gvisor/issues/63). Are you saying that ptrace is not production ready? Did I miss some big platform discussion in the documentation?
Ptrace is solid, notwithstanding the noted requirement that a containing cgroup is used appropriately. (But the sandbox should be contained in appropriate CPU, memory, etc. cgroups regardless, to prevent exhausting other kinds of resources.) I think that Andrei is just saying that ptrace may not scale well for every case (e.g. a massively multithreaded web application may not play well with the pid semantics).
But let's get back on track. First, I totally agree with the fork bomb complaint. Even with the proper pid limit, it will generally manifest as the Go runtime failing to create a new thread internally and dying. This is not acceptable and a flaw worth fixing. I'm not sure that the existence of many OS-level threads is an actual concern, but the ergonomics are definitely bad.
There are two tracks to pursue here. First, @avagin is actually working on a new platform designed to replace ptrace (runs anywhere) that is faster and doesn't suffer from this problem, so you should connect with him. It would be great to understand your use case to help ensure that the platform is addressing the core problems. Maybe a community meeting?
Second, that might take a little while, so we should improve this issue for ptrace in the meantime. But it's unclear to me: do you want the n^2 to be limited to n (which can be accomplished with the pinning, which we could integrate for ptrace at least), or should it actually be limited to a fixed number (which would require some kind of Switch() admission control... also doable)? I think this is why Andrei is asking about the use case, to understand what the desired state is.
But it's unclear to me: do you want the n^2 to be limited to n ... asking about the use case
I'm running untrusted workloads. Many of them. As an operator I would like to guarantee that a single workload won't be too large a burden on the kernel scheduler, the cgroups machinery, etc. Therefore I am limiting the runsc containers in cgroups. I am successfully using the memory cgroup (although the oom crashes are awful) and the cpu cgroup. I attempted to run runsc under the pids cgroup, but that failed since a substantial part of the task count comes from runsc internals. I can't put runsc and its children under a pids cgroup because coming up with any number for pids.max is wrong. I want to express something like: the untrusted guest should not be able to run more than say 512 tasks (threads or processes). Gvisor right now does not have functionality for that, and because of the ptrace architectural limitations I can't technically put runsc and its children in a pids cgroup. Please either:
- provide a facility for me to limit the number of guest tasks
- or fix the internal ptrace runsc engine to have a predictable number of threads. In that case I'll put runsc and the guest processes in a pids cgroup and limit it to (tasks-needed-by-runsc-internals + 512)
https://github.com/google/gvisor/commits/test/cl336693030
@majek could you try out this change? This is an implementation of the second variant. With it, the ptrace platform creates two threads (tracee and tracer) for each guest thread, and two extra threads for each guest process to execute system calls.
As for the first variant, you can easily implement it by limiting the number of pids in the root pid namespace.
@avagin This change is awesome. The runsc thread count drops nicely after guest processes exit. On my box a new process in the guest (cat) creates 8-9 golang threads, which seems predictable and fine. I have not done benchmarks to assess the performance difference for multi-threaded programs.
Benchmark time! Unless I did something wrong with my benchmark, the patched branch seems 2x slower than master. The code is here: https://gist.github.com/majek/6ffba6d70b56a95eb36258be11955b97
It creates 100 threads, and each does mmap/munmap 20k times. I chose mmap because, AFAIU, it often goes to the host kernel - this syscall is often (always?) not terminated at the gvisor level.
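The gist above is the authoritative version; as a rough illustration of what the benchmark does, here is an approximate Go sketch (assumed shape: 100 OS-pinned workers, 20k mmap/munmap iterations each):

package main

import (
	"runtime"
	"sync"
	"syscall"
)

const (
	workers    = 100
	iterations = 20000
	mapSize    = 4096
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Pin to one OS thread to mimic "100 threads" rather than
			// 100 goroutines multiplexed onto a few threads.
			runtime.LockOSThread()
			for j := 0; j < iterations; j++ {
				mem, err := syscall.Mmap(-1, 0, mapSize,
					syscall.PROT_READ|syscall.PROT_WRITE,
					syscall.MAP_PRIVATE|syscall.MAP_ANONYMOUS)
				if err != nil {
					panic(err)
				}
				if err := syscall.Munmap(mem); err != nil {
					panic(err)
				}
			}
		}()
	}
	wg.Wait()
}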
runsc from master:
$ runsc --version
runsc version release-20201019.0-33-g293877cf647a
spec: 1.0.1-dev
$ sudo docker run --runtime=runsc --hostname=vanilla --rm -it xxx
root@vanilla:/# time /root/thread
real 2m0.686s
user 1m1.300s
sys 199m42.040s
Patched runsc
$ runsc-patched --version
runsc version release-20201019.0-2-g9e95a7463960
spec: 1.0.1-dev
$ sudo docker run --runtime=runsc-patched --hostname=patched --rm -it octane /bin/bash
root@patched:/# time /root/thread
real 4m41.731s
user 1m28.920s
sys 467m0.840s
The sys time looks off the charts, but the "real" time is indeed 2x worse in the patched runsc. The box is a 4-core (8 logical cpu) i7-7700HQ running Ubuntu on a 5.2 kernel.
The benchmark can probably be simplified; I haven't tried one thread.
OK, a bit more data from my machine if we run the same benchmark with 1 thread instead of 100.
runsc$ time ./main
real 0m4.986s
user 0m3.650s
sys 0m1.350s
runsc-avagin$ time ./main
real 0m8.511s
user 0m1.440s
sys 0m7.080s
So about 1.7x worse performance.
(with a 4-core Debian VM (Linux 5.8) on an Intel i7-8559U @ 2.70GHz)
So about 1.7x worse performance.
I think I fixed this issue. Could you check it again? Thanks!
This is very interesting, I get around:
- 94% of the default runtime (1 thread), so roughly a 6% performance increase over default gVisor
- 167% of the default runtime (100 threads), so roughly 1.6x worse than default gVisor
I ran them on a different machine this time (3770K, Linux 5.9) since the benchmarks inside the VM were very flaky. Here's a gist of the measurements.
Repeating my tests from above:
$ runsc-patched --version
runsc version release-20201019.0-70-g8176f6431ec8
spec: 1.0.1-dev
root@patched:/# time /root/thread
real 3m30.277s
user 1m29.110s
sys 347m4.160s
This looks like an improvement over the previous version, but still a perf hit compared to vanilla.
I warned you that there are reasons to fork this unreasonably large number of threads :). My change pins task goroutines to system threads, and so each switch between task goroutines requires a system context switch.
There is another idea for how to fix this issue: we can attach to and detach from a stub thread each time we need to execute something in its context.
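A minimal sketch of what that attach/detach idea could look like at the ptrace level (an illustration of the mechanism only, not gVisor code; the stub TID is assumed to be known, and error handling around waiting on thread tracees is simplified):

package stubattach

import (
	"fmt"
	"runtime"
	"syscall"
)

// withStub temporarily attaches the calling Sentry thread to the stub
// thread identified by stubTID, runs fn while the stub is stopped, and
// detaches again so any other Sentry thread can use the same stub later.
func withStub(stubTID int, fn func() error) error {
	// All ptrace requests for a tracee must come from one OS thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	if err := syscall.PtraceAttach(stubTID); err != nil {
		return fmt.Errorf("attach %d: %w", stubTID, err)
	}
	defer syscall.PtraceDetach(stubTID)

	// Wait for the attach-induced stop before operating on the stub.
	var ws syscall.WaitStatus
	if _, err := syscall.Wait4(stubTID, &ws, syscall.WALL, nil); err != nil {
		return fmt.Errorf("wait %d: %w", stubTID, err)
	}
	return fn()
}

Compared to keeping stubs permanently traced, this trades extra attach/detach (and wait) syscalls per operation for a bounded thread count.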
@avagin @amscanne I would love to know more about the new platform to replace ptrace. One of the things I have been struggling with is that even with kvm (which performs much better than ptrace), we still spend a lot of non-application CPU time, which seems quite wasteful.
For example, here is a container that runs chromium inside gvisor.
As you can see, system CPU is much higher than user CPU; I think we prefer the opposite. A 10s kvm stat record shows:
Analyze events for all VMs, all VCPUs:
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
HLT 41114 43.61% 98.38% 15.52us 1638547.94us 1413.78us ( +- 5.62% )
EPT_VIOLATION 30681 32.55% 0.63% 1.48us 5291.40us 12.04us ( +- 2.49% )
EXTERNAL_INTERRUPT 16379 17.37% 0.95% 1.01us 9098.23us 34.13us ( +- 7.59% )
EXCEPTION_NMI 5946 6.31% 0.04% 1.07us 7117.78us 4.44us ( +- 29.40% )
CPUID 66 0.07% 0.01% 1.28us 3052.47us 48.27us ( +- 95.75% )
PENDING_INTERRUPT 56 0.06% 0.00% 1.55us 11.02us 2.74us ( +- 9.36% )
PAUSE_INSTRUCTION 30 0.03% 0.00% 3.20us 8.13us 4.97us ( +- 3.82% )
Total Samples:94272, Total events handled time:59084468.87us.
I think HLT (and VM exits in general) are very expensive for the Sentry, but I have no idea how we can optimize that. Also, what is that EPT_VIOLATION and when does it happen?
Interesting, what is generating that table? I suspect that this is "time to service the exit", i.e. time between the exit and a subsequent KVM_RUN. In this case, the high time for HLT is not surprising, as the way the kvm platform works, the Sentry may not run that vCPU again for some time.
The other times seem crazy high though. EPT_VIOLATION should be the equivalent of get_user_pages, and it's weighing in at 12us average? That's a lot. Similar story for EXTERNAL_INTERRUPT and CPUID, both those times are large compared to what I would expect. Is there something else going on here? Is this running nested?
The user time vs system time is also very interesting -- the system time would not be accounted in the table there (but it does align with the huge times we see). Is there a perf top10 or something that says where the system time is going?
@amscanne Apologies, I ran the test on a test host that has KASAN enabled, so the numbers are very much inflated. We can discard the previous report. Here are some numbers that we captured on a production node.
The report is generated with:
$ sudo timeout 10 perf kvm stat record -p $PID
$ sudo perf kvm stat report
Analyze events for all VMs, all VCPUs:
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
hlt 42683 55.41% 99.88% 5.00us 4101811.63us 1730.63us ( +- 7.74% )
interrupt 25461 33.06% 0.07% 0.31us 1548.01us 1.92us ( +- 7.50% )
npf 8882 11.53% 0.06% 1.36us 591.95us 4.79us ( +- 2.32% )
Total Samples:77026, Total events handled time:73959951.08us.
I suspect that this is "time to service the exit", i.e. time between the exit and a subsequent KVM_RUN
You are right that this is the time between exit and entry (https://github.com/torvalds/linux/blob/master/tools/perf/arch/x86/util/kvm-stat.c#L23-L24 and https://github.com/torvalds/linux/blob/master/tools/perf/builtin-kvm.c#L75-L101). That explains such a high value, but I assume we would still want this to be as low as possible?
Attached is a flamegraph for the KVM task, captured with: sudo perf record -a -g --call-graph dwarf -F1000 -e cpu-clock:HG -p $RUNSC_SANDBOX_PID -- sleep 10