
Runsc creates an unreasonably large number of golang worker threads, eventually hitting any pids cgroup limit


Description

Runsc/gvisor leaves old threads in a stuck state. This leads to exceeding any sane pids cgroup quota, and is generally bad.

Steps to reproduce

  1. Gvisor version
$ strings /usr/local/bin/runsc|grep release-
release-20200907.0-44-g7e0b019b84b9
  2. Run it as docker thingy
$ docker run --runtime=runsc --rm -it ubuntu /bin/bash
root@329dded2d7e7:/# 
  3. Inspect the cgroup
$ cat /sys/fs/cgroup/pids/docker/329dded2d7e7*/tasks|wc -l
70
  4. Launch a command inside docker
root@329dded2d7e7:/# ls
  5. Check the cgroup again
$ cat /sys/fs/cgroup/pids/docker/329dded2d7e7*/tasks|wc -l
72

The number of threads should not grow; old threads should be reaped. Specifically, you will notice that the number of threads in the "tracing stop" state keeps growing:

$ for i in `cat ./tasks`; do cat /proc/$i/status; echo; done|grep State|sort|uniq -c
     46 State:	S (sleeping)
     41 State:	t (tracing stop)
$ for i in `cat ./tasks`; do cat /proc/$i/status; echo; done|grep State|sort|uniq -c
     46 State:	S (sleeping)
     44 State:	t (tracing stop)

This is an example kernel stack:

[<0>] ptrace_stop+0x13f/0x2d0
[<0>] get_signal+0x6b5/0xd50
[<0>] do_signal+0x34/0x900
[<0>] __prepare_exit_to_usermode+0x125/0x200
[<0>] prepare_exit_to_usermode+0x9/0x30
[<0>] idtentry_exit_user+0x40/0x60
[<0>] exc_int3+0x74/0x220
[<0>] asm_exc_int3+0x31/0x40

Environment

Please include the following details of your environment:

  • runsc -version
marek@mrnew:~$ /usr/local/bin/runsc -version
runsc version release-20200907.0-44-g7e0b019b84b9
spec: 1.0.1-dev

majek avatar Sep 15 '20 10:09 majek

FWIW, it shouldn't be runsc -v, it should be runsc -version. I'll update the issue template.

ianlewis avatar Sep 15 '20 11:09 ianlewis

You use the ptrace platform. This is a known issue. Each time the Sentry resumes a guest process, it checks whether the current system thread of the Sentry process is tracing the guest process; if it isn't, it creates a new guest thread and starts tracing it: https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/ptrace/subprocess.go;l=88?q=lookupOrCreate&ss=gvisor%2Fgvisor

So the maximum number of guest pids is the product of the number of Sentry threads and the number of guest processes.
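Roughly, the lookup is a per-subprocess map keyed by the host thread id. Below is a simplified sketch of that lookup-or-create pattern, illustrative only and not the actual gvisor code (the type and function names are made up):

package ptracepool

import "sync"

// stubThread stands in for a ptrace-attached stub thread inside the guest
// process. Illustrative only; not the real gvisor type.
type stubThread struct {
    hostTID int32 // the Sentry system thread that traces this stub
}

// threadPool is a simplified per-subprocess pool: at most one stub per host
// (Go runtime) system thread that has ever resumed this guest process.
type threadPool struct {
    mu      sync.Mutex
    threads map[int32]*stubThread
}

// lookupOrCreate returns the stub traced by the current host thread, creating
// a new one (and starting to trace it) if this host thread has never resumed
// this guest process before.
func (p *threadPool) lookupOrCreate(currentTID int32, newStub func() *stubThread) *stubThread {
    p.mu.Lock()
    defer p.mu.Unlock()
    if t, ok := p.threads[currentTID]; ok {
        return t
    }
    t := newStub() // in the real code: clone a stub and PTRACE_ATTACH to it
    t.hostTID = currentTID
    p.threads[currentTID] = t
    return t
}

Because the pool is keyed by the host TID and there is one pool per guest process, every (host thread, guest process) pair can end up with its own stub.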

avagin avatar Sep 15 '20 16:09 avagin

(Screenshot "2020-09-15 09-43-53" omitted; see https://www.infoq.com/presentations/gvisor-os-go/)

avagin avatar Sep 15 '20 16:09 avagin

@avagin said:

So the maximum number of guest pids is the product of the number of Sentry threads and the number of guest processes.

I don't follow. If I start 2000 processes in the container, one after another, not concurrently, and all of them exit, why do I end up with more than 2000 threads? If I start another 2000 processes, should I expect 4000 threads?

I don't mind the number of threads being proportional to the number of currently running guest threads. I do mind the number of threads being proportional to the number of processes ever created during the lifetime of the container. Is that the case?

majek avatar Sep 15 '20 18:09 majek

I don't follow. If I start 2000 processes in the container, one after another, not concurrently, and all of them exit, why do I end up with more than 2000 threads? If I start another 2000 processes, should I expect 4000 threads?

This is strange. I tried to reproduce your case, but I failed. I have only 119 pids in a container cgroup.

T1: $ docker run --device /dev/fuse --runtime runsc --privileged -it --rm b4c85473e5d0 bash

T2: $ cat /sys/fs/cgroup/pids//docker/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775/tasks | wc -l
48

T1: root@8da2bd97657b:/fuse-test# for i in `seq 2000`; do sh -c true; done

T2: $ cat /sys/fs/cgroup/pids//docker/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775/tasks | wc -l
119

T2: $ ps axf

   3517 ?        Ssl    4:12 /usr/bin/containerd
3463297 ?        Sl     0:00  \_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtim
3463312 ?        Sl     0:00      \_ runsc-gofer --root=/var/run/docker/runtime-runsc/moby --debug=true --log=/run/containerd/io.containerd.runtime.v1.linux/moby/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775/log.json --log-format=json --debug-log=/tmp/runsc/logs/
3463316 pts/0    Ssl+   0:37      \_ runsc-sandbox --root=/var/run/docker/runtime-runsc/moby --debug=true --log=/run/containerd/io.containerd.runtime.v1.linux/moby/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775/log.json --log-format=json --debug-log=/tmp/runsc/log
3463332 ?        tsl    0:00          \_ [exe]
3463392 ?        tl     0:05              \_ [exe]
3463611 ?        tl     0:06              \_ [exe]
3463617 ?        tl     0:06              \_ [exe]
avagin@avagin:~$ ls -l /proc/3463617/task/ | wc -l
22
avagin@avagin:~$ ls -l /proc/3463611/task/ | wc -l
22
avagin@avagin:~$ ls -l /proc/3463392/task/ | wc -l
24
avagin@avagin:~$ ls -l /proc/3463332/task/ | wc -l
6
avagin@avagin:~$ ls -l /proc/3463316/task/ | wc -l
27
avagin@avagin:~$ ls -l /proc/3463312/task/ | wc -l
24
avagin@avagin:~$ ls -l /proc/3463297/task/ | wc -l
11

avagin avatar Sep 16 '20 08:09 avagin

Yeah, so you are saying the number of gvisor threads is capped. When I opened this ticket I thought it was not capped.

What happens in my case is: I set a pids.max limit in the cgroup and of course gvisor eventually hits it. I'm now trying to trace just what the limit is, and under which gvisor version it's reached the fastest. I think we had cases where even a modest limit (say 10k pids) was reached with a guest program running just a few dozen threads. I wonder if the cpu pinning (removed a few days ago in f3172c3a11dda8aa21bbaf51b7f63196592cf1da; I often work on 96-logical-cpu machines, so maybe that has some side effect) or reaching cfs_cpu_quota could have implications for internal gvisor thread reuse.

The problem I want to solve: I want the guest to have a limited number of pids. I want to make sure the OS is not going to see the effects of a fork bomb inside the container. I'm trying to solve it with the cgroup pids.max. Somewhat related: https://github.com/google/gvisor/issues/2489
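For reference, the cgroup side of this is just writing a number into the container's pids.max file. A minimal Go sketch of that step, assuming the cgroup v1 path layout shown above (the helper name is made up):

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// setPidsMax writes a limit into a pids cgroup's pids.max file, e.g.
// /sys/fs/cgroup/pids/docker/<container-id>/pids.max (cgroup v1 layout, as
// used elsewhere in this issue). Path and permissions are assumptions.
func setPidsMax(cgroupDir string, limit int) error {
    p := filepath.Join(cgroupDir, "pids.max")
    return os.WriteFile(p, []byte(fmt.Sprintf("%d\n", limit)), 0o644)
}

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: setpidsmax <pids-cgroup-dir>")
        os.Exit(2)
    }
    if err := setPidsMax(os.Args[1], 512); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}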

majek avatar Sep 16 '20 08:09 majek

Ok, I think I have some example. Here is the scenario:

$ /usr/local/bin/runsc -version
runsc version release-20200907.0-44-g7e0b019b84b9

$ docker run --runtime=runsc --rm -it ubuntu /bin/bash
root@46a0985cd38c:/# while [ 1 ]; do ls > /dev/null; done

So here you have it. Only a couple of processes running concurrently (bash, ls, that's it?). If you count the tasks, the number will eventually stabilize at 265. Well, a lot, but I won't despair over it:

$ cat /sys/fs/cgroup/pids/docker/46a0*/tasks|wc -l
265

But now, when I repin them to the 0th cpu, the number jumps to 616.

$ sudo bash -c '(for i in `cat /sys/fs/cgroup/pids/docker/46a0*/tasks`; do taskset -c -p 0 $i; done)'
$ cat /sys/fs/cgroup/pids/docker/46a0*/tasks|wc -l
616

This goes into insane territory. Do we really need 562 idling OS threads to handle one running ls process? The machine has 48 logical cpus and I think the cpu count makes a difference. On another box with 96 logical cpus the number jumped from 241 to 496 on repin.

Another example:

$ docker run --runtime=runsc --rm -it ubuntu /bin/bash
root@0d3cf50df413:/# apt update; apt -y install psmisc
root@0d3cf50df413:/# for j in `seq 20`; do for i in `seq 20` ; do (cat /dev/urandom > /dev/null & ); done; sleep 5; killall cat; done
...
$ cat /sys/fs/cgroup/pids/docker/0d3*/tasks|wc -l
2250

majek avatar Sep 18 '20 13:09 majek

Citing @prattmic from chat:

The tracing stop threads are unused stub threads that are eligible to run user code (stub threads within the same process are eligible only to run user code for user threads in the same process, or a completely unused stub process is eligible to be used for a new user process). We don't currently explicitly try to hard cap the number of stubs we may create (this is something on the roadmap). However, we do try to reuse aggressively, so a well-behaved application (like ls in a loop in your example) shouldn't be creating such a huge number of threads. I believe that is a bug.

majek avatar Sep 19 '20 11:09 majek

If I read it right, it is quadratic: number-of-guest-pids * number-of-host-golang-threads.

The host Go runtime can have a pretty much unlimited number of threads (I think 10k is the default max). Then when we context switch to the guest, we take the host thread id, like here: https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/ptrace/subprocess.go;l=503?q=lookupOrCreate&ss=gvisor%2Fgvisor

func (s *subprocess) switchToApp(c *context, ac arch.Context) bool {
	currentTID := int32(procid.Current())
	t := s.sysemuThreads.lookupOrCreate(currentTID, s.newThread)

Notice, AFAIU procid.Current is the tid of the host-side thread. Then we look up the guest thread counterpart, and this would have been fine, but s.sysemuThreads.lookupOrCreate is scoped per subprocess, i.e. the guest thread.

So in worst case each host thread will have at least one (at most two) threads per each guest thread. Did I read it right?

majek avatar Sep 21 '20 08:09 majek

So in worst case each host thread will have at least one (at most two) threads per each guest thread. Did I read it right?

This isn't quite right. First, the threadPool is per-process, not per-thread, so in the worst case each host thread will have one thread per guest PROCESS.

I don't understand when each host thread can have TWO threads per each guest thread.

avagin avatar Sep 21 '20 16:09 avagin

(1) you are right, let me revise what I said:

So in worst case each host thread will have at least one (at most two) threads per each guest process.

So, for a guest with say 100 processes and a host with say 100 golang threads, we could see up to 20k worker threads on the gvisor side.

(2) One thread in s.sysemu threadPool and another thread in s.syscallThreads pool. Or did I read it wrong?
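To put rough numbers on (1) and (2): with two pools, the worst case is about two stubs per (host thread, guest process) pair. A back-of-the-envelope check in Go, using the 100/100 example above (the numbers are just the example, not measurements):

package main

import "fmt"

func main() {
    const (
        hostThreads    = 100 // Go runtime system threads in the Sentry (example)
        guestProcesses = 100 // processes inside the sandbox (example)
        poolsPerPair   = 2   // one sysemu stub + one syscall stub, per the discussion
    )
    // Worst case: every host thread eventually resumes every guest process,
    // creating a stub in each pool for each pairing.
    fmt.Println(poolsPerPair * hostThreads * guestProcesses) // 20000
}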

majek avatar Sep 21 '20 17:09 majek

(2) One thread in s.sysemu threadPool and another thread in s.syscallThreads pool. Or did I read it wrong?

You are right. I forgot that we have a separate pool for host syscalls.

avagin avatar Sep 21 '20 18:09 avagin

I see you changed the subject. I don't think "unreasonably" is a good word in this context. You need to understand why it works this way. The kernel ptrace API works on the thread level, and one thread can be traced by only one other thread. The Go runtime creates a number of system threads, and each goroutine can run in the context of any of these system threads. This means that if subprocess.Syscall is called from system thread A, creates a stub thread Z, starts tracing it, and executes a system call in its context, then when subprocess.Syscall is later called from system thread B it will not be able to reuse stub thread Z, because Z is already traced by system thread A.

This is the reason why the Sentry needs to create many threads. You could try to fix this issue by pinning goroutines to system threads, but I am not sure that this would not affect performance. Or you can try the kvm platform, which doesn't have this problem.
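For what it's worth, pinning a goroutine to its system thread is what runtime.LockOSThread does. A minimal sketch of just the pinning mechanism, not the gvisor platform itself (the helper is made up; Linux-only because of Gettid):

package main

import (
    "fmt"
    "runtime"
    "sync"
    "syscall"
)

// runPinned runs fn on a goroutine locked to a single OS thread, so every
// ptrace operation fn performs would come from the same tracer TID.
func runPinned(fn func(hostTID int)) {
    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
        defer wg.Done()
        runtime.LockOSThread()
        defer runtime.UnlockOSThread()
        fn(syscall.Gettid())
    }()
    wg.Wait()
}

func main() {
    runPinned(func(hostTID int) {
        // A real platform would attach to and resume a stub from this thread;
        // here we only show that the goroutine keeps a stable host TID.
        fmt.Println("task goroutine pinned to host TID", hostTID)
    })
}

The trade-off, as comes up later in the thread, is that every switch between pinned task goroutines then costs an OS-level context switch.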

avagin avatar Sep 29 '20 21:09 avagin

Here is an example of how we can reduce the number of subprocess.syscallThreads: https://github.com/avagin/gvisor/commit/5254cf371696682c4eae46fec119f394bdc31400

The same approach can be implemented for the sysemuThreads pool, but this will be a bit more complicated.

avagin avatar Sep 29 '20 21:09 avagin

I will push back. Gvisor is a security isolation layer. A fork bomb is a common problem. Right now the only way to handle a fork bomb is to put gvisor into a pids cgroup and set pids.max to a reasonable value that will not overwhelm the system. This property of gvisor, requiring a quadratic number of OS threads for the runsc process, makes that approach infeasible.

There are two direct questions:

  • how to correctly solve fork bomb problem? How to ensure the untrusted user doesn't start 200k processes?
  • does having plenty of runsc threads affect performance? In my old school system programming mind having many threads is a waste of resources.

majek avatar Oct 09 '20 16:10 majek

  • how to correctly solve fork bomb problem? How to ensure the untrusted user doesn't start 200k processes?

First of all, you should not use the ptrace platform in this case. It was not designed to cover this case. You can use the kvm platform and limit the number of Go runtime threads with the cgroup pids.max.

  • does having plenty of runsc threads affect performance? In my old school system programming mind having many threads is a waste of resources.

The world isn't just black or white. I already explained that the ptrace kernel interface has limitations: any thread can be traced by only one other thread. For the ptrace platform, this means that we have two choices:

  • We can pin task goroutines to system threads, but this will prevent effective scheduling of these goroutines.
  • We can fork a new stub thread for each Go runtime system thread. This is how it works now.

ptrace was designed for testing and quick experiments. It works fine for workloads with a small number of tasks. For testing needs, the behavior should be close to the other gvisor platforms; this is probably the reason the variant with pinned task goroutines has not been implemented.

@majek do you have a real use case where you need to use the ptrace platform? Maybe you can describe it and we will better understand your problem and ways to solve it. I implemented a POC version of the first variant with pinned goroutines (https://github.com/google/gvisor/pull/4487/commits/fd14898a87b9bd2395147201ed89bfb1fdd5ea7f); you can try to play with it.

avagin avatar Oct 12 '20 18:10 avagin

I must be missing something larger here. KVM was experimental until very recently. People built products on ptrace. ARM kvm support is not finished (https://github.com/google/gvisor/issues/63). Are you saying that ptrace is not production ready? Did I miss some big platform discussion in the documentation?

majek avatar Oct 12 '20 19:10 majek

Ptrace is solid, notwithstanding the noted requirement that a containing cgroup is used appropriately. (But the sandbox should be contained in appropriate CPU, memory, etc. cgroups regardless, to prevent exhausting other kinds of resources.) I think that Andrei is just saying that ptrace may not scale well for every case (e.g. a massively multithreaded web application may not play well with the pid semantics).

But let's get back on track. First, I totally agree with the fork bomb complaint. Even with a proper pid limit, it will generally manifest as the Go runtime failing to create a new thread internally and dying. This is not acceptable and a flaw worth fixing. I'm not sure that the existence of many OS-level threads is an actual concern, but the ergonomics are definitely bad.

There are two tracks to pursue here. First, @avagin is actually working on a new platform designed to replace ptrace (runs anywhere) that is faster and doesn't suffer from this problem, so you should connect with him. It would be great to understand your use case to help ensure that the platform is addressing the core problems. Maybe a community meeting?

Second, that might take a little while, so we should improve this issue for ptrace in the meantime. But it's unclear to me: do you want the n^2 to be limited to n (which can be accomplished with the pinning, which we could integrate for ptrace at least), or should it actually be limited to a fixed number (which would require some kind of Switch() admission control... also doable)? I think this is why Andrei is asking about the use case, to understand what the desired state is.

amscanne avatar Oct 12 '20 20:10 amscanne

But it's unclear to me: do you want the n^2 to be limited to n ... asking about the use case

I'm running untrusted workloads. Many of them. As an operator I would like to guarantee that a single workload won't be too large a burden on the kernel scheduler, cgroups machinery, etc. Therefore I am limiting the runsc containers with cgroups. I am successfully using the memory cgroup (although the oom crashes are awful) and the cpu cgroup. I attempted to run runsc under the pids cgroup, but that failed since a substantial part of the task count comes from runsc internals. I can't put runsc and its children under a pids cgroup because coming up with any number for pids.max is wrong. I want to express something like: the untrusted guest should not be able to run more than, say, 512 tasks (threads or processes). Gvisor right now does not have functionality for that, and because of ptrace architectural limitations I can't technically put runsc and its children in a pids cgroup. Please either:

  • provide a facility for me to limit the number of guest tasks
  • or fix the internal ptrace runsc engine to have a predictable number of threads. In that case I'll put runsc and the guest processes in a pids cgroup and limit it to (tasks-needed-by-runsc-internals + 512)

majek avatar Oct 13 '20 09:10 majek

https://github.com/google/gvisor/commits/test/cl336693030

@majek could you try out this change? This is an implementation of the second variant. With it, the ptrace platform creates two threads (tracee and tracer) for each guest thread, and two extra threads for each guest process to execute system calls.

As for the first variant, you can easily implement it by limiting the number of pids in the root pid namespace.

avagin avatar Oct 19 '20 20:10 avagin

@avagin This change is awesome. The runsc thread count falls nicely after guest processes exit. On my box a new process in the guest (cat) creates 8-9 golang threads, which seems predictable and fine. I have not done benchmarks to assess the performance difference for multi-threaded programs.

majek avatar Oct 22 '20 11:10 majek

Benchmark time! Unless I did something wrong with my benchmark, the patched branch seems 2x slower than master. The code is here: https://gist.github.com/majek/6ffba6d70b56a95eb36258be11955b97

It creates 100 threads, and each does mmap/munmap 20k times. I chose mmap because AFAIU it often goes to the host kernel - this syscall is often (always?) not terminated at the gvisor level.
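The gist has the actual benchmark; for illustration, a rough Go equivalent of the same workload might look like this (a sketch only, not the gist's code):

package main

import (
    "runtime"
    "sync"
    "syscall"
)

const (
    numThreads = 100   // workers; each is locked to its own OS thread below
    iterations = 20000 // mmap/munmap pairs per worker
    mapSize    = 4096
)

func main() {
    var wg sync.WaitGroup
    for i := 0; i < numThreads; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Lock each worker to an OS thread to mimic real threads.
            runtime.LockOSThread()
            defer runtime.UnlockOSThread()
            for j := 0; j < iterations; j++ {
                // mmap/munmap typically has to reach the host kernel,
                // which is why it was chosen for this benchmark.
                b, err := syscall.Mmap(-1, 0, mapSize,
                    syscall.PROT_READ|syscall.PROT_WRITE,
                    syscall.MAP_ANON|syscall.MAP_PRIVATE)
                if err != nil {
                    panic(err)
                }
                if err := syscall.Munmap(b); err != nil {
                    panic(err)
                }
            }
        }()
    }
    wg.Wait()
}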

runsc from master:

$ runsc --version
runsc version release-20201019.0-33-g293877cf647a
spec: 1.0.1-dev

$ sudo docker run --runtime=runsc --hostname=vanilla --rm -it xxx 
root@vanilla:/# time /root/thread 

real	2m0.686s
user	1m1.300s
sys	199m42.040s

Patched runsc

$ runsc-patched --version
runsc version release-20201019.0-2-g9e95a7463960
spec: 1.0.1-dev

$ sudo docker run --runtime=runsc-patched --hostname=patched --rm -it octane /bin/bash
root@patched:/# time /root/thread 

real	4m41.731s
user	1m28.920s
sys	467m0.840s

The sys time looks suspiciously high, but the "real" time is indeed 2x worse with the patched runsc. The box is a 4-core (8 logical cpu) i7-7700HQ running Ubuntu on a 5.2 kernel.

The benchmark could probably be simplified; I haven't tried one thread.

majek avatar Oct 27 '20 13:10 majek

OK, a bit more data from my machine if we run the same benchmark with 1 thread instead of 100.

runsc$ time ./main
real    0m4.986s
user    0m3.650s
sys     0m1.350s
runsc-avagin$ time ./main
real    0m8.511s
user    0m1.440s
sys     0m7.080s

So about 1.7x worse performance.

(with 4 core Debian VM (Linux 5.8) on Intel i7-8559U @ 2.70GHz)

lnsp avatar Oct 27 '20 16:10 lnsp

So about 1.7x worse performance.

I think I fixed this issue. Could you check it again? Thanks!

avagin avatar Oct 27 '20 22:10 avagin

This is very interesting. I get around:

  • 94% of default performance (1 thread), so 6% performance increase over default gVisor
  • 167% of default performance (100 threads), so 1.6x worse than default gVisor

I ran them on a different machine this time (3770K, Linux 5.9) since the benchmarks inside the VM were very flaky. Here's a gist of the measurements.

lnsp avatar Oct 29 '20 10:10 lnsp

Repeating my tests from above:

$ runsc-patched --version
runsc version release-20201019.0-70-g8176f6431ec8
spec: 1.0.1-dev

root@patched:/# time /root/thread

real	3m30.277s
user	1m29.110s
sys	347m4.160s

This looks like an improvement over the previous version, but still a perf hit compared to vanilla.

majek avatar Oct 30 '20 13:10 majek

I warned you that there are reasons to fork this unreasonably large number of threads :). My change pins task goroutines to system threads, and so each switch between task goroutines requires a system context switch.

There is another idea for how to fix this issue: we can attach to and detach from a stub thread each time we need to execute something in its context.
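A rough illustration of that attach/detach-per-operation idea, using the raw ptrace wrappers from the Go syscall package (a sketch only; the helper and the stub TID are made up, error handling is minimal):

package main

import (
    "fmt"
    "os"
    "syscall"
)

// withStub attaches to an existing stub thread, runs work while the stub is
// stopped, and detaches again so any other Sentry thread can use it next.
func withStub(stubTID int, work func(tid int) error) error {
    if err := syscall.PtraceAttach(stubTID); err != nil {
        return fmt.Errorf("attach %d: %w", stubTID, err)
    }
    defer syscall.PtraceDetach(stubTID)
    // Wait for the attach-induced stop before touching the stub
    // (a real tracer would also pass __WALL here).
    var ws syscall.WaitStatus
    if _, err := syscall.Wait4(stubTID, &ws, 0, nil); err != nil {
        return fmt.Errorf("wait %d: %w", stubTID, err)
    }
    return work(stubTID)
}

func main() {
    // Hypothetical stub TID; in the real platform it would come from a pool.
    if err := withStub(12345, func(tid int) error {
        // e.g. set registers and resume the stub in a real implementation.
        return nil
    }); err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
}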

avagin avatar Nov 03 '20 21:11 avagin

@avagin @amscanne I would love to know more about the new platform to replace ptrace. One of the things I have been struggling with is that even with kvm (which performs much better than ptrace), we still spend a lot of non-application CPU time, which seems quite wasteful.

For example, here is a container that runs chromium inside gvisor (screenshot "2020-11-30 14-24-01" omitted).

As you can see, system CPU is much higher than user CPU; I think we'd prefer the opposite. A 10s kvm stat record shows:

Analyze events for all VMs, all VCPUs:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

                 HLT      41114    43.61%    98.38%     15.52us 1638547.94us   1413.78us ( +-   5.62% )
       EPT_VIOLATION      30681    32.55%     0.63%      1.48us   5291.40us     12.04us ( +-   2.49% )
  EXTERNAL_INTERRUPT      16379    17.37%     0.95%      1.01us   9098.23us     34.13us ( +-   7.59% )
       EXCEPTION_NMI       5946     6.31%     0.04%      1.07us   7117.78us      4.44us ( +-  29.40% )
               CPUID         66     0.07%     0.01%      1.28us   3052.47us     48.27us ( +-  95.75% )
   PENDING_INTERRUPT         56     0.06%     0.00%      1.55us     11.02us      2.74us ( +-   9.36% )
   PAUSE_INSTRUCTION         30     0.03%     0.00%      3.20us      8.13us      4.97us ( +-   3.82% )

Total Samples:94272, Total events handled time:59084468.87us.

I think HLT (and VM exits in general) are very expensive for the sentry, but I have no idea how we can optimize that. Also, what is that EPT_VIOLATION and when does it happen?

dqminh avatar Nov 30 '20 14:11 dqminh

Interesting, what is generating that table? I suspect that this is "time to service the exit", i.e. the time between the exit and a subsequent KVM_RUN. In that case, the high time for HLT is not surprising: given the way the kvm platform works, the Sentry may not run that vCPU again for some time.

The other times seem crazy high though. EPT_VIOLATION should be the equivalent of get_user_pages, and it's weighing in at 12us average? That's a lot. Similar story for EXTERNAL_INTERRUPT and CPUID, both those times are large compared to what I would expect. Is there something else going on here? Is this running nested?

The user time vs system time is also very interesting -- the system time would not be accounted in the table there (but it does align with the huge times we see). Is there a perf top10 or something that says where the system time is going?

amscanne avatar Nov 30 '20 18:11 amscanne

@amscanne Apologies, I ran the test on a test host that has KASAN enabled, so the numbers are very much inflated. We can discard the previous report. Here are some numbers that we captured on a production node.

(Screenshot "2020-11-30 23-36-22" omitted.)

The report is generated with:

$ sudo timeout 10 perf kvm stat record -p $PID
$ sudo perf kvm stat report

Analyze events for all VMs, all VCPUs:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

                 hlt      42683    55.41%    99.88%      5.00us 4101811.63us   1730.63us ( +-   7.74% )
           interrupt      25461    33.06%     0.07%      0.31us   1548.01us      1.92us ( +-   7.50% )
                 npf       8882    11.53%     0.06%      1.36us    591.95us      4.79us ( +-   2.32% )

Total Samples:77026, Total events handled time:73959951.08us.

I suspect that this is "time to service the exit", i.e. time between the exit and a subsequent KVM_RUN

You are right that this is the time between exit and entry (https://github.com/torvalds/linux/blob/master/tools/perf/arch/x86/util/kvm-stat.c#L23-L24 and https://github.com/torvalds/linux/blob/master/tools/perf/builtin-kvm.c#L75-L101), which then makes sense for such a high value, but I assume we would still want this to be as low as possible?

Attached is a flamegraph for KVM task, captured with sudo perf record -a -g --call-graph dwarf -F1000 -e cpu-clock:HG -p $RUNSC_SANDBOX_PID -- sleep 10

flamegraph.cpuclock.zip (screenshot "2020-12-01 00-32-03" omitted)

dqminh avatar Dec 01 '20 00:12 dqminh