Sentry maps 68TB+ address space on Intel but not AMD with KVM
Description
I've been seeing a handful of strange OOM kills when gVisor hits a panic. The kernel stack trace leads me to believe the coredumper walking the process's pages is running out of memory in the cgroup:
```
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
dump_header+0x4a/0x1f3
oom_kill_process.cold+0xb/0x10
out_of_memory+0x1b9/0x4c0
mem_cgroup_out_of_memory+0x136/0x150
try_charge_memcg+0x6d3/0x790
? shmem_alloc_page+0x9a/0xe0
charge_memcg+0x40/0x90
__mem_cgroup_charge+0x2c/0x90
shmem_add_to_page_cache+0x162/0x360
shmem_getpage_gfp+0x2c8/0x7d0
shmem_fault+0x68/0x1f0
? filemap_map_pages+0x113/0x570
__do_fault+0x37/0x90
__handle_mm_fault+0xc15/0x1660
handle_mm_fault+0xbf/0x280
__get_user_pages+0x213/0x5f0
get_dump_page+0xb2/0x360
dump_user_range+0x74/0xb0
elf_core_dump+0xda3/0xeb0
do_coredump+0x1077/0x1690
get_signal+0x11e/0x8a0
arch_do_signal_or_restart+0xd3/0x660
? __seccomp_filter+0x4cc/0x5c0
exit_to_user_mode_prepare+0xc2/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x48/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x47e636
Tasks state (memory values in pages):
[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 361164] 0 361164 189523 1865 208896 0 900 exe
[ 361183] 65534 361183 17181898068 258296 2240512 0 900 exe
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=gv,mems_allowed=0-1,oom_memcg=/container/gv,task_memcg=/container/gv,task=exe,pid=361183,uid=65534
Memory cgroup out of memory: Killed process 361183 (exe) total-vm:68727592272kB, anon-rss:18820kB, file-rss:10852kB, shmem-rss:1003512kB, UID:65534 pgtables:2188kB oom_score_adj:900
```
The only processes in the cgroup are sentry and gofer, and the cgroup has a 1GB memory.max setting.
I was poking around because that 68727592272kB total-vm number seems really strange, and I've tracked it down to a difference between the AMD and Intel physical/virtual address bit widths for the KVM platform. I created a simple sleep container bundle with a minimal config.json that just runs "/sleep 30", and put a copy of a statically linked busybox at "/sleep" in the bundle's rootfs dir. I ran this container on both an Intel server (dual Xeon Silver 4116) and an AMD server (single EPYC 7642), then looked at /proc/PID/smaps for each sandbox process.
On the Intel machine I see this mapping right after all the vcpu maps:
```
3fb6ae31e000-7fb82e31e000 ---p 00000000 00:00 0
Size: 68725768192 kB
```
On the AMD machine, I don't see any maps of a concerning size, and this is the map in the position right after the vcpu maps:
```
7f72ed6e9000-7f72ed7b9000 rw-p 00000000 00:00 0
Size: 832 kB
```
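As a sanity check on the Intel numbers, the Size: line follows directly from the address range in smaps. This is just an illustrative helper (not gVisor code):

```go
package main

import "fmt"

// mappingKB returns the size in kB of a [start, end) smaps address range.
func mappingKB(start, end uint64) uint64 {
	return (end - start) / 1024
}

func main() {
	// The suspicious mapping from the Intel machine's smaps:
	// 3fb6ae31e000-7fb82e31e000 ---p 00000000 00:00 0
	fmt.Println(mappingKB(0x3fb6ae31e000, 0x7fb82e31e000), "kB") // prints 68725768192 kB
}
```

That's roughly 2^46 bytes (64 TiB), which lines up with the Intel server's 46-bit physical address space below.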
What would make gVisor on the Intel server allocate 68TB of virtual address space? Does the Intel server's 46-bit physical / 48-bit virtual address split make the algorithm in fillAddressSpace() do something strange? The AMD server reports 48 bits for both, so fillAddressSpace() does nothing there (vSize < pSize is true, so it bails early).
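For what it's worth, the arithmetic alone reproduces the 68TB figure. This is a simplified sketch of the check as I understand it, with the bit widths of the two machines hardcoded, not gVisor's actual fillAddressSpace() code:

```go
package main

import "fmt"

// dummyMappingSize sketches the fillAddressSpace arithmetic: the usable
// user half of the 2^virtBits virtual address space must fit inside the
// 2^physBits guest physical address space; any excess gets reserved as
// PROT_NONE placeholder mappings.
func dummyMappingSize(physBits, virtBits uint) uint64 {
	vSize := uint64(1) << (virtBits - 1) // user half of the virtual space
	pSize := uint64(1) << physBits
	if vSize < pSize {
		return 0 // AMD case: bails early, nothing to reserve
	}
	return vSize - pSize // Intel case: ~2^46 bytes of placeholders
}

func main() {
	fmt.Println(dummyMappingSize(46, 48)/1024, "kB") // Intel: ~68TB
	fmt.Println(dummyMappingSize(48, 48)/1024, "kB") // AMD: 0
}
```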
Steps to reproduce
- Set up a minimal OCI bundle that runs `sleep 30`
- Run the latest gVisor on both the Intel and AMD machines with the bundle
- While it's running, peek at /proc/PID/smaps to see a humongous map after the vcpu mappings on the Intel machine
runsc version
`release-20220713.0`
docker version (if using docker)
No response
uname
5.15.54
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
No response
> 3fb6ae31e000-7fb82e31e000 ---p 00000000 00:00 0

This is a PROT_NONE mapping, so it should not require any physical memory. It is just a placeholder.
It's not taking any physical memory; I'm trying to understand why this huge mapping is needed on Intel processors but not AMD. When gVisor panics/crashes, the kernel's coredump routines run the cgroup out of memory while dumping userspace memory (do_coredump -> dump_user_range -> charge_memcg -> mem_cgroup_out_of_memory in the stack trace above), so the process is OOM-killed mid-dump and we lose the coredump for the crash, making investigations more difficult with only the stack traces.
> It's not taking any physical memory; I'm trying to understand why this huge mapping is needed on Intel processors but not AMD.
We map the sentry virtual address space onto the guest physical address space. This means the sentry's host address space size has to be less than or equal to the guest physical address space.
You can find sizes of address spaces in /proc/cpuinfo:
```
$ cat /proc/cpuinfo | grep "address size" | head -n 1
address sizes : 46 bits physical, 48 bits virtual
```
On your AMD machine, the sizes of the physical and virtual address spaces are equal, which is why we don't need to create these dummy mappings.
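If you want to check a machine programmatically, the bit widths can be pulled out of that cpuinfo line. This is an illustrative helper under the assumption that the line has the usual "address sizes : N bits physical, M bits virtual" shape, not part of gVisor:

```go
package main

import (
	"fmt"
	"strings"
)

// parseAddressSizes extracts the physical and virtual address bit widths
// from the "address sizes" line of /proc/cpuinfo.
func parseAddressSizes(line string) (phys, virt int, err error) {
	parts := strings.SplitN(line, ":", 2)
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("unexpected cpuinfo line: %q", line)
	}
	_, err = fmt.Sscanf(parts[1], "%d bits physical, %d bits virtual", &phys, &virt)
	return phys, virt, err
}

func main() {
	// Xeon Silver 4116 reports 46/48; EPYC 7642 reports 48/48.
	phys, virt, err := parseAddressSizes("address sizes\t: 46 bits physical, 48 bits virtual")
	if err != nil {
		panic(err)
	}
	fmt.Println(phys, virt) // prints 46 48
}
```

With phys < virt (the Intel case) the sentry needs the placeholder mappings; with phys == virt (the AMD case) it does not.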
> When gVisor panics/crashes, the coredump routines in the kernel are getting OOM killed by the cgroup limits while dumping userspace memory (do_coredump -> dump_user_range -> charge_memcg -> mem_cgroup_out_of_memory in the stack trace above), which means we lose the coredump for the crash, making investigations more difficult with only the stack traces.
I understand the problem, but I am not sure that this is happening due to these dummy mappings. dump_user_range skips unmapped pages: https://elixir.bootlin.com/linux/latest/source/fs/coredump.c#L885.
A friendly reminder that this issue had no activity for 120 days.
This issue has been closed due to lack of activity.