firecracker
`deflate_on_oom` doesn't seem to work as expected/documented
After reading the Ballooning documentation, my understanding of `deflate_on_oom` is that, if the parameter is set to true, the balloon device will be deflated automatically whenever a process in the guest requires memory pages that cannot otherwise be provided:
> deflate_on_oom: if this is set to true and a guest process wants to allocate some memory which would make the guest enter an out-of-memory state, the kernel will take some pages from the balloon and give them to said process
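For completeness, the balloon in this experiment was configured pre-boot through Firecracker's HTTP API; a minimal sketch, assuming the API socket lives at `/tmp/firecracker.socket` (adjust to your `--api-sock` value):

```shell
# Configure a 900 MiB balloon with deflate_on_oom enabled before booting
# the guest; stats_polling_interval_s > 0 enables /balloon/statistics.
curl --unix-socket /tmp/firecracker.socket -X PUT "http://localhost/balloon" \
  -H "Content-Type: application/json" \
  -d '{"amount_mib": 900, "deflate_on_oom": true, "stats_polling_interval_s": 1}'
```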
However, if I run Firecracker with e.g. 2 vCPUs, 1 GiB of memory, and a balloon device of 900 MiB:
```json
{
  "target_pages": 230400,
  "actual_pages": 230400,
  "target_mib": 900,
  "actual_mib": 900,
  "swap_in": 0,
  "swap_out": 0,
  "major_faults": 92,
  "minor_faults": 3103,
  "free_memory": 66572288,
  "total_memory": 84398080,
  "available_memory": 0,
  "disk_caches": 151552,
  "hugetlb_allocations": 0,
  "hugetlb_failures": 0
}
```
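As a sanity check, the `target_pages` value is just the MiB target expressed in 4 KiB pages:

```shell
# 900 MiB expressed as 4 KiB pages matches target_pages above
echo $((900 * 1024 * 1024 / 4096))   # → 230400
```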
...and then try to start a Java process in the guest with `-Xms800m -Xmx800m` (i.e. with a heap size of 800 MiB), the Java process in the guest hangs, Firecracker uses ~200% CPU time, but the actual size occupied by the balloon device in the guest does not change and remains at 900 MiB:
```
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2552550 xxxxxxxx 20 0 1063092 82992 82096 R 99,9 0,3 3:44.76 fc_vcpu 1
2552544 xxxxxxxx 20 0 1063092 82992 82096 R 90,9 0,3 3:12.17 firecracker
2552549 xxxxxxxx 20 0 1063092 82992 82096 S 25,0 0,3 0:43.47 fc_vcpu 0
```
Once I reset the target size of the balloon device to 100 MiB, the Java process unblocks and starts.
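For reference, resetting the target is a single PATCH on the balloon resource; a sketch assuming the same example socket path as above:

```shell
# Lower the balloon target from 900 MiB to 100 MiB at runtime
curl --unix-socket /tmp/firecracker.socket -X PATCH "http://localhost/balloon" \
  -H "Content-Type: application/json" \
  -d '{"amount_mib": 100}'
```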
However, based on the documentation of the deflate_on_oom option, shouldn't the guest kernel deflate the balloon device automatically when deflate_on_oom=true?
If I run the same experiment with deflate_on_oom=false, I instantly get an out-of-memory error when trying to start the Java process:
```
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000ce000000, 279576576, 0) failed; error='Not enough space' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 279576576 bytes for committing reserved memory.
```
which is what I would have expected.
Also, if I increase (i.e. inflate) the balloon back to 900 MiB after the Java process has started, I get warnings from the balloon driver (as documented):
```
[  282.580254] virtio_balloon virtio0: Out of puff! Can't get 1 pages
```
...but the CPU usage again goes up to almost ~200%. Is this expected? The warnings are fine, but I wouldn't expect Firecracker to burn all of its CPU shares while trying to inflate the balloon.
So to summarize, is the described behavior with deflate_on_oom=true a bug in the implementation or have I misunderstood the behavior of the ballooning device in the event of low memory in the guest?
PS: I've used the following kernel and Firecracker versions for the experiments:
Guest kernel: 5.19.8
Host kernel: 6.5.7 (Ubuntu 20.04)
Firecracker: 1.5.1 and 1.6.0-dev (from today, 036d9906)
Hi Volker,
Thanks for reporting this. We will take a look and try to reproduce it, but in the meantime I'd like to point out that this configuration:
> Guest kernel: 5.19.8, Host kernel: 6.5.7 (Ubuntu 20.04)
is not supported. Would you be able to try to reproduce this with a supported set of host/guest kernels? What we test with is guest x host = [4.14, 5.10] x [4.14, 5.10, 6.1] (guest 6.1 might work too).
Also, to answer your question:
> So to summarize, is the described behavior with deflate_on_oom=true a bug in the implementation or have I misunderstood the behavior of the ballooning device in the event of low memory in the guest?
This should work, and we have tests that indicate it does work, i.e. the balloon gets deflated. However, we do not track the CPU time consumed to achieve this.
Hi @simonis, is the answer that @bchalios provided enough to resolve your issue, or is there anything else to investigate?
Sorry for the late answer @bchalios, @pb8o. I finally managed to run my experiments on a "supported" platform, but unfortunately the results are exactly the same.
Host: 6.1.72-96.166.amzn2023.x86_64
Guest: 6.1.74 (with the config from microvm-kernel-ci-x86_64-6.1.config plus CONFIG_IP_PNP=y)
Firecracker: v1.6.0 and v1.7.0-dev (from today, 49db07b3)
So, to summarize the problem: when I start Firecracker with a large balloon device and deflate_on_oom: true, and then try to start a process in the guest which requires memory reserved by the balloon, the guest seems to hang and the Firecracker threads on the host run at 100% CPU:
```
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
298704 ec2-user 20 0 1058260 85544 84964 R 57.9 0.0 5:45.06 fc_vcpu 1
298703 ec2-user 20 0 1058260 85544 84964 R 56.6 0.0 5:42.44 fc_vcpu 0
298698 ec2-user 20 0 1058260 85544 84964 R 26.2 0.0 2:33.69 firecracker
```
The guest itself is not really deadlocked, just extremely slow. I can ssh into it and see the following:
```
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
139 root 20 0 3150612 36748 32 S 112.5 3.6 13:00.63 jshell
44 root 20 0 0 0 0 R 100.0 0.0 10:13.17 kswapd0
```
The Java process I've started (i.e. jshell) is starving because it doesn't get enough memory, but it doesn't run into a hard OOM like it does when running with deflate_on_oom=false. kswapd0 is running at 100% within the guest.
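One way to rule out a feature-negotiation problem is to check inside the guest that VIRTIO_BALLOON_F_DEFLATE_ON_OOM (feature bit 2) was actually negotiated. The sysfs features file prints one '0'/'1' character per feature bit, lowest bit first; the device name `virtio0` and the sample bit string below are assumptions of this sketch:

```shell
# On a live guest you would read the negotiated virtio feature bits:
#   features=$(cat /sys/bus/virtio/devices/virtio0/features)
# Bit 2 is VIRTIO_BALLOON_F_DEFLATE_ON_OOM. Sample bitmask for illustration:
features="0010000000000000000000000000110010000000000000000000000000000000"
echo "deflate_on_oom negotiated: ${features:2:1}"   # → deflate_on_oom negotiated: 1
```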
Querying the balloon metrics from the guest shows that the balloon slightly deflates itself, but this happens extremely slowly. E.g. initially we have something like this:
```json
{
  "target_pages": 230400,
  "actual_pages": 228864,
  "target_mib": 900,
  "actual_mib": 894,
  "swap_in": 0,
  "swap_out": 0,
  "major_faults": 7030729,
  "minor_faults": 14225986,
  "free_memory": 49930240,
  "total_memory": 1033064448,
  "available_memory": 0,
  "disk_caches": 655360,
  "hugetlb_allocations": 0,
  "hugetlb_failures": 0
}
```
And after about 45 minutes we get to:
```json
{
  "target_pages": 230400,
  "actual_pages": 180992,
  "target_mib": 900,
  "actual_mib": 707,
  "swap_in": 0,
  "swap_out": 0,
  "major_faults": 18425420,
  "minor_faults": 37668071,
  "free_memory": 50409472,
  "total_memory": 1033064448,
  "available_memory": 0,
  "disk_caches": 806912,
  "hugetlb_allocations": 0,
  "hugetlb_failures": 0
}
```
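Both snapshots above were obtained by polling the statistics endpoint on the host (this only returns data if stats_polling_interval_s was set when the balloon was configured; the socket path is again an example):

```shell
# Fetch the current balloon statistics from the host side
curl --unix-socket /tmp/firecracker.socket -X GET "http://localhost/balloon/statistics"
```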
If I wait about 60 minutes, jshell finally starts up and becomes usable.
So ballooning is indeed "kind of" working, but not really practically usable. I would expect the balloon device to deflate much more promptly in this case.
I did one more run to confirm the behavior and collect more numbers:
| time | target_mib | actual_mib | free_memory | available_memory |
|---|---|---|---|---|
| 17:05:44 | 900 | 900 | 63778816 | 0 |
| 17:24:26 | 900 | 893 | 50581504 | 0 |
| 17:36:34 | 900 | 879 | 51023872 | 0 |
| 17:44:46 | 900 | 870 | 54579200 | 0 |
| 17:55:46 | 900 | 835 | 67063808 | 0 |
| 18:06:14 | 900 | 797 | 50536448 | 0 |
| 18:13:16 | 900 | 637 | 55500800 | 0 |
| jshell exit | 900 | 637 | 321597440 | 251809792 |
| 18:25:12 | 900 | 637 | 338468864 | 270401536 |
| 18:41:25 | 900 | 637 | 338210816 | 270143488 |
As you can see, it takes more than an hour until jshell becomes responsive (somewhere between 18:06 and 18:13). It also looks like the deflation starts extremely slowly but speeds up as time goes on.
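A quick back-of-the-envelope check on the table: between 17:05 and 18:13 (about 68 minutes) the balloon released 900 - 637 = 263 MiB, i.e. an average deflation rate of only about 232 MiB per hour:

```shell
# Average deflation rate over the observed window (integer MiB/hour)
echo $(( (900 - 637) * 60 / 68 ))   # → 232
```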
The other interesting observation is that after jshell exits, the balloon doesn't inflate again, even though its actual size is way below its target size and there's plenty of free memory. I would have expected the balloon to inflate automatically and continuously whenever its actual size is below the target and free memory is available. But nothing happens; there's no CPU usage in the Firecracker threads, neither on the host nor in the guest.
PS: these results were collected on a c5.metal instance.