cosmo
High memory usage during composition + inability to increase v8 composer's heap size results in OOMs
Component(s)
composition
Component version
v0.0.0-20240521213547-4077ab4a01f0
wgc version
N/A
controlplane version
0.88.2
router version
0.88.9
What happened?
Description
We are testing self-hosted Cosmo in our backend as we transition from Apollo. We recently enabled a validation mode in which we compose using both Apollo and Cosmo's v8 composer. Cosmo's v8 isolate routinely OOMs during composition. At first, these OOMs happened at the 3Gi limit we placed on the container. We then raised the composition container's memory limit to 6Gi, but the OOM still occurs at ~4Gi. v8 has a default heap size of about 4GB and needs the --max-old-space-size flag (passed directly or, under Node.js, via NODE_OPTIONS) to increase its available heap size.
Normally, I would increase the heap size for v8, but I don't see a mechanism to configure the v8 isolate in https://github.com/wundergraph/cosmo/blob/4077ab4a01f03c1c3dd7e6167255d3d8da80e3dc/composition-go/vm_v8.go#L242-L298.
I'm currently hacking in a patch that calls v8.SetFlags("--max-old-space-size=<new limit>"), but there should probably be a generic mechanism for passing a config to the v8 runtime.
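To make the request concrete, here is a rough sketch of what such a config could look like. `RuntimeConfig`, its fields, and `V8Flags` are hypothetical names of mine and do not exist in composition-go; the only real call assumed is v8go's `v8.SetFlags`, shown in a comment so the snippet runs without the v8go dependency.

```go
package main

import (
	"fmt"
	"strconv"
)

// RuntimeConfig is a hypothetical options struct for the v8 runtime.
// It is NOT part of composition-go; it only illustrates the proposal.
type RuntimeConfig struct {
	// MaxOldSpaceSizeMB is the old-space heap limit in MB.
	// Zero leaves the v8 default in place.
	MaxOldSpaceSizeMB int
	// ExtraFlags are passed through to v8 unchanged.
	ExtraFlags []string
}

// V8Flags converts the config into the flag strings v8 understands.
func (c RuntimeConfig) V8Flags() []string {
	var flags []string
	if c.MaxOldSpaceSizeMB > 0 {
		flags = append(flags, "--max-old-space-size="+strconv.Itoa(c.MaxOldSpaceSizeMB))
	}
	return append(flags, c.ExtraFlags...)
}

func main() {
	// 5120 MB is an arbitrary example value below a 6Gi container limit.
	cfg := RuntimeConfig{MaxOldSpaceSizeMB: 5120}
	fmt.Println(cfg.V8Flags())

	// In the VM setup, the flags would be applied before any isolate
	// is created, for example:
	//   v8.SetFlags(cfg.V8Flags()...)
}
```

Keeping an `ExtraFlags` passthrough would avoid adding a new field every time another v8 flag is needed.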
Excerpt from OOM logs:
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
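For anyone reading the GC trace lines, the two number pairs are, as far as I understand the v8 trace format, used heap and (in parentheses) committed heap, before and after the collection. A small sketch for pulling those values out of a line follows; `GCEvent` and `ParseGCLine` are illustrative helpers of mine, not part of Cosmo or v8go.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// GCEvent holds the heap figures from one v8 GC trace line.
type GCEvent struct {
	Collector         string
	UsedBeforeMB      float64
	CommittedBeforeMB float64
	UsedAfterMB       float64
	CommittedAfterMB  float64
}

// gcLine matches e.g. "ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB".
var gcLine = regexp.MustCompile(`ms: (\S+) ([\d.]+) \(([\d.]+)\) -> ([\d.]+) \(([\d.]+)\) MB`)

// ParseGCLine extracts the heap figures from a trace line.
// It returns false for lines that are not GC trace lines.
func ParseGCLine(line string) (GCEvent, bool) {
	m := gcLine.FindStringSubmatch(line)
	if m == nil {
		return GCEvent{}, false
	}
	nums := make([]float64, 4)
	for i := range nums {
		v, err := strconv.ParseFloat(m[i+2], 64)
		if err != nil {
			return GCEvent{}, false
		}
		nums[i] = v
	}
	return GCEvent{
		Collector:         m[1],
		UsedBeforeMB:      nums[0],
		CommittedBeforeMB: nums[1],
		UsedAfterMB:       nums[2],
		CommittedAfterMB:  nums[3],
	}, true
}

func main() {
	line := "[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure"
	if ev, ok := ParseGCLine(line); ok {
		fmt.Printf("%s: used %.1f -> %.1f MB, committed %.1f -> %.1f MB\n",
			ev.Collector, ev.UsedBeforeMB, ev.UsedAfterMB,
			ev.CommittedBeforeMB, ev.CommittedAfterMB)
	}
}
```

Read this way, the excerpted line reports about 50 MB of used heap for that isolate at the time of the scavenge.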
I'm a bit surprised by the memory requirements compared to Apollo's composition, which never pushed the 1Gi limit we set on it. Is this expected?
release commit: https://github.com/wundergraph/cosmo/commit/4077ab4a01f03c1c3dd7e6167255d3d8da80e3dc
Steps to Reproduce
Compose a sufficiently complex graph.
Expected Result
Ability to increase the v8 heap size to accommodate a complex graph.
Actual Result
Allocation failures during promotion in a GC cause an OOM in the composition container.
Example logs from OOM
#
# Fatal javascript OOM in Scavenger: semi-space copy
#
<--- JS stacktrace --->
[18:0x7f35ec399940] 696 ms: Scavenge 21.6 (30.3) -> 21.6 (30.3) MB, 324.9 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[18:0x7f35ec399940] 369 ms: Scavenge 20.7 (30.3) -> 20.7 (30.3) MB, 230.5 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[18:0x7f35ec399940] 135 ms: Scavenge 19.2 (29.1) -> 19.2 (30.3) MB, 3.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
<--- Last few GCs --->
Starting cosmo composition in port <redacted>, admin port <redacted>
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
#
<--- JS stacktrace --->
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
<--- Last few GCs --->
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
<--- JS stacktrace --->
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
#
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
<--- JS stacktrace --->
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
#
#
#
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
<--- JS stacktrace --->
<--- Last few GCs --->
<--- Last few GCs --->
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
<--- Last few GCs --->
#
#
#
<--- JS stacktrace --->
<--- JS stacktrace --->
#
<--- JS stacktrace --->
<--- Last few GCs --->
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
#
<--- Last few GCs --->
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
#
#
<--- Last few GCs --->
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
#
#
#
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
#
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
<--- JS stacktrace --->
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
<--- Last few GCs --->
Environment information
Environment
OS: minimal linux used in distroless base images
Compiler (if manually compiled): go 1.22
Router configuration
No response
Router execution config
No response
Log output
No response
Additional context
No response