High memory usage during composition + inability to increase v8 composer's heap size results in OOMs
Component(s)
composition
Component version
v0.0.0-20240521213547-4077ab4a01f0
wgc version
N/A
controlplane version
0.88.2
router version
0.88.9
What happened?
Description
We are testing self-hosted Cosmo in our backend as we transition from Apollo. We recently enabled a validation mode where we compose using both Apollo and Cosmo's v8 composer. Cosmo's v8 isolate routinely OOMs during composition. At first these OOMs happened at the 3Gi limit we placed on the container. We then increased the memory limit of the composition container to 6Gi; however, the OOM still occurs at ~4Gi. v8 has a default heap size of 4GB and requires the --max-old-space-size flag (passed directly or via NODE_OPTIONS) to increase its available heap size.
Normally, I would increase the heap size for v8, but I don't see a mechanism to configure the v8 isolate in https://github.com/wundergraph/cosmo/blob/4077ab4a01f03c1c3dd7e6167255d3d8da80e3dc/composition-go/vm_v8.go#L242-L298.
I'm currently hacking in a patch that uses v8.SetFlags("--max-old-space-size=<new limit>"); however, there should probably be a generic mechanism for passing configuration to the v8 runtime.
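For illustration, here is a minimal sketch of that workaround in Go, assuming the rogchap.com/v8go bindings (the import path and helper names used by composition-go may differ) and an illustrative heap limit. SetFlags must run before the first isolate is created, since V8 ignores flag changes made after initialization.

package composition

import (
	"fmt"

	v8 "rogchap.com/v8go"
)

// configureV8Heap raises V8's old-space limit. The value is illustrative;
// it should stay below the container's memory limit.
func configureV8Heap(maxOldSpaceMB int) {
	v8.SetFlags(fmt.Sprintf("--max-old-space-size=%d", maxOldSpaceMB))
}

// newComposerIsolate creates the isolate used for composition after the
// heap limit has been raised.
func newComposerIsolate() *v8.Isolate {
	configureV8Heap(6144) // e.g. ~6GB to match a 6Gi container limit
	return v8.NewIsolate()
}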
Excerpt from OOM logs:
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
I'm a bit surprised at the memory requirements compared to Apollo's composition (which never pushed the 1Gi limit we set on it). Is this expected?
release commit: https://github.com/wundergraph/cosmo/commit/4077ab4a01f03c1c3dd7e6167255d3d8da80e3dc
Steps to Reproduce
Compose a sufficiently complex graph.
Expected Result
Ability to increase the v8 heap size to accommodate a complex graph.
Actual Result
Allocation failures during young object promotion in GC cause an OOM in the composition container.
Example logs from OOM
Starting cosmo composition in port <redacted>, admin port <redacted>

<--- Last few GCs --->
[18:0x7f35ec399940] 135 ms: Scavenge 19.2 (29.1) -> 19.2 (30.3) MB, 3.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[18:0x7f35ec399940] 369 ms: Scavenge 20.7 (30.3) -> 20.7 (30.3) MB, 230.5 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[18:0x7f35ec399940] 696 ms: Scavenge 21.6 (30.3) -> 21.6 (30.3) MB, 324.9 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure

<--- JS stacktrace --->

#
# Fatal javascript OOM in Scavenger: semi-space copy
#

<--- Last few GCs --->
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure

<--- JS stacktrace --->

#
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
#
Environment information
Environment
OS: minimal Linux used in distroless base images
Compiler (if manually compiled): go 1.22
Router configuration
No response
Router execution config
No response
Log output
No response
Additional context
No response
Hi @jsalem-brex, thank you for the report. To provide better support, we need a fully reproducible example. Other than that, what's the reason for choosing the Go approach rather than using our TypeScript composition library directly?
I'd like to use the Go variant too, since it's our primary language, but our pipeline was built using wgc because the js <-> go transformations seemed risky; judging by this issue, that caution wasn't unfounded.
We do not use anything but the router.
The only problem we still have is a memory leak when updating the schema: https://github.com/wundergraph/cosmo/issues/756
we need a fully reproducible example
I can't provide you with our schemas directly. I am also not sure what about our schemas is triggering the high memory usage, so creating an arbitrary example will be difficult. I'll see what I can put together, though.
We likely have more subgraphs in our schema than average. We also have a schema that is duplicated in all of our subgraphs for default shared types. Would either of those contribute to greater than typical memory usage?
Are there any specific configuration levers I should pull to get you more information in the meantime?
what’s the reason for choosing the Go approach rather than using our Typescript composition library directly?
A big reason we started looking at Cosmo in the first place was the ability to have Go as the primary language. It reduces the complexity of maintaining the system and we have a higher density of Go expertise/infra maturity compared to JS.
Is there a reason we shouldn't be using the Go libraries for composition?
Hi @jsalem-brex, thanks for the information. A reproducible example would definitely help. You can also share with me a private example (private repo) that we will handle confidentially.
A big reason we started looking at Cosmo in the first place was the ability to have Go as the primary language. It reduces the complexity of maintaining the system and we have a higher density of Go expertise/infra maturity compared to JS.
That absolutely makes sense.
Is there a reason we shouldn't be using the Go libraries for composition?
No, just for the sake of a potential workaround.
Hi @jsalem-brex,
Could you give an update on the status with the latest Go composition?
Thanks