node icon indicating copy to clipboard operation
node copied to clipboard

vm: memory leak on vm.compileFunction with importModuleDynamically

Open legendecas opened this issue 1 year ago • 15 comments

Version

v18.6.0

Platform

Darwin

Subsystem

vm

What steps will reproduce the bug?

Run node with node --experimental-vm-modules --max-heap-size=128 test.js, and "test.js" as:

const vm = require('vm');

function work() {
  const context = vm.createContext({});
  const fn = vm.compileFunction(`
    import('foo').then(() => {});
  `, [], {
    parsingContext: context,
    importModuleDynamically: async (specifier, fn, importAssertions) => {
      const m = new vm.SyntheticModule(['x'], () => {
        m.setExport('x', 1);
      }, {
        context,
      });

      await m.link(() => {});
      await m.evaluate();

      return m;
    },
  });

  fn();
}

(function main() {
  work()
  // yielding to give chance to the evaluation of promises.
  setTimeout(main, 1);
})();

How often does it reproduce? Is there a required condition?

always

What is the expected behavior?

Should not crash as OOM.

What do you see instead?

The program crashed with OOM shortly.

Additional information

Opening this issue to track the problem https://github.com/nodejs/node/pull/44198#issuecomment-1210964980.

legendecas avatar Aug 11 '22 06:08 legendecas

importModuleDynamically() callback is invoked everytime the import statement in the code paased to the vm is evaluated. The essential problem here is that it takes a vm.Script instance(created byvm.runInThisContext()) or a Function instance(created byvm.compileFunction()) as the second argument. Since we can not know when to evaluate the import statement(it may be scheduled to evaluate in one year...), we have to keep track the vm.Script instance or the Function instance forever.

However, I went through all JS files in /lib, and I didn't spot any usage of the second argument. e.g. image

Maybe we can fix this issue by disposing it. It's not backward compatible, but we have to admit this is a pitfall of the design.

ywave620 avatar Oct 07 '22 05:10 ywave620

In case anyone who is confused, the dilemma we have now:

  1. The lifetime of CompiledFnEntry(stored in Environment::id_to_function_map) is bounded to the Function created as a result of vm.compileFunction(), since that Function can be alive for a long time, the CompiledFnEntry is leaked!
  2. In the scenario of vm.runInThisContext(), theContextifyScript(stored in Environment::id_to_script_map) is not strongly referenced at all, so it is a subject to GC. However, it is required when evaluating an import statement in the code paased to vm, hence, we might run into a segmentation fault in https://github.com/nodejs/node/blob/1531ef1c0c30b35b764981683383187261c97779/src/module_wrap.cc#L585-L595 if it is GC'ed prematurely. Check https://github.com/nodejs/node/issues/43205#issuecomment-1272209939 for how to trigger :).

The fix I proposed above works for either of these two issues.

ywave620 avatar Oct 08 '22 03:10 ywave620

I would like to hear your ideas :) @legendecas

ywave620 avatar Oct 08 '22 03:10 ywave620

@ywave620 hey, thank you for digging into this.

The second argument of the importModuleDynamically is essential for userland hooks to resolve the specifier with the initiating script, as defined by the spec: https://tc39.es/ecma262/#sec-hostimportmoduledynamically. I don't think removing the argument referencing script makes sense here.

As for the segfaults of issues like https://github.com/nodejs/node/issues/43205, I believe a V8 refactoring is the best effort to settle a solution that fits our use case on host defined options: https://chromium-review.googlesource.com/c/v8/v8/+/3172764.

I'm drafting a PR based on the V8 refactoring to show how the crash can be fixed: https://github.com/nodejs/node/pull/44923.

legendecas avatar Oct 08 '22 09:10 legendecas

@legendecas Great! Some thoughts on your work:

  • In the first sight, host defined options should be something belonged to a script, rather than a context.
  • In addition, the solution you come up with prevents us from evaluating two scripts in the same context but with different importModuleDynamically() callback for each.

Discussion on the importModuleDynamically() callback:

I'm afraid that you and node/vm group have made a mistake in this point from the very beginning. Please correct me if I'm wrong!

  • First of all, I believe referencingScriptOrModule stated in the spec (should) corresponds to the second parameter of importModuleDynamically() documented in https://nodejs.org/api/vm.html#vmruninthiscontextcode-options
  • From the spec, referencingScriptOrModule is the active script or module, rather than the script resulted from vm.runInThisContext(). Check here: https://tc39.es/ecma262/#sec-import-call-runtime-semantics-evaluation image

ywave620 avatar Oct 08 '22 12:10 ywave620

The term "context" used in v8 and the Node.js VM module refers to different entities. It is important to disambiguate those "contexts" here. When the dynamic import call is evaluated, the referencingScriptOrModule is deducted from the topmost script context or module context's host-defined options, a.k.a. "active script or module". This doesn't prevent us from setting different importModuleDynamically callbacks for each vm.Script.

legendecas avatar Oct 09 '22 07:10 legendecas

In the latest node, the actual object passed as the second argument of importModuleDynamically() callback is not the correct value forreferencingScriptOrModule mentioned in the spec, do you agree? @legendecas The actual object is the initiating vm.Script, while the correct one is the "active script or module".

ywave620 avatar Oct 09 '22 07:10 ywave620

The value passed to the userland hooks is not the exact same value defined in the ecma262 host hooks. Node.js passes its own corresponding implementation-defined values to userland hooks.

legendecas avatar Oct 09 '22 07:10 legendecas

Then this design does not make sense to me. It definitely leads to a memory leak. We have to keep track of every vm.Script object resulted from vm.runInThisContext() because we do not know whether or when the import will be executed

ywave620 avatar Oct 09 '22 07:10 ywave620

Once the vm.Script and all its corresponding script contexts are not reachable from GC roots, we can safely assert the script will not be evaluated again and reclaim its related resources. That's why we need to mark the ContextifyScript as a weak wrap.

legendecas avatar Oct 10 '22 03:10 legendecas

So this is a contradictory design, and that is why we have a crash in https://github.com/nodejs/node/issues/43205#issuecomment-1272209939. We need to pass the vm.Script instance to the importModuleDynamically() as the 2nd argument, however, it may be GC'ed before importModuleDynamically() is invoked.

ywave620 avatar Oct 10 '22 04:10 ywave620

That is the problem that moving away from the id-based reference table and saving the wrap object in the script/module context is trying to resolve -- the solution illustrated at https://github.com/nodejs/node/pull/44923 :)

legendecas avatar Oct 11 '22 03:10 legendecas

I have a proposed fix for vm.compileFunction() in https://github.com/nodejs/node/pull/46785 - I think this should solve the leak for functions (when the importModuleDynamically is noop), but it needs module wraps to be fixed as well to be really useful (because usually one creates some modules in that callback).

https://github.com/nodejs/node/pull/44923 doesn't fix the leak of vm.compileFunction() (https://github.com/nodejs/node/issues/42080), by the way, because the problematic cycle described in the OP of https://github.com/nodejs/node/pull/46785 still exists there, it just switches the reference through WeakMap key -> value (that's still strong reference) to a symbol property reference (also strong). The example in the OP here can actually do without the vm.compileFunction() call - just create lots of vm.SyntheticModule() that seem to be unreachable once evaluated, and the process is still going to crash out of OOM, because the modules are also leaking.

Here is my take of the whole issue of lifetime management around these wrappers. What we need to tell V8 is that:

  1. The node::CompiledFnEntry/node::ContextifyScript/node::ModuleWrap JS wrappers must be kept alive when the compiled v8::Function/v8::Script/v8::Module is still executable - otherwise we get a use-after-free when import() is initiated within them.
  2. The node::CompiledFnEntry/node::ContextifyScript/node::ModuleWrap JS wrappers should be GC'ed once the compiled ``v8::Function/v8::Script/v8::Module` is no longer executable - otherwise we get a leak.

But how Node.js traces the native objects makes it difficult to inform V8's GC about this.

To achieve (1), node::CompiledFnEntry maintains a strong global reference to its JS wrapper, which we'll reset once the compiled v8::Function is no longer reachable. But V8's GC isn't actually informed that the strong global reference is going to be reset once the v8::Function is GC'ed (the "reclaim in weak callback" pattern doesn't actually tell V8's GC about this, because V8's GC obviously can't understand the semantics of a random C++ function), so in its eye the strong global reference just comes out of nowhere and shall remain reachable for who-knows-how-long. The compiled function is a legit JS value that can be passed around in JS, and in this case, as part of the API contract, it ends up in a closure that is referenced by the strong node::CompiledFnEntry wrapper. So V8 then thinks this "out of nowhere" reference to node::CompiledFnEntry wrapper in turn would keep the compiled Function reachable for who-knows-how-long, so it never attempts to GC that function, hence the weak callback is never going to be called to reset the strong global reference even when it's no longer accessible by the users, and we have the leak in (2). This isn't too difficult to fix, we can just switch to a proper GC-aware reference from the compiled v8::Function to node::CompiledFnEntry JS wrapper for (1) (both are legit JS values, so the existing API, e.g. v8::Object::Set()/v8::Object::SetPrivate() already allows us to easily inform V8's GC about this reference), and make the global reference to the node:::CompiledFnEntry JS wrapper weak so that the compiled function is going to be the only thing that keeps the node::CompiledFnEntry JS wrapper alive (2) in the eye of V8's GC, then it's fully capable of dealing with the cycle here.

Things are a bit more complicated for v8::Script/v8::Module because they are not legit JS values, the existing V8 API does not allow us to create the GC-aware reference that we need. And that's why we need https://chromium-review.googlesource.com/c/v8/v8/+/3172764, because it gives us a way to create a GC-aware v8::Script/v8::Module -> node::ContextifyScript/node::ModuleWrap JS wrapper reference using the new host defined options API (technically it's always possible, V8-internals-wise, to set this link up, V8 just doesn't provide the proper casts/APIs for the embedders to set them up, and the new host defined options API just happens to do the job).

Currently to achieve (2), node::ContextifyScript makes the strong global reference to its JS wrapper weak, so once the public vm.ContextifyScript is unreachable we will destroy the wrapper as well as the native object. But that's a bit too early, because v8::Script can still be executable whenvm.ContextifyScript is unreachable, then V8 could GC the node::ContextifyScript wrapper too early, resulting in the a use-after-free segfault in (1) like https://github.com/nodejs/node/issues/43205. With the new host defined options APIs, we'd be able to create a GC-aware v8::Script -> node::ContextifyScript JS wrapper reference similarly and fix the segfault.

In the case of node::ModuleWrap, to achieve (1) it currently maintains a strong global to its JS wrapper, and there's no way for V8 to know when to destroy them, so there is also the leak in (2) even if you just create a bunch of vm.SyntheticModules. However migrating to the new host defined options API is not enough to achieve (2), because it has an additional strong global reference to the v8::Module itself (this isn't an issue for scripts because node::ContextifyScript only needs to reference v8::UnboundScript, which isn't where the host-defined options are stored). We still need to do something about it or otherwise V8 would think this is just going to keep the v8::Module alive forever, which in turn keeps node::ModuleWrap alive forever. I am not sure if https://github.com/nodejs/node/pull/44923 does the job though - it tries to achieve (2) by making the global strong reference to v8::Module weak and then destructs the node::ModuleWrap in the weak callback, but I think that might be a bit too early and it could run into (1) again, since if we make the global strong reference to v8::Module weak, there isn't actually anything else from the JS land that keeps it alive, but we need it to be held alive by vm.SyntheticModule (who holds a reference to the node::ModuleWrap JS wrapper). I think what we need here is probably another way to create a GC-aware node::ModuleWrap to v8::Module back reference (instead of making it weak and using the weak callback - which, again, is not something that V8's GC understands).

joyeecheung avatar Feb 23 '23 15:02 joyeecheung

So I gave this another look while working on another memory issue and I found out why the original fix to vm.compileFunction crashed https://github.com/nodejs/node/issues/47096 - the initial assumption was that we could rely on the Function returned by vm.compileFunction() being alive to keep node::CompiledFnEntry alive. But that was not true - the top-level function could already be GC'ed while import() is initiated inside, in this case V8 would derive the referrer from the calling function's script - and the calling function is not necessarily the top-level function (the one returned by vm.compileFunction()), it could just be the innermost function that needs a closure, which was the case in the reproduction of https://github.com/nodejs/node/issues/47096, and that's why we got a use-after-free, because we saw that the top-level function was GC'ed and mistakenly thought that the rest can be discarded, which was actually too soon.

And then I realized that we might not actually need CompiledFnEntry at all. Its whole purpose was to maintain a "some host defined option" -> "top-level function" mapping. We have been using a number since V8 only allows primitives as host defined options. But with the stage-3 proposal that allows symbols (which are also primitives) to be used as weak map keys, we could use symbols to maintain the mapping. So as long as whatever that would initiate import() is alive, it should keep that symbol alive, and we can use the weak map semantics to keep the wrappers that's supposed to be passed into the callback alive. And all these are known by V8, so V8 can detect cycles. WIP here https://github.com/nodejs/node/pull/48510 - locally it fixes the original test and the regression that caused the revert for vm.compileFunction(). I also added a commit for ModuleWrap that made the OOM in the test case in the OP here go away locally, but will need more investigation to be certain.

joyeecheung avatar Jun 21 '23 16:06 joyeecheung

OP actually contains two leaks - the leak of vm.compileFunction() when importModuleDynamically and the leak of vm. SyntheticModule. https://github.com/nodejs/node/pull/48510 should fix the two, but vm.SourceTextModule would still be leaking because of the persistent v8::Module would keep the ModuleWrap alive forever (SyntheticModule doesn't suffer from this problem because it doesn't need to deal with dynamic import - no user code can be evaluated in it anyway - and therefore doesn't need to keep the ModuleWrap alive)

joyeecheung avatar Jun 22 '23 17:06 joyeecheung

Hi, We have been unable to upgrade our node beyond 16.10.0 and have been stuck since higher version regresses the performance of our jest tests. Is there any solution coming for this from node?

sunilsurana avatar Sep 01 '23 15:09 sunilsurana

For the lurkers 👋 and people coming here without context to the severity, specifically when using jest, I've seen jest tests go from 10 min to over an hour, because of these memory leak issues as described. It would be nice to get some more focus on this issue as I'm sure there are 10^6+ hours of cpu time being wasted a month due to the issue.

@sunilsurana as for your comment and issue set workerIdleMemoryLimit: '1G', in your jest config. It doesn't fix everything but it stops tests form crashing and you'll just have to deal with a bit longer runtime. This issue shouldn't stop you from being on node 18, though tests may take a bit longer.

@legendecas @joyeecheung @ywave620 keep up the great work! And let keep us posted on Pr's and developments, I speak for a wide audience when I say we look forward to a fix to this complex problem 😄

M-a-c-Carter avatar Sep 01 '23 16:09 M-a-c-Carter

The latest news is that https://github.com/nodejs/node/pull/48510 needs a reland of a V8 patch (https://chromium-review.googlesource.com/c/v8/v8/+/4834471) before it can go ahead (the V8 patch should be ready to land, but I'll push the button next Monday because it's not a great idea to merge it on a Friday night ;)) If you are interested you can try building that PR locally and see if the build fixes your bug. See https://github.com/nodejs/node/blob/main/BUILDING.md#building-nodejs-on-supported-platforms for build instructions.

joyeecheung avatar Sep 01 '23 19:09 joyeecheung

Closing as https://github.com/nodejs/node/pull/48510 has landed. We can re-open if that turns out to be incorrect (hopefully not).

joyeecheung avatar Sep 14 '23 15:09 joyeecheung

hi @joyeecheung , come here because we are experiencing exactly the same memory leak issue, but i see it is closed and also referenced #48510 closed as well...so what are the plans for the fix here or for a workaround? we are on node v18...thanks!

stas-sisense avatar Sep 26 '23 18:09 stas-sisense

@stas-sisense https://github.com/nodejs/node/pull/48510 has landed on the main branch, you can try it out using the nightlies (e.g. see https://gist.github.com/joyeecheung/083f6b6d601e64382121001ab87bbed2 on how to install nightlies) or wait for a v20.x release https://github.com/nodejs/node/pull/49874 or the v21.x release next month (I don't think we can backport this to v18.x though, weakmap with symbol keys isn't shipped there)

joyeecheung avatar Sep 26 '23 19:09 joyeecheung