cjs-module-lexer icon indicating copy to clipboard operation
cjs-module-lexer copied to clipboard

Support building for externally shared js builtins

Open mochaaP opened this issue 1 year ago • 18 comments

Initial support for loading unbundled module in AddExternalizedBuiltin. Refactored to use esbuild under the hood.

  • Reduces downstream distribution package size (by not shipping wasm twice and not base64-encoding it)
  • Provides a cleaner stacktrace
  • Easier to patch

To enable this, pass EXTERNAL_PATH=/global/node_modules/cjs-module-lexer to build.js. You shall also pass this path to --shared-builtin-cjs_module_lexer/dist/lexer-path in Node.js's configure.py, with the extra /dist part in the path.

Reference:

https://github.com/nodejs/node/commit/ca5be26b318 https://github.com/nodejs/node/pull/44376 https://src.fedoraproject.org/rpms/nodejs-cjs-module-lexer/pull-request/2 https://github.com/nodejs/undici/pull/2643

mochaaP avatar Jan 26 '24 23:01 mochaaP

@mhdawson This is closely related to my/our work on distro-level WASM blob rebuilds. Could you take a look at this?

khardix avatar Jan 29 '24 11:01 khardix

cc @guybedford request for review :)

mochaaP avatar Jan 29 '24 13:01 mochaaP

@khardix thanks for the heads up, will take a look tomorrow.

mhdawson avatar Jan 29 '24 22:01 mhdawson

@mochaaP I took a look through and could you confirm my understanding that the following:

  1. The way the wasm blob is built does not change.
  2. Instead of always bundling the wasm blob into dist/lexer.js and dist/lexer.mjs, if EXTERNAL_PATH is specified, instead of bundling the wasm blob dist/lexer.js and dist/lexer.mjs will read the wasm blob from EXTERNAL_PATH/dist/lexer.wasm at runtime. If EXTERNAL_PATH is not specified then the wasm blob is bundled into dist/lexer.js and dist/lexer.mjs just like before.
  3. When built for EXTERNAL_PATH at runtime when Node.js is built with --shared-builtin-cjs_module_lexer/dist/lexer-path=<EXTERNAL_PATH> the expectd contents of EXTERNAL_PATH at runtime will be:
   <EXTERNAL_PATH>/lexer.js
                  /dist/lexer.wasm

Node.js will use <EXTERNAL_PATH>/lexer.js as the builtin for dist/lexar and then read the wasm blob from <EXTERNAL_PATH>/dist/lexer.wasm 4) EXTERNAL_PATH should be able to be any path on the filesystem, what you show above it just an example of how you'd plan to use it.

Overall I think that makes sense to me except that it seems to be that lexer.js and lexer.wasm should both be at the same level instead of one extra dist for lexer.wasm. I'd have to double check but I think when specifying --shared-builtin-cjs_module_lexer/dist/lexer-path= that should point to a directory with the contents of dist/lexer.js.

If the expectation was for

     <EXTERNAL_PATH>/lexer.js
                    /lexer.wasm

and the build step was also updated to copy lib/lexer.wasm to dist/lexer.wasm when EXTERNAL_PATH was specified, then providing package that installs the contents of the dist after the build to EXTERNAL_PATH would be more obvious.

Futher I assume that in most cases people will want to use both: --shared-builtin-cjs_module_lexer/lexer-path --shared-builtin-cjs_module_lexer/dist/lexer-path

In that context it might make sense to change where files get written when EXTERNAL_PATH is defined so that a directory called externalized-dist is created with something like this:

externalized-dist/lexer.js
                 /wasm-lexer/lexer.js
                 /wasm-lexer/lexer.mjs
                 /wasm-lexer/lexer.wasm

And then the contents of externalize-dist is bundled into the package to be built/distributed separately. If that makes sense the possibly copying over the licence and maybe the readme into externalized-dist might make sense as well.

The current case where Node.js uses files from the top level as well as dist seems a bit odd to me, but would breaking to fix, but since use of the externalized build will be new it offers the opportunity to tweak a bit.

I think in this case when building to externalize the lexer, Node.js could then set --shared-builtin-cjs_module_lexer=<EXTERNAL_PATH>/lexer --shared-builtin-cjs_module_lexer/dist=<EXTERNAL_PATH>/wasm-lexer/lexer

Also note this is based on how I remeber these options working in terms of what you provide as a path, but it was a while back when I added the option so not 100% sure without some testing.

Just some thoughts, will need @guybedford to comment on wether any of these suggestions make any sense.

mhdawson avatar Jan 31 '24 20:01 mhdawson

@mhdawson:

The current case where Node.js uses files from the top level as well as dist seems a bit odd to me

See https://github.com/nodejs/node/blob/9a25438a62186fa5bd2603536d99dcccd0acd893/lib/internal/modules/esm/translators.js#L82-L93. Loading builtins will need an exact match for the import specifier, so importing from a subdirectory is non-applicable, since there is no filesystem present.

In that context it might make sense to change where files get written when EXTERNAL_PATH is defined so that a directory called externalized-dist is created with something like this:

A problem with the externalized-dist approach is that we run npm build, then npm pack at the %build stage in Fedora, and unpack that tarball to the target path in the %install stage. Only files specified in package.json (to be exact, files, exports and main fields) will be written to the tarball, which excludes your proposed externalized-dist.

mochaaP avatar Feb 01 '24 01:02 mochaaP

Converted indentation in loadWasm.js from tabs to spaces.

mochaaP avatar Feb 01 '24 01:02 mochaaP

See https://github.com/nodejs/node/blob/9a25438a62186fa5bd2603536d99dcccd0acd893/lib/internal/modules/esm/translators.js#L82-L93. Loading builtins will need an exact match for the import specifier, so importing from a subdirectory is non-applicable, since there is no filesystem present.

My point is that using something outside of the dist directory seems odd since if dist is meant to contain externally used assets, also using files outside of dist is non-intuative, at least to me.

mhdawson avatar Feb 01 '24 20:02 mhdawson

A problem with the externalized-dist approach is that we run npm build, then npm pack at the %build stage in Fedora, and unpack that tarball to the target path in the %install stage. Only files specified in package.json (to be exact, files, exports and main fields) will be written to the tarball, which excludes your proposed externalized-dist.

If the approach makes sense to @guybedford then I think adding externalized-dist to the "files" section in the package.json should be acceptable. Depending on how the package was built, dist or externalized dist would be empty but I doubt the extra space of a empty directory would be objectionable. I think the main thing is if the concept of everything being used externally being a single directly or not makes sense/is intended. I look to @guybedford to comment on that.

mhdawson avatar Feb 01 '24 20:02 mhdawson

I think it's better to keep the current directory structure in case any third-party package depends on the dist folder.

mochaaP avatar Feb 02 '24 03:02 mochaaP

@mochaaP what I was suggesting would only change things if the package used the EXTERNAL_PATH option, would third party packages be able to find the files in the Fedora delivered install?

mhdawson avatar Feb 02 '24 15:02 mhdawson

I think that still breaks things, especially when there weren't any obvious outcomes 🧐

mochaaP avatar Feb 03 '24 02:02 mochaaP

Thanks for creating the PR, can't we just make the external path the default build approach instead of bifurcating the entire build process? The base64 technique comes from es-module-shims which runs in the browser primarily hence the benefits of a single file, but since cjs-module-lexer is mostly for Node.js usage we could just use this technique by default and have a single file that reads a local Wasm file. I'd prefer if we could not separate the loadWasm behind a dynamic import, and especially an extensionless import. Let's just inline the entire routine rather?

guybedford avatar Feb 12 '24 02:02 guybedford

When importing this in Node.js itself, how is the extra Wasm file loaded from a path inside of Node.js though?

guybedford avatar Feb 12 '24 02:02 guybedford

@guybedford:

can't we just make the external path the default build approach instead of bifurcating the entire build process?

Node.js loads builtins as a single script file under the internal/deps namespace (see src/node_builtins.cc). Hence, the external build approach is not suitable for embedded use in upstream, and it must be inlined as base64.

However, if us (downstream distribution) decided to unbundle the module, we could get the benefit of faster startups and smaller update sizes due to reduced libnode.so size and shave off the need for an extra base64 step.

When importing this in Node.js itself, how is the extra Wasm file loaded from a path inside of Node.js though?

When importing the module from libnode.so, it's loaded from a base64 buffer, as described above. If we opted in to use externally shared modules instead, it will resolve the module path from EXTERNAL_PATH and read it from there.

I'd prefer if we could not separate the loadWasm behind a dynamic import, and especially an extensionless import. Let's just inline the entire routine rather?

This is handled by esbuild (see the build.js change). I don't remember the exact reason for this, but probably related to tree-shaking (btw, that didn't help). Fixed in e17dd9dce77f238dc8ef43b141527085b49eab65.

mochaaP avatar Feb 12 '24 13:02 mochaaP

Ideally the best way we could achieve these same goals would be by fully migrating this project as a C++ build, work which was already outlined in https://github.com/anonrig/commonjs-lexer. There would be performance benefits to be had as well that way.

Otherwise it seems this PR doesn't apply to anything other than a very custom wrapping approach - my major concern there is that even if we land this, we aren't necessarily testing that embedding itself so it seems a little fragile.

I would be happy to land the PR as-is if @mhdawson also approves. Just talking through the wider concerns, given that I can't comment that well on the particular embedding.

guybedford avatar Feb 18 '24 22:02 guybedford

I'll chat with @khardix on Friday, if he's good with the PR as is, I'm ok as well. I still think the structure could be better, but not enough to block progress on this.

mhdawson avatar Feb 20 '24 15:02 mhdawson

Talked to @khardix , looks like its good from his perspective. He also mentioned that something similar went into undici https://github.com/nodejs/undici/pull/2643 so it is good to be consistent.

mhdawson avatar Feb 22 '24 15:02 mhdawson

@guybedford review request :)

mochaaP avatar Mar 04 '24 22:03 mochaaP

Hm, and since this is approved, can anyone push the merge button? :innocent:

I see the failed CI, but GH no longer show me any details as to why.

khardix avatar Apr 04 '24 10:04 khardix

This request was automatically failed because there were no enabled runners online to process the request for more than 1 days.

iirc it's a github actions outage

mochaaP avatar Apr 04 '24 12:04 mochaaP

@guybedford could you take a look at this?

mochaaP avatar Apr 07 '24 09:04 mochaaP

There does not seem to be any easy way to restart the action for the test.

mhdawson avatar Apr 08 '24 15:04 mhdawson

@mochaaP I can't seem to make a change to one of the files to re-trigger the CI run. @mochaaP could you push a change, possibly adding the CR at the end of package.json?

mhdawson avatar Apr 08 '24 15:04 mhdawson

@mhdawson The workflow requires approval from maintainer to run.

khardix avatar Apr 10 '24 10:04 khardix

Oh I think the culprit here is that ubuntu-18.04 is no longer available in hosted runners.

Consider migrate to ubuntu-latest (latest LTS)?

mochaaP avatar Apr 11 '24 03:04 mochaaP

since the workflow has changed drastically, re-requesting reviews.

mochaaP avatar Apr 12 '24 12:04 mochaaP

@mochaaP were all the workflow changes needed because of the move to unbuntu-latest ?

Needs a review from @guybedford, since he's the primary maintainer for this repo.

mhdawson avatar Apr 12 '24 14:04 mhdawson

nope, but I think it's better done in this way :)

mochaaP avatar Apr 12 '24 14:04 mochaaP

also node.js.yml is unused.

mochaaP avatar Apr 12 '24 14:04 mochaaP