help
help copied to clipboard
NODE_COMPILE_CACHE doesn't cache all loaded files
Node.js Version
v22.8.0
NPM Version
10.8.2
Operating System
Linux xxx 6.5.0-41-generic #41~22.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 3 11:32:55 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Subsystem
fs, v8, vm
Description
I'm working on a large application that loads around 15k files at startup and takes 20 seconds - a lot of this time I suspect is just compiling the modules. As such, I've been playing around with the new NODE_COMPILE_CACHE option, and I'm a bit confused - this is primarily a request to understand the caching, or some hints on how to debug it myself.
- I registered a
loadplugin to get an exact count of the files loaded - it came out to 15312. The same exact count across multiple runs - exactly as I'd expect - A single run of the application generates between 2700 and 3800 cache files
- Why isn't this constant each time?
- Why isn't this closer to the 15k loaded files?
- an admittedly naive approach to figure out which files have been included in the cache shows that a very small % of application files seem to match any of the hashes - I recursively iterate over every single file in the application to find all files, then compute the hash using CRC32 (with the esm seed, etc)
- on the most recent first run, only 6 files matched - sibling files in the exact same folder did not match
- A second run of the exact same application code, without clearing the cache directory, generates 5155 total files (e.g., an additional 1300-2400 files)
- why are there new files at all?
- If the cache relies on some transient field for the hash key, why aren't there double?
- on the most recent second run, 54 files matched the cache
- on a third run, this increased to 61
Each run of the application should have been exactly the same - e.g., there are no dynamic imports, I literally ran npm start multiple times.
Is there a good way of debugging this? Any hooks or logs that can be triggered either during cache insert, hit or miss? There is no obvious pattern to the files that were matched - e.g., there was a smattering of node_modules and individual/pairs of files from folders within the application with very similar siblings, loaded in the exact same way.
Have I materially misunderstood the implementation of this feature? E.g., is the hash key perhaps dependant on load order?
Minimal Reproduction
It would be extremely difficult to provide a minimal repro here - I expect the complexity of the application is part of the problem. I was not able to repro on a trivial application (e.g., a single huge source file) - in this situation the cache contains a single entry and I'm able to match the hash from the absolute path to the cache file.
Output
No response
Before You Submit
- [X] I have looked for issues that already exist before submitting this
- [X] My issue follows the guidelines in the README file, and follows the 'How to ask a good question' guide at https://stackoverflow.com/help/how-to-ask
cc @joyeecheung as I believe you implemented the cache
I registered a load plugin to get an exact count of the files loaded - it came out to 15312. The same exact count across multiple runs - exactly as I'd expect
Have you checked how many unique files are loaded? The load hooks are triggered multiple times whenever a module identified by the same URL is loaded (which I believe is a drawback of the current module.register() API, I found this behavior weird personally, some in the loaders team said that this would make hot-module replacement easier but I am skeptical about that).
Why isn't this constant each time?
The cache would be re-generated (invalidated) when the code changes, or the flags used to run Node.js is changed etc. It's hard to tell why without looking at the code. Have you tried running it without using the load hooks? To debug it without using the load hooks, you can try the NODE_DEBUG_NATIVE=COMPILE_CACHE environment variable instead.
Thanks for the fast response!
Have you checked how many unique files are loaded?
I used a set - so as long as the uniqueness isn't based on something other than the module URL, it should be good
Have you tried running it without using the load hooks?
I have not - about a year ago we migrated the code from something heinous into a more standard ESM approach, but to do this we used a load hook to add the import attributes where we import JSON files. I can probably fix these (maybe 100 places in total) - would simply eliminating the load hooks be sufficient? Or would I also need to remove the resolve hooks too?
you can try the NODE_DEBUG_NATIVE=COMPILE_CACHE environment variable
as in, remove COMPILE_CACHE=path and use the literal NODE_DEBUG_NATIVE=COMPILE_CACHE instead?
I suspect the load hooks might play a role in the strange behaviors you've seen here - the loader hooks are a bit all over the place and loading : compilation is not really 1:1. The compile cache is only hit when the module gets to the compilation phase. If the application somehow maps many different source code to the same URL, it may fare poorly as the compilation cache is basically mapped to source files 1:1, keyed by the file name (converted from URL if it's ESM).
as in, remove COMPILE_CACHE=path and use the literal NODE_DEBUG_NATIVE=COMPILE_CACHE instead?
You put in both - NODE_COMPILE_CACHE=path to enable the cache, NODE_DEBUG_NATIVE=COMPILE_CACHE to enable the logging.
That gave me exactly the output I wanted - thanks! The problem was me killing the process made it exit halfway through writing the cache files. Now the startup takes only 11 seconds (roughly what I'd expected), there are 25766 cache files (still no idea why it's not closer to the 15k loaded)
Is there a programatic way to persist the files? Or a specific exit code that will ensure it will wait?
Not super important, but how is the hash calculated? I'm still only matching 359 files (out of either 15k or 25k, depending on how you look at it).
const CRC32 = require("crc-32");
const zero = CRC32.buf(Buffer.from([]))
const esm = CRC32.buf(Buffer.from([1]),zero)
function getHash(file, seed) {
const hash = CRC32.buf(Buffer.from(file), seed).toString(16)
if (hash.length === 7) {
return `0${hash}`
}
return hash;
}
getHash('file:///home/zacknewsham/dev/qualia/services/qualia/app/imports/ai_cd_import/tests/eval.test.js', esm)
> '-4f0dfcf2'
No obvious variations (e.g., absolute) exist in the cache directory - can you point me to how to calculate the value? I'm pretty close because it works...some of the time :D
For reference, The loader is super trivial:
- it noops
.jsxfiles (the reason for this is historic, but I wouldn't expect to impact the cache meaningfully) - it adds import sttributes to the JSON imports
Is there a programatic way to persist the files? Or a specific exit code that will ensure it will wait?
There is a WIP in https://github.com/nodejs/node/pull/54971
Not super important, but how is the hash calculated?
It's roughly:
const kCommonJS = 0, kESM = 1;
function getCacheName(filename, type) {
let crc = 0;
crc = zlib.crc32(Buffer.from([type]), crc);
crc = zlib.crc32(Buffer.from(filename, 'utf-8'), crc);
return (crc >>> 0).toString(16) + '.cache';
}
getCacheName('/Users/joyee/projects/node/test/fixtures/compile-cache-flush.js', kCommonJS);
// '6a7f7d45.cache' - matches what I see locally with the logs
Note that the full file name, not the file URL, is part of the hash, and it should be utf-8 encoded.
return (crc >>> 0).toString(16) + '.cache';
The >>> 0 got me to 920 matches :thinking: I think I'll just dump the logs to a file - and get the list of files from there!
Actually I am not sure if the >>> 0 part is needed for zlib.crc32() as the built-in version should always return a uint32 already. I think it's only necessary when you use the crc-32 npm pacakge (I am also not sure how much that differs from zlib.crc32())
It seems there has been no activity on this issue for a while, and it is being closed in 30 days. If you believe this issue should remain open, please leave a comment. If you need further assistance or have questions, you can also search for similar issues on Stack Overflow. Make sure to look at the README file for the most updated links.
It seems there has been no activity on this issue for a while, and it is being closed. If you believe this issue should remain open, please leave a comment. If you need further assistance or have questions, you can also search for similar issues on Stack Overflow. Make sure to look at the README file for the most updated links.