
Thoughts on persistent caching

laverdet opened this issue 2 years ago · 8 comments

I made a loader called dynohot which implements hot module reloading in nodejs as an experimental loader.

One of the requirements of the loader is a code transformation written using Babel. Like all Babel transformations there is a good bit of overhead involved. I wanted to add a file cache to avoid the transformation in the common case where most source files are unchanged since the last invocation. This raised a bunch of questions and ad-hoc solutions that I'd like to share here.

How do we know what previous loaders are doing?

The result of nextLoad may change depending on what loaders are defined before us in the chain. If we want to be able to cache a result then it is necessary to ask each loader for a cache key.

The ad-hoc solution for this is a resolve hook. Cache-aware loaders will define a resolver for "loader:cache-key" and add their cache key to the payload blob. A cache provider will resolve this cache key and use it as a "namespace" while caching.

// `settings` could be anything. I've been defining loader settings in query strings:
// --loader "dynohot?ignore=pattern"
const self = new URL(import.meta.url);
const ignoreString = self.searchParams.get("ignore");
const ignorePattern = ignoreString === null ? /[/\\]node_modules[/\\]/ : new RegExp(ignoreString);

const settings = { ignore: ignoreString };
const cacheKey = JSON.stringify({ name: "dynohot", version: 1, settings });

export const resolve: NodeResolve = async (specifier, context, nextResolve) => {
	if (specifier === "loader:cache-key") {
		// Ask the loaders before us in the chain for their combined cache key.
		const previous = await async function() {
			try {
				const previous = await nextResolve(specifier, context);
				const url = new URL(previous.url);
				return url.searchParams.get("payload");
			} catch {}
		}();
		// Append our own key to whatever the earlier loaders reported.
		const payload = `${previous};${cacheKey}`;
		return {
			shortCircuit: true,
			url: `loader:cache-key?payload=${encodeURIComponent(payload)}`,
		};
	}
	return nextResolve(specifier, context);
};

A cache provider then only needs to do const cacheKey = import.meta.resolve("loader:cache-key") (with try/catch) to build a cache key for the active loader chain.
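A minimal sketch of that cache-provider side, assuming the chain is wired up as above. The resolver is injected here purely so the helper can be exercised standalone; a real cache provider would pass `import.meta.resolve`, and the try/catch matters because the specifier only resolves when a cache-aware loader is installed.

```javascript
// Hypothetical cache-provider helper; all names are illustrative.
// resolveCacheKey stands in for import.meta.resolve.
function chainCacheKey(resolveCacheKey) {
  try {
    const url = new URL(resolveCacheKey("loader:cache-key"));
    // Each cache-aware loader appended its own key to the payload.
    return url.searchParams.get("payload") ?? "";
  } catch {
    return ""; // no cache-aware loaders in the chain
  }
}
```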

How do we know all the source dependencies for the previous loader and resolvers?

A load hook can use information from multiple sources to generate a single underlying module source text blob. For example a TypeScript loader would use the settings in tsconfig.json to determine whether or not it should omit type-only imports. This has a material impact on the resulting source payload.

Resolve hooks run into the same issue. package.json, and the absence of package.json in traversed directories, affects the way a specifier is resolved to a module URL.

The ad-hoc solution is to pass forward arbitrary information in the result object but this isn't something that is officially documented and seems subject to the whims of the implementation:

import fs from "node:fs/promises";

export const load: NodeLoad = async (urlString, context, nextLoad) => {
	const previous = await nextLoad(urlString, context);
	const tsconfigURL = new URL("tsconfig.json", urlString);
	const tsconfig = JSON.parse(await fs.readFile(tsconfigURL, "utf8"));
	const result = await transform(previous.source, {
		tsconfigRaw: tsconfig,
	});

	return {
		...previous,
		// Report tsconfig.json as an additional source dependency.
		sourceURLs: [ ...previous.sourceURLs ?? [], tsconfigURL ],
		source: result.code,
	};
};

This would inform a cache provider about which files it needs to stat before returning a cached response. It also would have benefits to other loaders, for example, it would tell dynohot which files it needs to watch for updates.

What is a cache provider?

This led to my final question about whose job it is to cache? No loader should make any presumptions about its position in the loader chain. Multiple loaders implementing different forms of caching would lead to inconsistent caching, less than optimal performance, and duplicated code. Therefore I think terminating your loading chain with a caching loader makes the most sense.

The simplest caching loader would take the result of nextLoad and save a persistent cache entry for the given moduleURL, sourceURLs, and active chain cache key. It would then be up to the user, and not the earlier loaders, whether and how to cache.
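A sketch of such a terminating caching loader, under assumptions: `chainKey` would come from resolving "loader:cache-key" as described earlier, the cache directory stands in for .cache or similar, and a real cache entry would also persist the module format and sourceURLs rather than hard-coding them. All names are illustrative.

```javascript
import { createHash } from "node:crypto";
import fs from "node:fs/promises";
import os from "node:os";
import path from "node:path";

const cacheDir = path.join(os.tmpdir(), "loader-cache-demo");
const chainKey = "demo-chain-key"; // from "loader:cache-key" in practice

async function load(urlString, context, nextLoad) {
  // Namespace the entry by the loader-chain cache key plus the module URL.
  const entryPath = path.join(
    cacheDir,
    createHash("sha256").update(`${chainKey}\0${urlString}`).digest("hex"),
  );
  try {
    // Cache hit: short-circuit and skip every earlier loader in the chain.
    const source = await fs.readFile(entryPath, "utf8");
    return { format: "module", shortCircuit: true, source };
  } catch {
    // Cache miss: run the rest of the chain once and persist the result.
    const result = await nextLoad(urlString, context);
    await fs.mkdir(cacheDir, { recursive: true });
    await fs.writeFile(entryPath, String(result.source));
    return result;
  }
}
```

A real implementation would also record the stat information of the sourceURLs so stale entries can be invalidated, as discussed below in the thread.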

Another example of a caching loader would be one which JIT-bundles arbitrary packages under node_modules. I think it's absolutely deranged that packages in the npm ecosystem distribute minified single-file build artifacts from rollup. But of course there are very real load and parse performance benefits, so I understand why they do it. If bundling were implemented as a loader it would encourage package authors to distribute plain source files and let the user decide how to manage caching for their project.

Do we standardize this?

I proposed 2 ad-hoc solutions here: loader:cache-key resolution specifier, and sourceURLs array on the result of resolve and load. By asking loaders to provide this information we unlock the possibility of caching / watching loaders. Is this something we want to encourage? I think the cache key solution would be better represented as an export on the loader, but this isn't possible without support in the host environment. sourceURLs I think is pretty close to ideal and is only missing an implementation from the default loaders.

laverdet · Aug 21 '23 18:08

I wanted to add a file cache to avoid the transformation in the common case where most source files are unchanged since the last invocation.

Here's how I would do it:

import { createReadStream } from 'node:fs';
import { createHash } from 'node:crypto';

export async function resolve(specifier, context, next) {
  const result = await next(specifier, context);
  const url = new URL(result.url);
  if (url.protocol !== 'file:') return result; // for e.g. data: URLs
  // Hash the file contents; Readable#toArray() collects the digest chunks.
  const hashChunks = await createReadStream(url).pipe(createHash('sha256')).toArray();
  url.searchParams.set(
    import.meta.url, // An almost certainly unique key
    Buffer.concat(hashChunks).toString('base64url')
  );
  return { ...result, url: url.href };
}

By adding the hash to the resolved URL, you are guaranteed per spec that it won't be loaded more than once, so you don't need to implement your own cache.

What is the purpose of your custom loader: scheme? I'm not sure I see why you would need a special scheme.

This led to my final question about whose job it is to cache?

According to the current ES spec, there can be only one module per URL. IIRC it's also a limitation of V8: trying to load more than one module on the same URL would lead to undefined behavior. For this reason, Node.js has an internal module cache which loaders cannot access but can rely upon. With the addition of import attributes this is slightly more complicated (the module cache is now a Map<SerializedKey, ModuleNamespace>, where SerializedKey is a serialization of the URL string together with the import attributes), but the principle still holds.

So loaders are of course free to add an additional cache layer if they see fit, but I'd expect that wouldn't be necessary for most use cases.
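One plausible shape for that SerializedKey, purely illustrative and not Node's actual internal format: combine the URL with a stable serialization of the import attributes.

```javascript
// Illustrative module-cache key: URL plus sorted import attributes.
function moduleCacheKey(url, attributes = {}) {
  // Sort the keys so { type: "json" } serializes identically regardless
  // of the order the attributes were written in.
  const serialized = Object.keys(attributes)
    .sort()
    .map((key) => `${key}=${attributes[key]}`)
    .join(",");
  return `${url}\0${serialized}`;
}
```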

aduh95 · Aug 21 '23 18:08

I think you misunderstood. I am talking about a persistent file cache for caching the results of transformations between different runs of nodejs, not an in-memory cache for caching instances of modules within a single process. The motivation is explained clearly in the first few lines of the comment.


According to the current ES spec, there can be only one module per URL. IIRC it's also a limitation of V8, trying to load more than one module on the same URL would lead to undefined behavior. For this reason, Node.js has an internal module cache which loaders cannot access but can rely upon.

I'm sorry but none of this is true.

current ES spec, there can be only one module per URL

The resolution process is not specified by es262 at all. HostLoadImportedModule is host-defined and can be anything. They punted this to other specifications, and rightfully so.

IIRC it's also a limitation of V8

v8 doesn't care at all about the module URL, it's just metadata on a module record. When you invoke Module::InstantiateModule you pass a callback which implements the aforementioned host-defined HostLoadImportedModule operation: https://github.com/v8/v8/blob/33651e6252eb96ab12cb1a584385c4f7a60493c2/include/v8-script.h#L205-L217

We can verify with my other project isolated-vm which is as close to raw v8 bindings as you can get in nodejs:

const ivm = require('isolated-vm');
void async function() {
    const isolate = new ivm.Isolate();
    for (let ii = 0; ii < 10; ++ii) {
        console.log(await isolate.compileModule('import foo from "foo"; export {};', { filename: 'file:///wow' }));
    }
}();
Module {}
Module {}
Module {}
Module {}
Module {}
Module {}
Module {}
Module {}
Module {}
Module {}

You can also verify with vm which behaves the same way.

laverdet · Aug 21 '23 21:08

I'm sorry but none of this is true.

current ES spec, there can be only one module per URL

The resolution process is not specified by es262 at all.

ecma262 defines a [[LoadedModules]] structure that maps a [[Specifier]] to a [[Module]]. It turns out in Node.js we use the absolute URL returned by the resolve hook as [[Specifier]], not sure if that's required in this spec or if it's taken from another spec.

HostLoadImportedModule is host-defined and can be anything.

Sure but it must be stable: "If this operation is called multiple times with the same (referrer, specifier) pair […] then it must perform FinishLoadingImportedModule(referrer, specifier, payload, result) with the same result each time." But we're getting off topic, a loader doesn't have to comply with the ES spec anyway.

I am talking about a persistent file cache for caching the results of transformations between different runs of nodejs, not an in-memory cache for caching instances of modules within a single process

I completely missed that, sorry for the confusion.

aduh95 · Aug 21 '23 22:08

ecma262 doesn't have any requirements on the specifier except that it's a string; HTML is what requires they be URLs. node is free to make whatever choice it wants here, since it's not a web browser.

ljharb · Aug 21 '23 22:08

I proposed 2 ad-hoc solutions here: loader:cache-key resolution specifier, and sourceURLs array on the result of resolve and load.

Could we create a cache based on the resolved URL and a hash (like a shasum) of the source returned by nextLoad? Then maybe the cache wouldn't need to know anything about the other hooks in the chain? It would be the same problem as designing a cache for loading files from disk, where the resolved URL is like the filename and nextLoad is like readFile.
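A sketch of that idea: key the cache on the resolved URL plus a hash of the source returned by nextLoad, so the cache needs no knowledge of the other hooks in the chain. The injected transform and the in-memory Map (standing in for a persistent on-disk store) are illustrative.

```javascript
import { createHash } from "node:crypto";

// Factory for a hypothetical caching load hook; `transform` is the
// expensive work (e.g. transpilation) we want to avoid repeating.
function makeCachingLoad(transform) {
  const cache = new Map(); // a persistent store on disk in practice
  return async function load(urlString, context, nextLoad) {
    const previous = await nextLoad(urlString, context);
    const digest = createHash("sha256")
      .update(String(previous.source))
      .digest("hex");
    const key = `${urlString}\0${digest}`;
    let source = cache.get(key);
    if (source === undefined) {
      source = await transform(String(previous.source)); // only on a miss
      cache.set(key, source);
    }
    return { ...previous, source };
  };
}
```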

GeoffreyBooth · Aug 21 '23 23:08

Could we create a cache based on the resolved URL and a hash (like a shasum) of the source returned by nextLoad?

Ideally you wouldn't need to call nextLoad at all if you don't want to.

Imagine a generalized Babel loader that transforms your source based on the contents of .babelrc. You invoke nodejs with something like: node --loader babel --loader transform-cache exotic-script.xyz.

Invocation 1 (fresh):

  • transform-cache looks for a cache entry for exotic-script.xyz, finds nothing
  • transform-cache invokes nextLoad
    • babel invokes nextLoad
      • default load invokes fs.readFile
    • babel runs transform, returns result
  • transform-cache saves a cache entry to .cache or wherever. The cache entry includes the source file's mtime, size, and transformed text
  • Script executes, nodejs exits

Invocation 2 (afterward):

  • transform-cache looks for a cache entry for exotic-script.xyz, finds previous entry
  • transform-cache stats the underlying moduleURL and compares mtime and size. Finding them unchanged, it returns the previously transformed source text
  • Script executes, nodejs exits

What I'm suggesting is a cache scheme which allows us to elide the invocation to nextLoad entirely. With the scheme you suggested you will need to read the original source text in addition to the cached source text, each time.
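The validation step in invocation 2 could be sketched as follows. The entry shape ({ mtimeMs, size, source }) and the injected `statFile` are illustrative; a real loader would call fs.stat on the path behind the module URL.

```javascript
// Decide whether a persisted cache entry is still valid by comparing
// the recorded mtime and size against a fresh stat of the source file.
// Returning undefined means "fall back to nextLoad and re-cache".
async function sourceFromCache(entry, statFile) {
  if (entry === undefined) return undefined; // cold cache: run nextLoad
  const stats = await statFile();
  if (stats.mtimeMs !== entry.mtimeMs || stats.size !== entry.size) {
    return undefined; // file changed since the entry was written
  }
  return entry.source; // warm cache: nextLoad is elided entirely
}
```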

laverdet · Aug 22 '23 00:08

Ideally you wouldn't need to call nextLoad at all if you don't want to.

That's fine. I guess what this illuminates though is that there can be varying goals for creating a cache: avoiding file reads (the goal you cited) or avoiding processing. Like for example if your loader is the one that does the transpilation, you could use the approach I suggested to load transpiled output from cache rather than doing the transpilation work again.

GeoffreyBooth · Aug 22 '23 00:08

Yeah I didn't mean to say that either case is more valid than the other. Studying both is great.

Thinking about good "best practices" for caching would really benefit the ecosystem. Right now my intuition is that caching should live in a dedicated loader. If each loader implements its own caching mechanism then you might actually run into very poor performance on first load, because cache misses aren't free.

laverdet · Aug 22 '23 06:08