node test_runner: support passing globs

trafficstars

a followup for https://github.com/nodejs/node/pull/47499 this includes a lean implementation of https://github.com/isaacs/node-glob on top of minimatch (added at https://github.com/nodejs/node/pull/47499)

TODO:

[x] make one or more commits that are not semver-major, so the introduction of the globbing code can be backported for future usage
[x] docs

Apr 21 '23 00:04 MoLow

Review requested:

[ ] @nodejs/test_runner

Apr 21 '23 00:04 nodejs-github-bot

CC @isaacs @nodejs/test_runner

Apr 21 '23 00:04 MoLow

@isaacs might have good feedback on this too

Apr 21 '23 11:04 benjamingr

Took a quick pass over it. I agree this is a good start. Tried to highlight a few issues that will likely be gotchas for Windows users.

This boils down to essentially the same algorithm that node-glob used until the recent rewrite; walk down the pattern adding each possible match to the task queue until it runs out. It's punishingly slow for advanced use cases and large folder trees. I definitely recommend a more advanced approach to caching and folder walking like what node-glob uses before making this available on the user-facing fs.glob. But good enough for this use case, and fewer moving parts.

Apr 21 '23 22:04 isaacs

This boils down to essentially the same algorithm that node-glob used until the recent rewrite

So . . . why wouldn’t we just import the latest version of node-glob?

Apr 22 '23 21:04 GeoffreyBooth

So . . . why wouldn’t we just import the latest version of node-glob?

node-glob has many dependencies I am not sure we need/want in core, and it supports many options not really needed. in addition, it also has issues raised on the PR vendoring minimatch - it does not use primordials and it does not use node errors.

I definitely recommend a more advanced approach to caching and folder walking

@isaacs I went over the code and I am not really sure what better caching and folder walking can be done, can you elaborate?

Apr 22 '23 22:04 MoLow

node-glob has many dependencies I am not sure we need/want in core, and it supports many options not really needed. in addition, it also has issues raised on the PR vendoring minimatch - it does not use primordials and it does not use node errors.

If the (ultimate, not present) goal is to expose an fs.glob() functionality that's suitable for general purpose usage, then yes, pulling in node-glob is the way to go. (Fast-glob is another good option, but I think the hazards of diverging from posix shell semantics are probably not worth the increase in performance.)

node-glob is, at its core, path-scurry + minimatch. It uses minipass for streams and iteration, because it's the most performant way to do that, but streams could be swapped out (albeit at the loss of performance and synchronous streaming) with core streams, and iteration could be swapped out (with a significant performance penalty) with plain old JS generators. The dependencies on jackspeak and foreground-child are just for the glob CLI, but that probably shouldn't be included in node core.

The only thing that path-scurry depends on is lru-cache, which is just a very fast cache that prevents it from OOMing if you try to walk a directory with a million files and directories in it. (The child nodes created to reflect fs.readdir results are stored in the cache rather than attached directly to the parent object, which also helps prevent gc pauses from getting abusively long just by virtue of walking a very large object graph.)

@isaacs I went over the code and I am not really sure what better caching and folder walking can be done, can you elaborate?

Sure. In this approach, there's a queue of tasks which are effectively a tuple of {pattern, path}. (pattern is really "minimatch set + index", but same idea.) And then, serially, it walks over the task queue, adding new tasks for all the child entries if there's more pattern to be found there.

This blows up combinatorially in many common patterns. For example, consider: a/**/b/**/*/x.js, so we start with just that in the task queue.

tracing through the algo

Starting at the a folder, evaluating that pattern, we have a **, so we need to evaluate both the patterns **/b/**/*/x.js and b/**/*/x.js (since ** can match against nothing). Say that a has 3 children, a, b, and c. For each of those, we create 2 tasks, one for **/b/**/*/x.js and b/**/*/x.js. So now there's 6 more tasks in the queue, total of 7. Say that each of those have 2 children, so there's a/a/x, a/a/y, a/b/x, a/b/y, a/c/x, and a/c/y. For each of those, we need to test both **/b/**/*/x.js and b/**/*/x.js, so 12 more items in the queue, or 19 unprocessed items just to walk down the first part of a 1-3-2 tree. And we haven't even gotten to the second **. So, even though you're caching stats and readdirs, and not testing the same pattern against the same folder multiple times, it's still blowing up to a lot of unnecessary iterations. And, if a/b/c/x.js matches, you still have to process all the other patterns that could potentially match it.

That this is happening synchronously rather than in parallel, and making extra lstat calls rather than using the Dirents from readdir, makes it worse. Of course, that's maybe fine for this specific use case for the test runner, and easy enough to correct later.

The approach that both fast-glob and node-glob use is, instead of processing a queue of tasks each related to a single pattern and folder, is keep a set of patterns that might be applicable to a given path, and then just do a very efficient folder walk. Node-glob has full support for .. appearing in the pattern, so there are cases where it needs to back up the walk to the parent dir (which does some extra work, potentially processing a path a second time, so it loses some perf in exchange for that bit of correctness).

The caching approach implemented by path-scurry is also significantly more comprehensive and defensive. Within a given path-scurry object, it is nearly impossible to accidentally call lstat or readdir more than once on a given path. (Unless you have over 16*1024 items, in which case fs.readdir may have to be called again, because the cached items will be dropped to prevent OOM.) Path-scurry also caches resolve() calls, which in hot path globbing, will end up soaking up a lot of CPU cycles, since it's common to end up resolving a lot of the same paths over and over again, even if you're clever enough to avoid matching the same pattern against that resolved path.

There's nothing wrong necessarily with just caching each thing in its own standalone object or whatever, but path-scurry evolved from the frustration of juggling all these things and just wanting a simple read-only view of the filesystem which could guarantee that it would only perform the bare minimum of system operations required, and cache everything for the duration of the operation. Once you do take the step of parallelizing the walk, it becomes increasingly challenging to keep it all straight otherwise.

More precise in unrolled pseudocode:

1
path = a
pattern = **/b/**/*/x.js
queue = []
# add non-globstar form to queue
queue = [ {a, b/**/*/x.js} ]
children=[a b c]
queue = [
  {a, b/**/*/x.js}
  {a/a, **/b/**/*/x.js}
  {a/b, **/b/**/*/x.js}
  {a/c, **/b/**/*/x.js}
]
2
# shift queue
path = a
pattern = b/**/*/x.js
queue = [
  {a/a, **/b/**/*/x.js}
  {a/b, **/b/**/*/x.js}
  {a/c, **/b/**/*/x.js}
]
# string type, join and push
queue = [
  {a/a, **/b/**/*/x.js}
  {a/b, **/b/**/*/x.js}
  {a/c, **/b/**/*/x.js}
  {a/b, **/*/x.js}
]
3
# shift queue
path = a/a
pattern = **/b/**/*/x.js
# add non-globstar form to queue
queue = [
  {a/b, **/b/**/*/x.js}
  {a/c, **/b/**/*/x.js}
  {a/b, **/*/x.js}
  {a/a, b/**/*/x.js}
]
children = [x y]
queue = [
  {a/b, **/b/**/*/x.js}
  {a/c, **/b/**/*/x.js}
  {a/b, **/*/x.js}
  {a/a, b/**/*/x.js}
  {a/a/y, **/b/**/*/x.js}
  {a/a/x, **/b/**/*/x.js}
]
4
# shift queue
path = a/b
pattern = **/b/**/*/x.js
queue = [
  {a/c, **/b/**/*/x.js}
  {a/b, **/*/x.js}
  {a/a, b/**/*/x.js}
  {a/a/y, **/b/**/*/x.js}
  {a/a/x, **/b/**/*/x.js}
]
# add non-globstar form to queue
queue = [
  {a/c, **/b/**/*/x.js}
  {a/b, **/*/x.js}
  {a/a, b/**/*/x.js}
  {a/a/y, **/b/**/*/x.js}
  {a/a/x, **/b/**/*/x.js}
  {a/b, b/**/*/x.js}
]
children = [x y]
queue = [
  {a/c, **/b/**/*/x.js}
  {a/b, **/*/x.js}
  {a/a, b/**/*/x.js}
  {a/a/y, **/b/**/*/x.js}
  {a/a/x, **/b/**/*/x.js}
  {a/b, b/**/*/x.js}
  {a/b/y, **/b/**/*/x.js}
  {a/b/x, **/b/**/*/x.js}
]
5
# shift queue
path = a/c
pattern = **/b/**/*/x.js
queue = [
  {a/b, **/*/x.js}
  {a/a, b/**/*/x.js}
  {a/a/y, **/b/**/*/x.js}
  {a/a/x, **/b/**/*/x.js}
  {a/b, b/**/*/x.js}
  {a/b/y, **/b/**/*/x.js}
  {a/b/x, **/b/**/*/x.js}
]
# add non-globstar form to queue
queue = [
  {a/b, **/*/x.js}
  {a/a, b/**/*/x.js}
  {a/a/y, **/b/**/*/x.js}
  {a/a/x, **/b/**/*/x.js}
  {a/b, b/**/*/x.js}
  {a/b/y, **/b/**/*/x.js}
  {a/b/x, **/b/**/*/x.js}
  {a/c, b/**/*/x.js}
]
children = [x y]
queue = [
  {a/b, **/*/x.js}
  {a/a, b/**/*/x.js}
  {a/a/y, **/b/**/*/x.js}
  {a/a/x, **/b/**/*/x.js}
  {a/b, b/**/*/x.js}
  {a/b/y, **/b/**/*/x.js}
  {a/b/x, **/b/**/*/x.js}
  {a/c, b/**/*/x.js}
  {a/c/x, **/b/**/*/x.js}
  {a/c/y, **/b/**/*/x.js}
]
5 done so far, 10 tasks in the queue

So, for small folder trees, yeah, it's fine, because it can't get all that big without child nodes to have to process. But if you wanted to use it for any kind of large or time-sensitive job, the overhead will really start to hurt, because the number of required tests goes up so much faster than the number of items in the tree.

Conversely, by associating all the patterns relevant at a given point in the walk, you only have to visit each path once (barring unavoidable .. patterns), meaning all that path joining and children lookups only have to happen once, and associate all the possibly matching patterns with each child.

1
path = a
patterns = [**/b/**/*/x.js]
# add non-globstar form
patterns = [**/b/**/*/x.js, b/**/*/x.js]
queue = []
children = [a b c]
# filter down to the possibly matching patterns for each
queue = [
  {a/a, [**/b/**/*/x.js, b/**/*/x.js]}
  {a/b, [**/b/**/*/x.js, **/*/x.js, b/**/*/x.js, */x.js]}
  {a/c, [**/b/**/*/x.js, b/**/*/x.js]}
]
2
path = a/a
patterns = [**/b/**/*/x.js, b/**/*/x.js]
queue = [
  {a/b, [**/b/**/*/x.js, **/*/x.js, b/**/*/x.js, */x.js]}
  {a/c, [**/b/**/*/x.js, b/**/*/x.js]}
]
children = [x y]
queue = [
  {a/b, [**/b/**/*/x.js, **/*/x.js, b/**/*/x.js, */x.js]}
  {a/c, [**/b/**/*/x.js, b/**/*/x.js]}
  {a/a/x, [**/b/**/*/x.js, b/**/*/x.js]}
  {a/a/y, [**/b/**/*/x.js, b/**/*/x.js]}
]
3
path = a/b
patterns = [**/b/**/*/x.js, **/*/x.js, b/**/*/x.js, */x.js]
queue = [
  {a/c, [**/b/**/*/x.js, b/**/*/x.js]}
  {a/a/x, [**/b/**/*/x.js, b/**/*/x.js]}
  {a/a/y, [**/b/**/*/x.js, b/**/*/x.js]}
]
children = [x y]
queue = [
  {a/c, [**/b/**/*/x.js, b/**/*/x.js]}
  {a/a/x, [**/b/**/*/x.js, b/**/*/x.js]}
  {a/a/y, [**/b/**/*/x.js, b/**/*/x.js]}
  {a/b/x, [**/b/**/*/x.js, b/**/*/x.js, **/*/x.js, */x.js, x.js]}
  {a/b/y, [**/b/**/*/x.js, b/**/*/x.js, **/*/x.js, */x.js, x.js]}
]
4
path = a/c
patterns = [**/b/**/*/x.js, b/**/*/x.js]
queue = [
  {a/a/x, [**/b/**/*/x.js, b/**/*/x.js]}
  {a/a/y, [**/b/**/*/x.js, b/**/*/x.js]}
  {a/b/x, [**/b/**/*/x.js, b/**/*/x.js, **/*/x.js, */x.js, x.js]}
  {a/b/y, [**/b/**/*/x.js, b/**/*/x.js, **/*/x.js, */x.js, x.js]}
]
children = [x y]
queue = [
  {a/a/x, [**/b/**/*/x.js, b/**/*/x.js]}
  {a/a/y, [**/b/**/*/x.js, b/**/*/x.js]}
  {a/b/x, [**/b/**/*/x.js, b/**/*/x.js, **/*/x.js, */x.js, x.js]}
  {a/b/y, [**/b/**/*/x.js, b/**/*/x.js, **/*/x.js, */x.js, x.js]}
  {a/c/x, [**/b/**/*/x.js, b/**/*/x.js]}

By filtering the pattern set down (and, in the case of globstar, potentially increasing it as well) at each iteration, you prevent the need to (almost) ever process the same path more than once, and it becomes a lot easier to reason about making it parallel.

The thing that node-glob uses to calculate which patterns are needed for each walk step, and the matches that should be recorded for each, is: https://github.com/isaacs/node-glob/blob/main/src/processor.ts

Note that there's also some special rules for when globstar is allowed to follow a symbolic link.

in addition, it also has issues raised on the PR vendoring minimatch - it does not use primordials and it does not use node errors.

I'm actively eager to use node errors when possible, I know @cjihrig is looking into making that more exposed for userland code.

I've got a polyfill for primordials that seems to work, but I haven't gotten around to porting path-scurry to use it. I figure that's the best first test, since path-scurry is almost zero-dep, and has a good suite of benchmarks to see whether the perf impact is too bad. If path-scurry, lru-cache, and minimatch all can be ported to use primordials without too much of a performance impact, then I think porting glob is going to be pretty easy. But that might be a pretty big "if".

Apr 23 '23 00:04 isaacs

@isaacs thanks for the detailed response. I will go over it later. if there is anything I can do to help to get primordials in glob and minimatch LMK. regarding errors I think that is a smaller issue but @cjihrig has closed the issue requesting to expose internal errors

Apr 24 '23 07:04 MoLow

This will enable interop with TypeScript, right? i.e.

node --test --loader=ts-node/esm "test/**/*.ts"

Just asking to be sure the requirement on file extensions being .js/.cjs/.mjs is gone after this PR :)

Thanks for all your hard work on the test runner @MoLow !

May 16 '23 12:05 meyfa

This will enable interop with TypeScript, right? i.e.

Yes!

May 16 '23 19:05 MoLow

@isaacs I have changed the glob algorithm to something much closer to what node-glob does. PTAL

I am open to adding node-glob as a dependency in the long term, but I think we should first go with this implementation and see how we can get minimatch to work with primordials first, then we can apply the same approach with node-glob and possibly replace this implementation. I am willing to help with that

May 16 '23 19:05 MoLow

@nodejs/test_runner please review

May 16 '23 19:05 MoLow

The https://github.com/nodejs/node/labels/notable-change label has been added by @MoLow.

Please suggest a text for the release notes if you'd like to include a more detailed summary, then proceed to update the PR description with the text or a link to the notable change suggested text comment.

May 16 '23 19:05 github-actions[bot]

I think we should first go with this implementation and see how we can get minimatch to work with primordials first, then we can apply the same approach with node-glob and possibly replace this implementation.

Would replacing the implementation be a breaking change? If so, do we need to mark this as experimental and/or add a warning?

May 16 '23 21:05 GeoffreyBooth

Would replacing the implementation be a breaking change? If so, do we need to mark this as experimental and/or add a warning?

no, it should be compatible - the tests here are taken from node-glob

May 17 '23 05:05 MoLow

What makes it semver-major?

May 18 '23 09:05 targos

Note that tests are failing.

May 18 '23 09:05 targos

What makes it semver-major?

there are some cases (https://github.com/nodejs/node/pull/47653#discussion_r1173663388) where filenames are valid glob patterns.

also - prior to this pr running node --test some/directory would run all test files under that directory, now it wont - since some/directory is a valid glob pattern pointing directly to a directory.

Note that tests are failing.

yeah, some cases dealing with globs with, absolute paths I will try figuring out how that can be tested

May 18 '23 09:05 MoLow

Would you be able to make one or more commits that are not semver-major, so the introduction of the globbing code can be backported for future usage? I know it's more work, but I'm afraid that if we consider this big change entirely semver-major, we will have issues with backports to v20.x

May 18 '23 10:05 targos

I know this may be a lot to ask, but it would be really awesome if this could be non-breaking, so it can be released in full with Node.js 20. I think there may be ways to achieve that, specifically:

there are some cases (https://github.com/nodejs/node/pull/47653#discussion_r1173663388) where filenames are valid glob patterns.

If exact matches have precedence over glob expansions, then the current behavior would be unchanged in these cases? (https://github.com/nodejs/node/pull/47653#discussion_r1174191958)

prior to this pr running node --test some/directory would run all test files under that directory, now it wont - since some/directory is a valid glob pattern pointing directly to a directory

Can we detect that the glob resolved to a directory, and apply the existing test file search logic to it?

May 18 '23 12:05 meyfa

I know this may be a lot to ask, but it would be really awesome if this could be non-breaking, so it can be released in full with Node.js 20. I think there may be ways to achieve that, specifically:

there are some cases (#47653 (comment)) where filenames are valid glob patterns.

If exact matches have precedence over glob expansions, then the current behavior would be unchanged in these cases? (#47653 (comment))

prior to this pr running node --test some/directory would run all test files under that directory, now it wont - since some/directory is a valid glob pattern pointing directly to a directory

Can we detect that the glob resolved to a directory, and apply the existing test file search logic to it?

I do not think this behavior is correct in the long term. besides the performance overhead that this will add, it will be very hard to reason about what files run and why

May 18 '23 12:05 MoLow

@MoLow I'll take a look today.

Regarding matching and handling directories, the pattern that I've found works for node-glob's CLI is that if a pattern matches a path exactly, then that path is returned, and no expansion is performed. Eg:

$ tree a
a/
├── [x-z]
├── x
├── y
└── z

1 directory, 4 files

$ node dist/cjs/src/bin.js 'a/[w-z]'
a/z
a/y
a/x

$ node dist/cjs/src/bin.js 'a/[x-z]'
a/[x-z]

The rimraf CLI, which also has glob support, requires explicit opt-in with a --glob flag, which is a bit less convenient, but rimraf is dangerous, so it seemed wise to make it be a little more careful.

Either approach avoids the footgun of a pattern like app/routes/*.test.ts where both app/routes/[id].test.ts and app/routes/i.ts exist. The pattern will be expanded by posix shells to the actual filename, but not windows shells, so the cli can't know whether that's supposed to be a glob or a path. (Checking SHELL/ComSpec envs or process.platform aren't enough; bash exists on Windows.)

For expanding directories, node-tap does its include/exclude globbing logic based on configs, and then expands directories using a default glob pattern to find all test-ish files in the directory. (Basically, anything it knows how to run, that isn't named fixture{s,}.) I believe other test frameworks have similar approaches, and in my experience, it's not too unintuitive for users. If I have a bunch of files in ./test/*.js, I'd expect node --test test to run them all.

May 18 '23 19:05 isaacs

Oh, one other question, what's the goal of this PR? Should I be reviewing this in the context of "general purpose glob impl that will be exposed to users as fs.glob()", or is this going to be an internal util just for the test runner?

I ask because obviously the bar for performance and resilience is much higher in the former. Probably no one has a million test files or is going to sweat a few ms here or there to find them all.

May 18 '23 19:05 isaacs

Oh, one other question, what's the goal of this PR? Should I be reviewing this in the context of "general purpose glob impl that will be exposed to users as fs.glob()", or is this going to be an internal util just for the test runner?

I have addressed that in this comment: https://github.com/nodejs/node/pull/47653#issuecomment-1550273486 This is currently for the test runner. in the future, I do want to have anfs.glob() method, but that can happen only after we figure out how to use minimatch with primordials - that will effect if and how we can do the same with node-glob

May 19 '23 05:05 MoLow

Would you be able to make one or more commits that are not semver-major, so the introduction of the globbing code can be backported for future usage? I know it's more work, but I'm afraid that if we consider this big change entirely semver-major, we will have issues with backports to v20.x

@targos I have split this into two commits

Jun 22 '23 12:06 MoLow

@nodejs/test_runner I think this is ready, can you review?

Jun 22 '23 12:06 MoLow

CI: https://ci.nodejs.org/job/node-test-pull-request/52344/

Jun 22 '23 14:06 nodejs-github-bot

CI: https://ci.nodejs.org/job/node-test-pull-request/52353/

Jun 22 '23 16:06 nodejs-github-bot

CI: https://ci.nodejs.org/job/node-test-pull-request/52369/

Jun 22 '23 21:06 nodejs-github-bot

CI: https://ci.nodejs.org/job/node-test-pull-request/52373/

Jun 22 '23 22:06 nodejs-github-bot

node node copied to clipboard

test_runner: support passing globs

node
node copied to clipboard