tooling icon indicating copy to clipboard operation
tooling copied to clipboard

glob support in core

Open boneskull opened this issue 6 years ago • 37 comments

I'm curious if anyone thinks adding glob support to core would be a good idea?

Globs (globspecs? what's the right term?) are natively supported in POSIX (?) shells, which means you can do this:

$ some-cli-app **/*.foo.js

This is expanded by the shell before execution of some-cli-app into a list of files matching the glob.

But we get in to trouble when our scripts (e.g., in package.json) use globs, because they will not be expanded, and the scripts will fail (or otherwise exhibit unexpected behavior) on Windows.

To support globs in a portable way, any CLI app would need to pull in glob or a similar module, and pass the glob as a quoted string in the arguments:

$ some-cli-app "**/*.foo.js"

(On Windows, you must use double-quotes above, but using single-quotes in POSIX shells means env variables [e.g., $FOO] will be treated as literals; env variables are another can of worms)

The utility of glob has been well-established, and it's used in enough CLI apps that it may be a good candidate for vendoring or some other implementation. I haven't looked at the glob source, so it's unclear whether vendoring would be the best route, but I do know it contains several other dependencies which would also need to be pulled in (right @iansu?).

Thoughts?

cc @isaacs

boneskull avatar Sep 03 '19 18:09 boneskull

I do think this is a good idea but as you point out glob does have 6 dependencies, including minimatch so it's unclear to me how we would vendor it.

iansu avatar Sep 03 '19 18:09 iansu

globs are not supported by bash v5 on Fedora 30. Running some-cli-app **/*.foo.js will produce the same result as some-cli-app */*.foo.js - the double-star is not supported.

As for pulling glob into the core I'm not sure. I think it might be more appropriate to first ask if minimatch functionality belongs in the core. If so I think it would be important to evaluate pros/cons of multiple modules with similar functionality to determine which one would be best for the core (minimatch and micromatch each have about 15 million downloads per week).

coreyfarrell avatar Sep 03 '19 18:09 coreyfarrell

Glob support would be great precisely because it’s not consistent across platforms, but is needed for things like git/npm/eslint ignore, eslint overrides, etc.

ljharb avatar Sep 03 '19 18:09 ljharb

As often as globbing is used, and since most developers really don't realize how many edge cases and platform-specific issues need to be considered to produce a consistent experience, IMHO it would be great to add native support.

If Node.js does decide to include this functionality, I would be willing to put some time into helping (assuming that's of any interest). Either way, my recommendation is that you look at the matching and fs operations as completely separate chunks of logic.

  • minimatch and picomatch are both focused on the former (just the matching)
  • glob is focused on the latter (but is currently coupled to minimatch)

If it's decided that this functionality should be re-written directly in Node.js, I'd be happy to take a stab at contributing that logic, starting with just the matching. I think @isaacs has more experience and knowledge of the fs logic and cross-platform considerations than I do. But I'd be willing to help with that as well.

If you decide to use external libs for any of this, I would of course also be willing and interested in collaborating on making any adjustments, improvements, or changes to our libs that are necessary to make them meet your requirements.

jonschlinkert avatar Sep 03 '19 19:09 jonschlinkert

I would recommend also taking a look at fast-glob which is both faster and better maintained than glob.

// @mrmlnc

sindresorhus avatar Sep 04 '19 07:09 sindresorhus

Let's not throw subjective shade at other packages, though.

ljharb avatar Sep 04 '19 07:09 ljharb

sorry just saw your comment @ljharb (after I did thumbs up)

jonschlinkert avatar Sep 04 '19 08:09 jonschlinkert

We're meeting tomorrow and this will be on the agenda. To everyone offering comments: please attend if you're available!

boneskull avatar Sep 05 '19 17:09 boneskull

@sindresorhus It sounds like you'd be in support of pulling fast-glob into core, then?

boneskull avatar Sep 05 '19 17:09 boneskull

@boneskull I'd be in support of that decision too :)

jonschlinkert avatar Sep 05 '19 18:09 jonschlinkert

@ljharb It was not my intention to "throw a subjective shade". I'll be better at justifying my opinions in the future. I just wanted to mention an alternative.

Here are benchmarks comparing glob and fast-glob: https://gist.github.com/mrmlnc/f06246b197f53c356895fa35355a367c#file-fast_glob_benchmark_server-txt

By "better maintained", I meant more actively maintained. Poor word use on my part. glob has almost not had any real code commits in 3 years.

For context, I built globby (15M weekly downloads now) some years ago to provide a subjectively nicer API for glob with a Promise API, support for multiple patterns, negated patterns, etc. Then fast-glob came along which fixed a lot of bugs found in glob and I eventually switched globby over to use fast-glob. In my subjective opinion, fast-glob has a better API and codebase.

sindresorhus avatar Sep 05 '19 18:09 sindresorhus

It sounds like you'd be in support of pulling fast-glob into core, then?

Yes

sindresorhus avatar Sep 05 '19 18:09 sindresorhus

OK, we discussed this at today's meeting

  • Both glob and the underlying matcher should be available for consumption
  • To start, let's create a PR which adds matching support (a la minimatch or micromatch)
  • Suggested API: path.match()
  • "gitignore-style" matching is also widely used, and this should be a use case
  • @iansu and/or myself will provide a comparison of micromatch vs. minimatch vs. bash vs. gitignore to find the intersection of features; we can narrow down an API from there
  • After this, there are several options w/r/t adding glob support to fs, and we can cross that bridge later

gimme feedback

boneskull avatar Sep 06 '19 19:09 boneskull

I would be thrilled to hand glob over to node core.

That being said, there are a few rough edges that ought to be polished down first, so here I go brutally trashing my own work :)

The sync/async management is (overy) clever. It works fairly well, but it's not my favorite approach. Since it somewhat predates proper Classes in JavaScript, glob uses a different approach than most of my other more recent libraries that walk over trees. (Eg, tar, npm-packlist, etc.) The code is a bit brittle and hard to work with. It's also very (overly) optimized for performance, and meticulously designed to match Bash 4's glob behavior.

None of which would in itself be a problem, if glob was actually a good cross-platform general-purpose file system globber. We could call it done and let it age gracefully. However, it has some issues with UNC paths on Windows, and lacks proper promise support, both of which are very challenging to fix in its current state.

I'd love to give it a rewrite, and may someday, but honestly, if it was a priority, I probably would've done it by now. Maybe the best approach is to use it as a reference implementation, but rewrite it, fixing all the bugs and preserving the tests and actual glob resolution behavior. It works well enough for a lot of people right now (including me!), but caveat emptor: if you pull it in as-is, you might be welcoming a very stinky camel into your tent. Make sure it's not too hard to replace, if you do so.

isaacs avatar Sep 06 '19 23:09 isaacs

By "better maintained", I meant more actively maintained. Poor word use on my part. glob has almost not had any real code commits in 3 years.

"Better maintained" is actually a fair and accurate assessment. No shade taken :)

isaacs avatar Sep 06 '19 23:09 isaacs

I'm working to document this initiative, so here's a rough draft of what I have, including a comparison of existing solutions. If I'm wrong here, please say so. I'll send a PR after it's had a bit of review, but wanted to keep everything in this issue for now.


Globbing in Node.js core

See nodejs/tooling#38.

Motivation

TL;DR: Runtime handling of Globs is needed for better portability

Globs (henceforce capitalized) are natively supported in shells like Bash, which means you can do this, which will recursively find files matching *.foo.js, from your current working directory:

$ some-cli-app **/*.foo.js

Bash expands the above before execution of some-cli-app into a list of files matching the Glob.

But we get in to trouble when our scripts (e.g., in package.json) use Globs, because Windows won't necessarily expand these, and the scripts will fail (or otherwise exhibit unexpected behavior).

To support Globs in a portable way, any CLI app would need to pull in glob or a similar module, and pass the Glob as a (double-)quoted string in the arguments:

$ some-cli-app "**/*.foo.js"

The utility of glob and others like it has been well-established, and it's used in enough CLI apps that the functionality belongs in Node.js core.

Roadmap

The first thing that needed is a matching implementation. A Glob implementation (which uses the matcher to compare filenames) would then follow.

We expect the new function(s) would be added to the path module. That said, we must be careful to not conflate filepaths and Globs, because they are not the same.

Matching Implementation

An underlying matching implementation is a prerequisite to Glob support in Node.js core. To determine what we should be implementing, we'll compare existing solutions.

Of particular concern is how path separators are handled. While Globs are not filepaths, developers often treat them as such. The output of path.join() will be system-dependent, and if its output is passed into a matching implementation, the result may be unexpected. Please see micromatch's notes on the topic.

Comparison of Prior Art

We'll compare the featureset of the following:

We've identified the latter two as being most popular packages which provide the underlying matcher to a Glob implementation.

We will consider the following relevant features:

  • Wildcards (*.js)
  • "globstar" (**.js, a/**/*)
  • Negation (!a/*.js)
  • "extglob" (+(x|y), !(a|b))
  • Character set (foo-[12345].js)
  • Character ranges (foo-[1-5].js)
  • Character classes ([[:alpha:][:digit:]])
  • Brace expansion (bar/{a,b,c}.js)
  • Brace expansion ranges (foo/{1..5}.md; equivalent to foo/[1-5].md)
  • Case-insensitive matching
  • Comments (#*.js)
  • Single-character (foo.?s)
  • Implicit dotfile matching

For purposes of this best-effort comparison, we'll consider Bash's full capabilities instead of its default capabilities. All comparisons should reflect the latest version (as of this writing) of the respective software.

If a user can toggle the specific behavior, that's noted with "Flag."

Wildcards Globstar Negation extglob Char set Char ranges Char class Brace Expansion Brace Ranges Case Comments Single-Char Dotfile
Bash Yes Flag Yes Flag Yes Flag Yes Yes Nob Flag Yes Yes No
gitignore Yes Yes Flaga No Yes Yes Yes Yes No No Yes Yes No
micromatch Yes Flag Flag Flag Flag Flag Flag Flag Flag Flag Noc Yes Flag
minimatch Yes Flag Flag Flag Yes Yes Yes Flag Flag Flag Flag Yes Flag
  • a: Use \! at the beginning of a pattern for a literal !
  • b: Bash treats "brace expansion ranges" ({1..5}) as sequences which can be used to create multiple patterns at once, but are not patterns themselves.
  • c: If micromatch does support comments, it's undocumented.

boneskull avatar Sep 17 '19 22:09 boneskull

My key takeaway: if you expected the implementation to be simple, you will be disappointed. :smile: I don't think we can get away with not offering a handful of flags, either...

boneskull avatar Sep 17 '19 22:09 boneskull

c: If micromatch does support comments, it's undocumented.

Correct, micromatch doesn't support comments.

My key takeaway: if you expected the implementation to be simple, you will be disappointed. 😄 I don't think we can get away with not offering a handful of flags, either...

Indeed. I recommend also testing nested braces and extglobs, as well as braces and extglobs with nested negation patterns. Nested patterns are used more often than one might guess.

Regarding bash, both minimatch and micromatch have (intentionally) different behavior in some cases.

Also, it's important to note that .gitignore has completely different matching rules than glob. Both minimatch and micromatch can be used for matching gitignore patterns, but neither library has native support. FWIW I haven't seen any node.js libraries do this correctly yet.

jonschlinkert avatar Sep 18 '19 00:09 jonschlinkert

Another thought.... joining paths to globs is not trivial, but it's totally possible to do it correctly, consistently, in a cross-platform compatible way.

jonschlinkert avatar Sep 18 '19 00:09 jonschlinkert

@jonschlinkert

Another thought.... joining paths to globs is not trivial, but it's totally possible to do it correctly, consistently, in a cross-platform compatible way.

In your opinion, does micromatch do this correctly?

Also, it's important to note that .gitignore has completely different matching rules than glob. Both minimatch and micromatch can be used for matching gitignore patterns, but neither library has native support. FWIW I haven't seen any node.js libraries do this correctly yet.

It seems it's implemented in its internal wildmatch.c. There's a JS port of wildmatch but I can't vouch for its accuracy. It also hasn't been published in 5 years.

boneskull avatar Sep 18 '19 17:09 boneskull

@jonschlinkert I'm curious. If you were going to pare down micromatch to its essence (there are a lot of options, as you'll probably agree), what would you keep? What would you drop?

(Since we're aiming to determine our "MVP", as it were)

boneskull avatar Sep 18 '19 17:09 boneskull

IMO, in our implementation, I could definitely do without:

  • implicit dotfile support (a flag that essentially ignores the leading dot of a filename when matching)
  • comments--what is the use case for this outside of a script or .gitignore file? multiline globs don't make much sense in-and-of themselves unless newlines are stripped anyway, right?
  • brace expansion ranges ({1..5}); too confusing in this context. in minimatch, this is equivalent to [1-5]. micromatch seems to have more granular control over the behavior. it veers into "shell expansion" territory. I'm not sure if this is needed at all, but I'm happy to be convinced otherwise

boneskull avatar Sep 18 '19 17:09 boneskull

pls note that on Friday we have a meeting and everybody is invited. bring a +1 if you want

boneskull avatar Sep 18 '19 17:09 boneskull

Same question as in #19: what can node core do better than userland? I'm not opposed, but I do think node core can aim higher than "users won't have to install glob" and that those goals should be more clearly formulated before diving into specifics.

vweevers avatar Sep 18 '19 18:09 vweevers

In your opinion, does micromatch do this correctly?

Micromatch doesn't do any kind of joining, it's all left to the user to figure this out - same as minimatch and node-glob. However, I did start implementing this behavior somewhere, and I've given examples in many issues over the past few years, and I'd be happy to discuss over code when the time comes.

In essence, it boils down to ensuring that all slashes are converted to posix slashes in a non-glob file path before converting it to a glob pattern. There are other nuances that would warrant a longer discussion.

If you were going to pare down micromatch to its essence (there are a lot of options, as you'll probably agree), what would you keep? What would you drop?

My first reaction is to drop brace expansion, like you mentioned. Especially since globs with braces can be expanded before passing them to the matcher. (as a side note micromatch expands {1..5}} to [1-5], minimatch expands {1..5} to ['1', '2', '3', '4', '5']). But these are contrived examples. Real world patterns are closer to things like account-S{001..990}, which would expand to account-S(0{0,2}[1-9]|0?[1-9][0-9]|[1-8][0-9]{2}|9[0-8][0-9]|990) with micromatch, and would be an array of 990 globs with minimatch).

Overall, as you think about which features to include or drop, I think the most important consideration is how globs are used. With CLI tools like eslint, end-users don't have the ability to pass options to the matcher, or customize behavior by, for example, expanding braces beforehand. Removing a bunch of features or options will just make globbing useless or painful to a lot of end-users, or cause implementors to find hacky workarounds. I found this out the hard way over the past few years. Globbing is hard to do correctly, and implementors always seem to underestimate the number of edge cases their users will experience with it (especially because of Windows paths, and the general misconception that globs are just fancy file paths). This isn't an argument for not having native glob support in Node, it's the opposite. I think we can smooth this out and create consistency for users.

If we look at this from 20,000 feet, what we think of as "globs" in Node.js is actually a combination of several different matching concepts, some from shell expansion, others from elsewhere:

  • filename expansion - which mostly includes wildcards: *, ?, and "globstars": **
  • brace expansion - which you pointed out above: {a,b,c} => ['a', 'b', 'c']
  • posix character classes - i.e. [:alpha:]
  • extglobs - !(foo|bar)

Beyond features, there are a handful of things to consider in determining behavior:

  1. Each implementation has slightly different rules - bash and zsh even use different rules
  2. Each has different options
  3. Some features, in combination with certain options, contradict one another
  4. There is no canonical specification for any of these (at least I have been unsuccessful in finding one)
  5. Expected behavior differs in some scenarios when using globs in Node.js versus bash.

there are a lot of options, as you'll probably agree

I do. I really, really do. lol. When I created the parser for the latest micromatch, I had to make decisions about how the potentially-conflicting rules of each matching concept should work together. Instead, I avoided making those decisions by allowing the user to do it via options.

If micromatch is pulled into Node, I would be more than happy to reduce options or hard-card certain behavior. This is not expected of course.

jonschlinkert avatar Sep 18 '19 19:09 jonschlinkert

Almost forgot, picomatch is the matcher behind micromatch.

(edit: I initially said that picomatch doesn't have brace support, but I forgot that it does in the latest release. However, micromatch adds support for more complex brace expansion features).

jonschlinkert avatar Sep 18 '19 19:09 jonschlinkert

@vweevers That depends what you mean by "better," I suppose. Node.js doesn't provide a "better" way than userland to build services out-of-the-box; but its net and httpx modules enable userland to build better services. While yours is a valid question, it's not the only question, and adding something to core should not hinge on the answer.

Another question to ask is "what's essential?" Recursive mkdir and rmdir--as well as argument parsing in #19--are essential for tools which operate on files, which is quite a lot of them.

So, glob support is essential, because:

  • It enables better cross-platform tooling. Node.js has struggled with portability in several areas, and contributing to a better story here seems like a clear win.
  • Tooling authors have to use glob/fast-glob/globby if they want more consistent cross-platform behavior anyway, which they do (or if they don't, they should).
  • Depending on implementation, it could be a performance improvement over pure-JS userland modules. I'm not necessarily suggesting to write it in C++, but it'd be an easier sell to a consumer than a native userland module.
  • Enables potential future integration with fs module. I don't want to speculate on what this'd look like, but it'd open up the opportunity.
  • Potential future integration with #19.
  • Provide a "standard" or "blessed" way to handle slashes ("consistency", as @jonschlinkert mentions above)

boneskull avatar Sep 18 '19 19:09 boneskull

FWIW, Python's glob support is pretty barebones (it looks like they added globstar support in v3.5). I haven't used it myself, so don't know how Python tooling authors are getting along with it, but I'd sure like to know.

boneskull avatar Sep 18 '19 19:09 boneskull

IMO, in our implementation, I could definitely do without:

  • implicit dotfile support (a flag that essentially ignores the leading dot of a filename when matching)

What about modules like gulp where it might be desirable to match dot files (or files inside dot directories). For gulp.src it's probably easy enough to explicitly add patterns for dot-files and dot-directories which need to be copied but dot: true is easier than managing the extra glob patterns.

  • comments--what is the use case for this outside of a script or .gitignore file? multiline globs don't make much sense in-and-of themselves unless newlines are stripped anyway, right?

The one exception could be to work around the fact that JSON does not support comments. Completely ignoring any pattern matching /^#/ could allow comments in JSON glob array fields, for example:

{
  "files": [
    "**/*.js",
    "# reason for not publishing file.js",
    "!file.js"
  ]
}

I know this could be accomplished by pre-processing the array before passing to the matcher function but the idea here is consistency, for globs to eventually work the same everywhere. I don't see any value in supporting inline comments, for example "!file.js # reason for this ignore".

  • brace expansion ranges ({1..5}); too confusing in this context. in minimatch, this is equivalent to [1-5]. micromatch seems to have more granular control over the behavior. it veers into "shell expansion" territory. I'm not sure if this is needed at all, but I'm happy to be convinced otherwise

If this is supported should it automatically expand range as per the example at https://www.npmjs.com/package/micromatch#optionsexpandrange? Seems confusing that micromatch.makeRe('{10..15}') produces /^(?:[10-25]\.js)$/ by default instead of /^(?:(1[0-9]|2[0-5])\.js)$/.

coreyfarrell avatar Sep 19 '19 07:09 coreyfarrell

it's probably easy enough to explicitly add patterns for dot-files and dot-directories which need to be copied but dot: true is easier than managing the extra glob patterns.

Agreed. For users, it's much more difficult to exclude unwanted dotfiles when they are included by default, than the other way around. Also, every globbing implementation I'm aware of excludes dotfiles by default. In bash, for example, you need to set dotglob to match hidden files.

It's also really common for users (or implementors) to use patterns like * to match all of the files in a directory, forgetting that some files are hidden (like the .DS_Store file in every directory on MacOS).

Seems confusing that micromatch.makeRe('{10..15}') produces /^(?:[10-25]\.js)$/ by default instead of /^(?:(1[0-9]|2[0-5])\.js)$/.

Agreed. I've been planning to change that to the default. Thanks for bringing it up.

Python's glob support is pretty barebones

That depends on what you mean by bare bones lol. It supports globstars, ranges like [1-5], and even environment variable expansion. As a side note, I did this on a branch on picomatch, after a discussion with the jest core team They support basic path variables, and it's hard to expand them correctly. It turned out to be really useful, and fast, since it only adds 5 or 6 lines of code to the parser.

jonschlinkert avatar Sep 19 '19 18:09 jonschlinkert