add more flexible, richer, globbing support including exclusion patterns
There has been a considerable discussion on globbing with excludes / negative matches; i.e. something like
find images -mindepth 1 -maxdepth 1 -type f -a ! -name '*.jpg'
or
rm $(ls dir | grep -v '\.gz$')
but less ugly, failure-prone (we should never pipe ls output) and perhaps with more concise and handsome syntax.
There's clearly a case for glob extensions of this kind; apparently that's why bash has shopt -s extglob and zsh has setopt EXTENDED_GLOB. However, these solutions are perhaps better viewed as how not to examples: they never stop producing more and more extensions with practically unflippable "options", they are arcane, hard to remember, hard to search for, unsuitable for recursive directory matches, interfere with other syntax (! is used for history in interactive bash, as well as for negative globs), and so on.
So, in #1444 it was decided not to go in this same direction. Yet another "extended glob" syntax would've made things worse for everyone.
There has also been related discussion in PR #354 which was never merged.
But there needs to be a good answer to How do I cp all the files in dirA except those ending in .iso, in Fish? which hopefully doesn't involve rsync, find or grep -v.
Is it a glob subcommand? Is it a filter built-in for arrays? Is it a regex matcher? Does it allow nested double negations, alternatives and the basic globbing *?
I don't, and can't possibly, know. Your proposals welcome. I'm creating this issue to keep the problem open, and to have a place to track any progress.
There's a few questions here.
The first one (because answering it with a "no" would obsolete the others) is:
- Do we want a builtin solution to this?
If we don't, we could improve integration of find as the most obvious tool - the one thing that is really missing there is a way to split command substitutions on other characters - \0, in this case, because find $something -print0 | while read -lz element; set list $list $element; end is a bit annoying.
If we do, then we should check what we already have - and for names, string match works well (though the \0-splitting issue remains):
How do I cp all the files in dirA except those ending in .iso, in Fish?
cp (string match -v '*.iso' -- dirA/*)
This allows excluding with "-v" (as above) and also regexes (which means also stuff for character ranges like "[A-E]") with "-r".
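To make the filtering semantics concrete, here is a minimal sketch in Python (not fish) of what an inverted glob match over an already-expanded list does. It assumes Python's `fnmatch.fnmatchcase` is a close enough stand-in for whole-string glob matching (it also accepts bracket expressions, which fish globs do not); the helper name `match_glob` is hypothetical.

```python
from fnmatch import fnmatchcase


def match_glob(pattern, names, invert=False):
    """Keep names matching the glob; with invert=True keep the ones that do not."""
    return [n for n in names if fnmatchcase(n, pattern) != invert]


# Stand-in for the expanded list that dirA/* would produce.
files = ["dirA/a.iso", "dirA/b.txt", "dirA/c.iso.bak"]

# Everything except names ending in .iso (the pattern must match the whole string).
print(match_glob("*.iso", files, invert=True))
```

Note that the whole list is built before any filtering happens, which is the same property as the `string match` approach.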
So what would remain here is
- A simple, easy, obvious way to split correctly (on \0 for filenames as the only character that cannot be included)
- Other ways to filter lists of files (by permission, by size, by age)
- Optimization - the string match solution would expand the glob entirely before matching - maybe this can be made quicker. Also it would fail on globs with gigantic results (longer than ARG_MAX) even if the filtered result is smaller.
We'd need the first one anyway - see #3164. The second is traditionally find's job, though that has rather weird syntax and I'm not sure how cross-platform it is - I have a GNU version, so presumably that has a bunch of extensions. The third is useful regardless, and I'm not sure how much of a problem it really is.
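The \0-splitting from the first point can be sketched like this. It is a Python illustration rather than fish code, assuming the input is the raw byte output of something like `find -print0`; the function name `split0` is made up for the example.

```python
def split0(data):
    """Split NUL-terminated records, such as the output of `find -print0`."""
    records = data.split(b"\0")
    # A trailing terminator leaves one empty record at the end; drop it.
    if records and records[-1] == b"":
        records.pop()
    return records


# Spaces and newlines survive because only NUL separates records.
sample = b"plain.txt\0with space.txt\0with\nnewline.txt\0"
for name in split0(sample):
    print(name)
```

Splitting on newlines instead would break the third name into two entries, which is why \0 is the only safe separator for arbitrary filenames.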
Or we could add a glob command that takes a "glob expression" (meaning some kind of string that describes the string you're looking for, not necessarily just what we allow as globs currently) without needing a bunch of pipes.
Okay, so the most important option to find to make it work on all filenames - "-print0", is not in POSIX, because:
Using a null terminator meant that any utility that was going to process find's -print0 output had to add a new option to parse the null terminators it would now be reading.
Urgh. It's in FreeBSD's and OpenBSD's find, though, so I'm assuming it's available "everywhere".
Without this, we'd need a builtin.
Well, as an example I emphatically hope fish will not copy, bash invented "extglob" syntaxes: https://www.linuxjournal.com/content/bash-extended-globbing The reason this is horrible is that it reinvents a lot of regexps with yet another completely different syntax to learn. EDIT: OP already said all this
I'd much rather rely on the 2 well-understood DSLs for this — ls | grep regexp and find.
Also, there are many more commands that produce file name lists and combine well with grep, such as locate, ag -l, git ls-files etc...
Of course, need to solve splitting on NULs.
find -print0, is not in POSIX... Without this, we'd need a builtin.
I don't think this follows. A shell doesn't have to implement something that's easily done in user space simply because user space didn't solve this yet, let alone if user space solved it long ago but is not preinstalled everywhere.
I understand the point of not reinventing the wheel, and that [F-S] rich globbing can be done with grep and find. In Bash it is as simple as mplayer [F-S]* but in Fish that would require a lot of writing, and had it been rm [F-S]* it would also require some echo testing to avoid a typo causing data loss.
Fish does a lot of syntax highlighting and guesses the next command for the user to save time. Without rich globbing all that saved time is wasted many times over by just one rm [F-S]*.
There are probably two sides to this.
- keep the language clean
- keep the shell fast to use
I would say point 2 is why someone would choose Fish. Rich globbing is such a huge time saver that, even if it is reinventing the wheel from a programming point of view, it is a missing core feature for a shell.
My suggestion here, presuming that there were indeed a need for more powerful globbing, would be to avoid anything that would require more escaping for simple characters or anything that would need constant maintenance and updates each time a minor tweak is needed or a feature needs to be expanded. Globbing is hard - and that's when you're just parsing a string as a glob entry, nevermind as part of a shell command.
How about some sort of native pcre support, akin to JavaScript's /pattern/ syntax (which doesn't even go in quotes, for those that aren't familiar with it, e.g. string.match(/test pattern/)), but obviously not using something as common as / is in the shell? 😄
For example, ls @regexp@ (just the first reasonable character that came to my mind). Then people can do whatever they want in between the @ signs using a very well documented, well-understood, and extremely capable "globbing" language without requiring any additional symbols to be blacklisted/reserved (presuming that @.....@ is rare enough syntax).
I would say point 2 is why someone would choose Fish.
Highly doubtful. Fish takes friendliness every time it conflicts with efficiency and the lack of "features" as compared to bash should really illustrate that.
Having delimiters for rich globbing is a good idea, much better than introducing new escaped characters.
In response to @faho's first comment, I would suggest a different approach:
Most people assume that POSIX "allows" all characters, except '/' and '\0', which is not quite correct:
If you look at the POSIX specification you will find that the portable character set is actually just [a-zA-Z0-9._-]. Every other character is basically a non-POSIX extension that is not explicitly forbidden.
So, create a path (or whatever) builtin that works much like the string match you used as an example, but that automatically sanitizes pathnames. If a problematic pathname is encountered, a warning is given on the command-line and the pathname in question is not written to stdout. Further processing can then be done using standard tools on a line-by-line basis, such as the string builtin or stest.
Every other character is basically a non-POSIX extension that is not explicitly forbidden.
Yeaaaah... that doesn't work in practice.
Check the portable character set again - there's no space in there! Now check your $HOME. You have files with spaces in them. This is one of those things that POSIX specifies that just didn't pan out.
Restricting a tool to only work with POSIX "portable characters" is restricting it to work only in theory.
…why do you assume I have space in my $HOME?
Anyway, that's not the point. After all, the only problematic character for shell scripting is \n and as far as I am aware, the only reason to put that into a filename is to break someone's computer.
I think @faho was saying check the contents of your $HOME folder and not its literal value, ie files on your PC.
I keep getting bit by this missing feature in my daily use.
Things like "cp all the files except those ending in .iso" are surprisingly common in day to day use. It's the reason GUI shells have Ctrl-A to select all files and Ctrl-click to manually de-select some.
What I would use in bash is [^abc], but any non-trivial use case results in super-ugly and unreadable code:
cp *[^o] *[^s]o *[^i]so *[^.]iso target/ # yes, this would copy all files except *.iso
# assuming you have enough files to fill the holes and/or nullglob set
What I would like to say to Fish is * except *.iso, but the problem is that both globs expand to lists and I can only see very few ways of dealing with multiple lists at the same time.
Method 1. Don't deal with multiple lists at all, aka. quote one of the two patterns or both. Examples:
cp (glob --except=\*.iso \*) target/ # quote both patterns?
cp (glob --except=\*.iso *) target/ # only quote the exception pattern?
cp (glob -x\*.iso *) target/ # compact option
cp (except \*.iso *) target/ # more specific verb
This is kind of a cop-out, but I would keep it in mind as a last resort. This has the benefit of not needing any change to Fish, as it can be implemented as functions. I'd wager it probably already exists in one form or another in some user's functions folder.
Method 2. Parse some special combination of characters as a delimiter between the two lists. Examples:
cp (glob * --except *.iso) target/ # not very good
cp (glob * \0except *.iso) target/ # maybe solves the problem, but is ugly
cp (glob * '' *.iso) target/ # solves the problem, but is hermetic
The first example has the serious drawback of failing on directories that contain a file named --except. The risk can be reduced by using more exotic combination of characters as the delimiter token, including possibly the empty string or \0, the only byte that cannot appear in file names, but this would make it ugly and unreadable.
Unless I'm missing another way around it, this method is not going to work.
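The failure mode of Method 2 can be shown with a small Python model. It assumes a hypothetical `glob` command that only ever sees its arguments after the shell has expanded both patterns; `glob_except` is an illustrative name, not an existing function.

```python
def glob_except(argv):
    """Model of a hypothetical `glob ... --except ...` that sees only expanded argv."""
    if "--except" not in argv:
        return list(argv)
    # Split at the first '--except'; everything after it is the exclusion list.
    cut = argv.index("--except")
    keep, drop = argv[:cut], argv[cut + 1:]
    return [name for name in keep if name not in drop]


# Directory with a.iso and b.txt: `glob * --except *.iso` after expansion.
print(glob_except(["a.iso", "b.txt", "--except", "a.iso"]))

# Same directory plus a file literally named '--except', which sorts first.
print(glob_except(["--except", "a.iso", "b.txt", "--except", "a.iso"]))
```

In the second call the file name is mistaken for the delimiter, the "keep" list is empty, and nothing is returned at all.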
Method 3. Add "macro" or "meta-programming" support to Fish, meaning a special type of function that is called in a different (earlier) execution phase than normal functions. This would allow writing a "glob" macro (not function) that will receive the patterns un-expanded and handle them in a specific way:
cp (glob * --except *.iso) target/
macro glob
# this is a macro, therefore it will receive an un-expanded $argv, such as:
# set argv \* --except \*.iso
...
end
Meta-programming is very powerful (see Scheme / LISP) and not hard at all to add to a language engine, but it can alter the flavour of the language substantially.
As an aside, if we go down this route, should macros receive their entire input line, until end of line or enclosing matching parentheses? This would allow implementing alternative redirection and/or pipe syntax, for example:
cp (newpipe some-command |> some-pipe | more-complicated <| pipe-syntax) target/
macro newpipe
# this would receive its entire input until end-of-line or matching ')':
# set argv some-command '|>' some-pipe '|' more-complicated '<|' pipe-syntax
...
end
Maybe a few injection points could be figured out (before pipe and redirection parsing, before brace expansion, before glob expansion...), allowing each macro to specify which phase of input parsing it wants to receive.
Method 4. Add very specific (and hopefully well thought-out) new syntax, just to handle subtraction (and possibly intersection) of glob patterns. We already have union of glob patterns in the form of brace expansion: {one,two,three}*.iso.
This would be the most specific method of addressing this issue. Instead of hacking around existing limitations (methods 1. and 2.) or making Fish a more generally powerful but more surprising language (method 3.) this would provide (and fix in stone!) a new specific piece of syntax just to write more powerful glob expressions.
Advantages of this method would be providing a very compact syntax, and implementing the exclusion (or intersection) logic inside the core mechanism for globbing, written in C++, making this the most performant method of all.
One way to go about it, in order to break the least amount of existing code, would be to add the new syntax inside the existing brace expansion delimiters {...} For example, we could use an inner prefix of ^, which is hopefully used very little in existing code, to indicate a negation of the enclosing glob:
cp *{^*.iso} target/
This looks similar to negated character classes in regular expressions [^...] but it would work in a different way. {^...} would not be a valid glob by itself, it would only be accepted at the end of an existing pattern, and it would implement subtraction (set difference) from its enclosing glob expression:
rm *{^*.iso} # all files except *.iso
rm *.iso{^a*} # all *.iso files except those beginning with 'a'
rm *{^a*,b*,c*} # all files except those beginning with 'a', 'b' or 'c'
Of course, if anybody is using {^foo} to mean literally {^foo}, they would be in for a surprise. But it's hopefully rare enough that a warning of deprecation, followed by removal a few releases after that, should be enough to contain the damage.
While we are at it, another extension that comes to mind would be intersection, for example using the & character:
rm *.iso{&*foo*} # all *.iso files that _also_ contain 'foo'
The latter would probably be less useful, because in most cases you can rearrange the pattern (eg. *foo*.iso) but it cannot always be done and it doesn't hurt to think forward. Someone will find a use for it.
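The intended semantics of the proposed `{^...}` and `{&...}` suffixes can be modelled in a few lines of Python. This is only a sketch of the set logic under the assumption that `fnmatch.fnmatchcase` approximates glob matching; the syntax itself is the proposal above, not something any shell implements, and `expand` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def expand(names, base, exclude=(), require=()):
    """Names matching `base`, minus those matching any `exclude`, matching every `require`."""
    return [
        n for n in names
        if fnmatchcase(n, base)
        and not any(fnmatchcase(n, p) for p in exclude)
        and all(fnmatchcase(n, p) for p in require)
    ]


names = ["a.iso", "b.iso", "foo.iso", "c.txt"]

print(expand(names, "*", exclude=["*.iso"]))      # models *{^*.iso}
print(expand(names, "*.iso", exclude=["a*"]))     # models *.iso{^a*}
print(expand(names, "*.iso", require=["*foo*"]))  # models *.iso{&*foo*}
```

Because the exclusion is checked per candidate name, an implementation inside the globbing code could skip excluded entries during directory traversal instead of building the full list first.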
What do you think?
glob * --except *.iso
string match -v \*.iso -- * should do that. I have never thought of that trick when I needed it, though.
I just wrote this:
function but
# Parse arguments
if test (count $argv) -lt 2 || string match -qvr '[*?{]' $argv[1]
echo "Usage: but QUOTED_GLOB GLOBS..." >&2
echo "Example: cp (but \*.iso *) target/" >&2
return 2
end
set except $argv[1]
set -e argv[1]
set globs $argv
# Perform match
string match -v $except -- $globs
end
It works like this:
cp (but \*.iso *) target/
But I don't know how to make the outer command fail when it's misused:
cp (but *.iso *) target/
My code is detecting the missing \ (heuristically) and returns with an error, but cp is executed all the same.
Since https://github.com/fish-shell/fish-shell/issues/839 got closed in favour of this issue, I would like to make sure that its point does not get forgotten.
The issue there is that in Fish currently one cannot use alpha/letter ranges.
E.g. one would expect something like this (perhaps with a more fish-like syntax) to remove all files that start with the letters a to f.
rm [a-f]*
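For reference, this is the set of names such a range pattern is expected to select. The sketch uses Python's `fnmatch`, which supports bracket ranges, as a stand-in; the file names are invented for the example.

```python
from fnmatch import fnmatchcase

# Invented directory listing.
names = ["alpha", "echo", "foxtrot", "golf", "zulu"]

# [a-f]* selects names whose first character lies in the range a..f.
matches = [n for n in names if fnmatchcase(n, "[a-f]*")]
print(matches)
```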
I've been thinking about a glob builtin that would expose the glob(3) C library function, including ? and character classes.
I'd add that to #7658. Make a built-in that deals with paths, including matching.
I'd be more in favor of a friendly regex-based globbing syntax to address both @silverhook's and @zanchey's requests. It avoids creating an inferior DSL.
Do you mean adding a builtin that supports regex-based syntax, or adding the actual syntax?
Because adding syntax here is an annoying, breaking change, while adding a builtin can easily be done, and using glob(3) makes it reasonably easy to implement. We could then add a regex-mode on top, but doing that in a performant manner won't be easy.
Sorry, I was referring to actual parser-backed syntax. If you mean a builtin with glob support, then what you're saying is fine.
One thing that would work would be disabling glob syntax for that builtin, so things like *foo* are passed along to it literally, so it can then handle the globbing.
Otherwise you'd have to awkwardly
path match '*foo*' # quotes or this expands before being passed, so if a path includes a `*` that will then be re-expanded
Yeah, I was thinking about that. Parser exceptions are a minefield though.
One thing that would work would be disabling glob syntax for that builtin, so things like *foo* are passed along to it literally, so it can then handle the globbing.
That would be my proposal 3. above: Add "macro" or "meta-programming" support to Fish.
I’m reluctant to embrace parser exceptions based off of the command as it would be a major breaking change and will make the mental model much more difficult. Someone just looking at the script would expect it to behave very differently than it actually does, unless they were aware of each and every special case and how it is handled.
I think I would prefer an RFC for new syntax over command-specific parser escapes.
An ugly - but very pragmatic - alternative occurred to me: if you set aside switches, a default abbr to turn a hypothetical glob built-in from glob to glob ' could guide people to entering it correctly. But I hate unexpected manipulation of the command line, especially as a touch typist; it would probably have to act like an IDE with brace/parenthesis elision would, and ignore/coalesce a ' entered in quick succession to avoid ending up with glob ''.
I’m reluctant to embrace parser exceptions based off of the command as it would be a major breaking change and will make the mental model much more difficult.
Yeah I've been testing this, and if we went with path match for globs and other subcommands for other path operations the exception would be weird or annoying.
So I'd be against that one.
Another issue is that, while glob(3) exists and does a lot of stuff like character classes and ranges (foo[1-9]), it does not do ** recursive globs (like the ones bash does with "globstar" and we do by default). So path match would in one sense be less powerful than the always-available globs.
Other than that, it's reasonably intuitive and usable.
One of the few things that I miss from zsh globbing is an expression like vim **/*(.), which means "edit all (actual) files", as determined by stat. I don’t miss most of the zsh suffix glob expressions (du -sh **/*(/N:h) is ridiculous and nigh-unreadable, and I probably meant :p and not :h), but being able to use the recursive glob and only select files would be extremely useful from my perspective.
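What "recursive glob, but only actual files" means can be sketched in Python as a walk plus a stat-based filter. This is an illustration of the selection logic only; `regular_files` is a hypothetical name, and `os.path.isfile` follows symlinks, which may differ from how a given shell treats them.

```python
import os
import tempfile


def regular_files(root):
    """Yield paths of regular files under root, recursively, in sorted order per directory."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # stat-based check; directories and other non-regular entries are skipped.
            if os.path.isfile(path):
                yield path


# Build a throwaway tree: a.txt, sub/b.txt and an empty directory sub/empty.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub", "empty"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        open(os.path.join(root, rel), "w").close()
    found = [os.path.relpath(p, root) for p in regular_files(root)]
    print(found)
```

The directories `sub` and `sub/empty` never appear in the result, which is the behaviour being asked for.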
Subscribing to this discussion to follow, because I was just bitten trying to type git add chap1[345] in fish shell, because I wanted to commit the files in the directories chap13, chap14, and chap15, while ignoring chapters 01 through 15.
Once I was reminded of the fact that fish only supports * and ?, I used git add (string match -r chap1[345] *) to accomplish my goal.
Of course, by the time I spent reading the other GitHub fish shell issues related to regular expression globbing, and re-read the documentation on 'string match', I could have manually typed git add chap13 chap14 chap15 many times. 😜
I may write a short fish function wrapper around string match -r <pattern> * for my own use, so that it's easier to remember.
I could have manually typed git add chap13 chap14 chap15 many times.
Or, you know, git add chap1{3,4,5}.
Or, you know, git add chap1{3,4,5}.
However, I think it would be nice to be able to do git add chap{1..12} instead of git add chap{1,2,3,4,5,6,7,8,9,10,11,12} to get the other case.
Or, you know, git add chap1{3,4,5}.
However, I think it would be nice to be able to do git add chap{1..12} instead of git add chap{1,2,3,4,5,6,7,8,9,10,11,12} to get the other case.
Wouldn’t that be git add chap(seq 1 12)?
Or, you know, git add chap1{3,4,5}.
However, I think it would be nice to be able to do git add chap{1..12} instead of git add chap{1,2,3,4,5,6,7,8,9,10,11,12} to get the other case.
Wouldn’t that be git add chap(seq 1 12)?
Why not both?
Does seq also work with a-z?
Ranges are #1187, btw.