msys2-runtime
msys2-runtime copied to clipboard
Command argument parsing bugs vs regular win32 tools
The MSYS2 runtime does some amount of interpretation of its own of the incoming command line string (which mostly is a feature, I guess). However the MSYS2 (or cygwin I presume) specific argument parsing breaks a number of details regarding how generic win32 processes parse their command line string.
The LLVM testsuite runs a large number of test cases, that involve invoking executables such as sed and grep, where the arguments may involve tricky strings with regexes and similar, where the exact escaping/unescaping of tricky characters is cruicial. The LLVM testsuite doesn't know if the sed or grep (or other shell utils) invoked are regular win32 executables (e.g. built with mingw, e.g. provided by the GnuWin tool suite) or MSYS2 based ones (in particular, LLVM's testsuite can automatically locate the set of tools installed as part of Git for Windows, as they generally fit the bill perfectly and often can be found installed on users' systems).
The LLVM tests have run into a number of cases where the regular win32 argument quoting/escaping rules don't work with MSYS based tools; some of the issues can be avoided by choosing a different quoting/escaping strategy, but overall there doesn't seem to be a single strategy that works consistently with both MSYS2 based tools and regular win32 based tools.
-
If one wants to pass the argument
a"bto a regular win32 process, it is escaped asa\"b, which after argument parsing/unescaping gets parsed back intoa"b. If a MSYS2 based process is passeda\"b, it receivesa\binstead. This particular bug can be avoided by quoting the whole string instead; both win32 and MSYS2 based tools parse"a\"b"asa"b. -
If one wants to pass the argument
a[b\cto a regular win32 process, it doesn't need any extra escaping. If a MSYS2 based process is passed the argumenta[b\c, the executable receives the argumenta[bc. In this case, quoting the whole argument string, into"a[b\c"makes both win32 and MSYS2 based tools receive the same, expected argument. -
If one wants to pass the argument
a\b\\c\\\\dto a regular win32 process, it doesn't need any extra escaping either, and both win32 and MSYS2 based tools receive the same, the expected argument. However, if one quotes the whole argument (as a workaround to 1. or 2. above), the MSYS2 based tool receivesa\b\c\\dinstead, which wasn't what was intended. -
As a separate workaround to 2. and 3. above, one can choose to instead set the env variable
MSYS=noglob. That works around both issues 2. and 3., but breaks issue 1. above again, making botha\"band"a\"b"be parsed asa\binstead ofa\"b.
Or more consistently, here are the sets of inputs and parsed outputs:
| # | Input | MSYS2 default | MSYS=noglob | win32 |
|---|---|---|---|---|
| 1 | a[b\c |
a[bc |
a[b\c |
a[b\c |
| 2 | a\b\\c\\\\d |
a\b\\c\\\\d |
a\b\\c\\\\d |
a\b\\c\\\\d |
| 3 | a\"b |
a\b |
a\b |
a"b |
| 4 | "a[b\c" |
a[b\c |
a[b\c |
a[b\c |
| 5 | "a\b\\c\\\\d" |
a\b\c\\d |
a\b\\c\\\\d |
a\b\\c\\\\d |
| 6 | "a\"b" |
a"b |
a\b |
a"b |
So there doesn't seem to be any combination of quoting or setting MSYS=noglob that consistently reproduces what's expected (what the win32 equivalent app receives). By quoting case 1 and 3 into 4 and 6, but not quoting case 2, one can get something that works, but that's extremely brittle, as if case 2 requires quoting the argument for other reasons, it breaks again.
To reproduce, build a small test app like this with both the MSYS2 runtime and mingw, and try invoking it (e.g. from cmd.exe) with the various test input arguments:
#include <stdio.h>
int main(int argc, char **argv) {
for (int i = 0; i < argc; i++)
printf("%d: %s\n", i, argv[i]);
return 0;
}
I presume this is mostly an upstream cygwin issue and there's not much you want to try to do about it at the MSYS2 level, but I thought I'd file it here first, for visibility and input, before proceeding further upstream.
@dscho - I've been told you might have some insight into these parts of the runtime
(there was #8, not sure if related and if it was ever applied upstream)
(there was #8, not sure if related and if it was every applied upstream)
Oh, awesome - without testing it myself yet, that does sound from the outset like it fixes exactly this. Unfortunately, the last iteration of the patchset upstream seems to have been in November, https://cygwin.com/pipermail/cygwin-patches/2020q4/010773.html, and there doesn't seem to be any update on it.
I am really concerned about a change like this. It is prone to break existing setups. Look e.g. at the lengths Git for Windows goes to quote things accurately for MSYS2: https://github.com/git-for-windows/git/blob/v2.31.1.windows.1/compat/mingw.c#L1452-L1480
If you change this, your desires might be addressed, but think about all the problems you cause.
If you want to introduce this as an opt-in (via a new MSYS knob, for example), you have my support, otherwise I would like to veto this on behalf on the millions of users out there.
I am really concerned about a change like this. It is prone to break existing setups. Look e.g. at the lengths Git for Windows goes to quote things accurately for MSYS2: https://github.com/git-for-windows/git/blob/v2.31.1.windows.1/compat/mingw.c#L1452-L1480
Thanks for pointing out this routine - so this works around the existing issue by inspecting (or guessing based on the path?) whether the process to spawn is an msys based executable, and applies custom quoting logic for that case? That confirms what I concluded from the LLVM testing point of view also, that to reliably quote any tricky argument with msys callees, one has to resort to that. I made such a PoC for the LLVM tests too (that inspects binaries and checks whether they link against the msys dll) - so far those tests have managed to avoid that by pure luck by using a quoting strategy that happen to work right for the sets of inputs that are processed, so I'd rather avoid adding such a routine there too as long as it can be avoided.
If you change this, your desires might be addressed, but think about all the problems you cause.
Indeed, fixing the root cause here breaks any user that adapts to the current detailed parsing behaviour of msys executables.
If you want to introduce this as an opt-in (via a new
MSYSknob, for example), you have my support, otherwise I would like to veto this on behalf on the millions of users out there.
Interestingly enough, I didn't see this issue raised in the review in upstream cygwin so far, except for the last bullet at https://cygwin.com/pipermail/cygwin-patches/2020q2/010307.html (which seemed to go unanswered). It's not merged there though.
But keeping both codepaths and making the new form an opt-in feature sounds like a sensible way to go, to allow a gradual transition. In any case, I presume msys2 isn't willing to take on such new code until it's at least merged and internalized upstream.
Interestingly enough, I didn't see this issue raised in the review in upstream cygwin so far, except for the last bullet at https://cygwin.com/pipermail/cygwin-patches/2020q2/010307.html (which seemed to go unanswered). It's not merged there though.
My reading on it is that they're not really enthusiastic about taking that patch.
And I don't fault them. The most important impact of that patch (if my recollection is correct) is that calling Cygwin programs from outside Cygwin will use the new command-line parser. IIRC when calling a Cygwin program from another Cygwin program, the arguments are passed in a different way that does not need quoting on the caller side and unquoting on the callee's.
And since that is not necessarily a scenario that is really interesting to Cygwin (the idea is to stay within Cygwin), I would wager a bet that they continue to be happy without this change.
Now, the question is whether MSYS2 (for which mixing and matching MSYS2/MINGW program invocations is a lot more natural than it is for Cygwin) cares. If so, I think we should implement that toggle via the environment variable MSYS, and once we tested this for a couple months/years and are happy that it works for us, we should try to convince Cygwin to accept this change, too.
Note: this is a reversal of my earlier position where I thought this patch should go into Cygwin first. I had not understood the use case properly.