Nim
Nim copied to clipboard
fix #14343, parseCmdLine
fix #14343, parseCmdLine
backward compatibility
-
parseCmdLine
now returns correct results on posix and is such that:parseCmdLine(quoteShellCommand(a)) == a
-
parseCmdLine
now raises on posix for invalid inputs (e.g. quotes not properly closed).
future work
- [ ] expose
iterator shlex
- [ ] add tests for windows and consider whether
CommandLineToArgvW
should be used there
links
- https://docs.microsoft.com/en-us/cpp/cpp/main-function-command-line-args?redirectedfrom=MSDN&view=msvc-160
My question is, why are we using custom logic, rather than CommandLineToArgvW
? Using any algorithm other than the one that function uses is a good way to have differences between the output parseCmdLine and the arguments a program would be passed if it were invoked via powershell or the command prompt.
My question is, why are we using custom logic, rather than CommandLineToArgvW?
that API is only for windows, and I'm not changing the behavior on windows in this PR; feel free to open another PR to use CommandLineToArgvW
for windows; but note it should also keep working at CT, with nimscript and for js
Change in behavior plus it changes logic we lived with for years: Bugfix needs to be opt-in, later opt-out.
Change in behavior plus it changes logic we lived with for years: Bugfix needs to be opt-in, later opt-out.
done, see -d:nimPreviewParseCmdLine
I used the same logic + rationale as in https://github.com/nim-lang/Nim/pull/18486:
-
-d:nimPreviewParseCmdLine
is enabled in devel so the bugfix is immediately available by default for users tracking devel -
-d:nimPreviewParseCmdLine
is opt-in in 1.6 branch, and nothing needs to be done in version-1-6 branch
With this PR and -d:nimPreviewParseCmdLine
import os
echo parseCmdLine("\"\"\"takes quoted -s \"\"\"\"\"")
echo parseCmdLine("\"\"\"takes quoted -s \" \"\"\"\"")
echo "AIEEEEE - Below RAISES with -d:nimPreviewParseCmdLine"
echo parseCmdLine("\"\"\"takes quoted -s \"\"\"\"")
produces
@["takes quoted -s "]
@["takes quoted -s ", ""]
I.e., a quoted space acts quite differently than a quoted empty string when inside triple quotes.
I actually use this call in cligen
config files in two places - environment variable interpretation and inline alias interpretation. End command-line users are intended to edit both. The latter is obscure, but the former is literally recommended in the README since forever.
There are like 40 reverse deps on cligen
in just the nimbleverse. Impacted end users may not even have written their own config files or env.var settings for shell init files, but rather copied them. This is even more likely if weird quoting rules are involved. A tool author/packager could easily recompile/test against a "simple" config, not running into trouble. They might even share that "tested" binary via /usr/local/bin|wherever.
In short, it is very imaginable that this could lead to end users being at a near total loss if someone recompiles against a new Nim and exceptions are now raised when none before were possible. And this is only for 2 use cases in cligen
. While I have not done a survey, I expect many or even most use cases of parseCmdLine
will be on "end user input" of some kind.
Even following the proposed opt-in/opt-out procedure could mean frustrated end users getting crashes at the opt-out stage and needing to ask someone for a re-compile -- if they even know to ask, if they even know the now-crashing tool is written in Nim.
Tagging @pb-cdunn because I know he has tons of dark/non-nimbleverse code that might use include cligen/mergeCfgEnv
and I think he has end users besides himself. Honestly, there are a dozen people I could/should tag, many more I would not even know to tag and possibly some one (gasp) not even on github.
I know @Araq does not need to hear all this. He basically gave a more brief argument. I personally dislike the current rules, but this change would seem ill-advised even before we began being more careful about frivolous breaking changes.
If you have a burning need to "fix this", it would be better to add whole new API call like parsePosixCmd
. Then changing or not can be a package author's negotiation with their users at whatever pace makes sense. In that context, rather than "beginner's guides", it would be best to follow some "real" POSIX spec or talk to the guy working on https://www.oilshell.org/ or adapt a real parser from Dash/something. In fact, I think this is a very isolated project that would be great starting life as a package outside the stdlib.
If you have a burning need to "fix this", it would be better to add whole new API call like parsePosixCmd. Then changing or not can be a package author's negotiation with their users at whatever pace makes sense.
Amen. Let's have parsePosixCmd
instead and document the non-standard behavior of the old proc.
In fact, https://github.com/SolitudeSF/shlex already exists. I doubt from a glance that it is fully compliant with the POSIX spec. @SolitudeSF might know. It would be best to make one of those in a package and really test it thoroughly before pushing some half-correct new thing on everyone. There are shell conformance test suites that could be adapted. Then Nim core won't have to worry about staying backwards compatible with any non-conformities in early impls.
With this PR and -d:nimPreviewParseCmdLine produces [...]
in other words, my PR works exactly as intended, and parses the cmdline like the shell (or any other tool parsing cmdlines, eg python3 shlex
, or C wordexp
, etc), unlike before this PR which produces something useless that has no relation to how any posix shell work.
# fun.nim:
import os
echo "shell output: ", commandLineParams()
# main.nim:
when true:
import os
template fn(cmd) =
echo()
let ret = execShellCmd("/tmp/t12703b" & " " & cmd)
echo (ret: ret)
echo "parse output: ", parseCmdLine(cmd)
fn "\"\"\"takes quoted -s \"\"\"\"\""
fn "\"\"\"takes quoted -s \" \"\"\"\""
fn "\"\"\"takes quoted -s \"\"\"\""
nim c -o:/tmp/t12703b fun.nim nim r -d:nimPreviewParseCmdLine main
shell output: @["takes quoted -s "]
(ret: 0)
parse output: @["takes quoted -s "]
shell output: @["takes quoted -s ", ""]
(ret: 0)
parse output: @["takes quoted -s ", ""]
sh: -c: line 0: unexpected EOF while looking for matching `"'
sh: -c: line 1: syntax error: unexpected end of file
(ret: 2)
/Users/timothee/git_clone/nim/timn/tests/nim/all/t12703.nim(22) t12703
/Users/timothee/git_clone/nim/Nim_prs/lib/std/private/shlexutils.nim(104) parseCmdLine
Error: unhandled exception: error: seUnclosedDoubleQuote a: """takes quoted -s """" [ValueError]
I.e., a quoted space acts quite differently than a quoted empty string when inside triple quotes.
there's no such thing as triple quotes, the shell parses this:
echo parseCmdLine("\"\"\"takes quoted -s \"\"\"\"\"")
as this:
"""takes quoted -s """""
=>
("")("takes quoted -s ")("")("")
=>
@["takes quoted -s "]
it works as intended, as the shell, and as wordexp, shlex etc
likewise with this:
"\"\"\"takes quoted -s \" \"\"\"\""
=>
"""takes quoted -s " """"
=>
("")("takes quoted -s ") ("")("")
=>
@["takes quoted -s ", ""]
also working as intended, and as all other posix shells/cmdline parsing tools
likewise with the last case, which as you can see produces an error in the shell as well as in wordexp or python shlex or any other cmdline parsing tool
Sorry, but "what no shell uses" != "useless." The doc for parseCmdLine
never said "it works like a shell" whatever your personal expectations might be. Also, shells are far from a trusted authority on good quoting rules. In fact, they are notoriously bad. Copying bad design is not good design. Bell Labs guys I know (where Bourne came from) prefer Plan 9 RC Shell quoting rules.
The new exception raising is the real problem, as immediately identified by @Araq and already explained in detail, not any qualms you have with my example that produces such raising.
Sorry, but "what no shell uses" != "useless." The doc for parseCmdLine never said "it works like a shell" whatever your personal expectations might be.
Well the documentation says:
## On Posix systems, it uses the following parsing rules:
## Components are separated by whitespace unless the whitespace
## occurs within ``"`` or ``'`` quotes.
I wrote this code 11 years ago, as we have seen by now people depend on the documented behavior. You cannot "fix" 11 year old code that works according to its own documentation. You can only introduce new procs.
I wrote this code 11 years ago, as we have seen by now people depend on the documented behavior. You cannot "fix" 11 year old code that works according to its own documentation. You can only introduce new procs.
@Araq done, I've reverted the change to parseCmdLine
and introduced parseShellCommand
which is such that parseShellCommand(quoteShellCommand(a)) == a
and matches quoting rules for "'\
on posix. ${|`!
etc are intentionally treated as normal characters to keep the shell specific custom logic out of this API and make parseShellCommand
a drop-in (cross platform) replacement for parseCmdLine
without the bugs from https://github.com/nim-lang/Nim/issues/14343; in particular interpretation of $ENVVAR
or `pwd`
is out of scope for parseShellCommand
.
A future PR should deprecate parseCmdLine
because its behavior doesn't match how shells work on posix, but this can be debated elsewhere.
It has been amply explained why parseCmdLine
need never be deprecated.
The inverse relation you seem fond of will never be true in reverse { quote(parse())=identity } since parsing erases info intentionally. You may want to mention that since you mention the round-trip in one direction.
Your doc comment says "verifies" this one-direction-inverse relation. I believe "satisifies" better expresses your meaning or you could re-arrange to say the expression "holds". "Verifies" suggests an active validation running the transformation which does not/should not happen, IIUC.
"Future work" comments that "can" but may never happen seem like distracting commentary. This commentary is especially out of place when "handle special chars (implicitly how various shells might)" still sounds very much like an external package to me - it can rapidly become a whole shell-written-in-Nim project which I doubt @Araq wants in the stdlib anytime soon.
I say all this because you seem agitated that the not-advertised-as-a-shell-parser parseCmdLine is somehow false advertising/needs to go simply because quoteShell also exists. So, better to accurately advertise the new thing both to users & Nim hackers. (Beats me why they cannot both coexist forever.)
@araq PTAL
It has been amply explained why parseCmdLine need never be deprecated.
I'm not deprecating it in this PR so there's no point in debating this here
The inverse relation you seem fond of will never be true in reverse { quote(parse())=identity } since parsing erases info intentionally. You may want to mention that since you mention the round-trip in one direction.
ok, I added a comment reminding this.
you seem fond of
that's hardly an odd requirement. Yes, I am fond of sane APIs.
Your doc comment says "verifies" this one-direction-inverse relation. I believe "satisifies" better expresses your meaning or you could re-arrange to say the expression "holds". "Verifies" suggests an active validation running the transformation which does not/should not happen, IIUC.
changed to satisfies
"Future work" comments that "can" but may never happen seem like distracting commentary.
I've removed that comment
Interesting thread. @timotheecour, nice work. But I'm glad that folks are discouraging you from altering existing behavior.
"Bugfix needs to be opt-in, later opt-out."
That makes sense to me. I appreciate the concern for backward compatibility shown by araq and c-blake. I can see value in your new functions (and they could be func
, I think) as additions to the standard library.
I also agree heartily with araq (in another thread) that shell-parsing is the domain of the shell. Anything that tries to mimic shell behavior will always fall a bit short. It's better to simplify and make assumptions. And in that case, I like the idea of throwing an exception when an assumption is violated. But again, we don't want to modify the API.
So as a compromise, I would add an optional argument to parseCmdLine()
that would cause it to throw on many violations. (I would clarify that we reserve the right to throw on even more violations in the future, and that we only agree never to throw when the input satisfies clear and obvious assumptions. But maybe that over-complicates.) Such an option might also serve as the "opt-in"
proc parseCmdLine(c: string, strict=false): seq[string]
## strict=true may throw an exception if fairly obvious assumptions are violated.
## Strict-mode also fixes some small bugs in the non-strict parser.
and they could be func, I think
done
the other comments are not relevant anymore because this PR now introduces a new API parseShellCommand
instead of modifying the existing one parseCmdLine
.
Yeah, but I actually liked the fact that you were fixing, or "clarifying" the existing API. I hate to lose all your discoveries on the idiosyncrasies of the legacy API.
But I'm glad that folks are discouraging you from altering existing behavior But again, we don't want to modify the API. [..] Yeah, but I actually liked the fact that you were fixing, or "clarifying" the existing API. I hate to lose all your discoveries on the idiosyncrasies of the legacy API.
your 2 suggestions (preserving behavior and fixing bug in existing API) are mutually incompatible.
@araq can we please move forward with this PR or what else is blocking it?
@Araq can we please move forward with this PR or what else is blocking it?
Well the old code should not be "fixed" and a new shell command parser can start its life as a Nimble package...
no, this should not be moved to a nimble package. The feature itself (parsing shell cmdline) is stdlib territory and is in fact already being used in many places including stdlib, tools and compiler, except it doesn't work when the input falls into cases mentioned in https://github.com/nim-lang/Nim/issues/14343, causing one to use workarounds
eg, in testament.nim:
# we avoid using `parseCmdLine` which is buggy, refs bug #14343
result = cmdTemplate % ["target", targetToCmd[target],
"options", options, "file", filename.quoteShell,
"filedir", filename.getFileDir(), "nim", compilerPrefix]
or workarounds like these: ("--nimCache:" & nimcache).quoteShell
because "--nimCache:$#" % nimcache.quoteShell
was not working before this PR. (see also https://github.com/nim-lang/Nim/pull/18086). These are fixed by this PR. (there's also allowWhitespaceAfterColon
but that's a separate topic, refs https://github.com/nim-lang/Nim/issues/9619, which causes --foo:"" bar
to be treated by parseOpt in a very non-intuitive way in which bar is an argument of --foo instead of empty string being argument of foo; separate topic from this PR, but which should be addressed eventually by changing default for allowWhitespaceAfterColon
)
All uses of parseCmdLine should in fact be replaced by parseShellCommand in followup PRs and I honestly don't know of any use case of parseCmdLine with the pre-existing behavior but I'm not touching nor deprecating the behavior of parseCmdLine as requested (even though I disagree with that); even the "counter example" given in https://github.com/nim-lang/Nim/pull/18622#issuecomment-896242203 turns out to work as it should after this PR, as shown in https://github.com/nim-lang/Nim/pull/18622#issuecomment-896307562 which shows it does exactly what the shell does.
Just because some code that was introduced 11 years ago had a bug doesn't mean we should have to live with it forever or prevent us from introducing a replacement API that doesn't have this bug.
The pre-existing code had exactly 0 tests, whereas this API has extensive tests, and match both python shlex and shell behavior.
a new shell command parser can start its life as a Nimble package...
As mentioned there is one already: https://github.com/SolitudeSF/shlex even named similarly and which already does "future work" listed here (exposing the iterator) and just gives a bool
if there was an error rather than raising exceptions.
Also, parse(quote())=identity is just a made up requirement if quote is for other parsers (real shells that would execute). No real contact with "sanity". splitQuoted
might be a name less building of false hope for a full POSIX shell parser (which is what is expected on the other side of quoteShellCommand
, AFAIU). "split" in there more suggests the well known combined iterator/func API from strutils
. That naming may also track with @pb-cdunn's idea of harmoniously coexisting split
s. Another syntax variation is preserving delimiting text so that join(splitQ(), "")=identity
- another made up requirement (probably with more output from splitQ). splitPlan9rcQ
is another good one. splitJustEscaped
maybe, too. The output of "parse*" routines often includes token types and so on.
There are a lot of choices here and it all really "just depends" upon what both users & client code want/expect. I mean, you could force users to type JSON ["these", "are", "my", "strings"]
with no PR. You even mentioned that on some cligen issue. When there are not active conflicts, I personally like to support of "all the above since circumstances vary with docs to address confusion", but I am also not on the hook to maintain the stdlib.
Since I cannot find a single nimble dependency upon shlex
, I'm not sure how critical any of this really is. I do think stdlib/stdtools being able to depend upon packages would be a good ability and allow @Araq not have to oversee so much stuff he doesn't care much about. Maybe atlas
is the beginning of that solution.
renamed to splitShellCmd
.
your 2 suggestions (preserving behavior and fixing bug in existing API) are mutually incompatible.
Strictly speaking, yes. But I would add an argument to the existing API to control whether it throws exceptions on unexpected behavior. That would provide a very easy way to continue to use the old API but suddenly restrict it.
You found real problems with the existing API, but most users will never be aware of them unless they read this thread. The old API is not terrible as long as people understand what it doesn't do.
I do like your new code though.
But I would add an argument to the existing API to control whether it throws exceptions on unexpected behavior.
adding a throws=false
argument wouldn't help with the other problems that are fixed by splitShellCmd
, e.g.
echo parseCmdLine("foo --nimcache:'hi world'") # @["foo", "--nimcache:\'hi", "world\'"]
echo splitShellCmd("foo --nimcache:'hi world'") # @["foo", "--nimcache:hi world"]
since introducing a breaking change was deemed unacceptable to some users, this leaves us with introducing a separate API splitShellCmd
which is what this PR does.
@araq is there anything else blocking this PR?
@Araq is there anything else blocking this PR?
Read the diff yourself: The changelog still mentions the breaking change, you still "patched" os.nim instead of only introducing yet another stdlib module. But even then I don't know why we need a new stdlib module, if you want to use SolitudeSF/shlex in testament, use SolitudeSF/shlex in testament. We should try to use the new atlas tool to setup the dependency.
Read the diff yourself: The changelog still mentions the breaking change
artifact of /changelog.md merge=union
; latest commit fixes the changelog
you still "patched" os.nim instead of only introducing yet another stdlib module.
it clearly belongs in std/os because std/os contains also quoteShell and quoteShell command which relates to this API via the relation splitShellCmd(quoteShellCommand(a)) == a
and both APIs are commonly used together in a given program; basic tight cohesion principle. Furthermore it sits next to parseCmdLine
which it intends to replace for almost all client code (still unaware of any valid use case justifying the current behavior of parseCmdLine
in aforementioned edge cases, but I've not deprecated it nevertheless).
But even then I don't know why we need a new stdlib module, if you want to use SolitudeSF/shlex in testament, use SolitudeSF/shlex in testament. We should try to use the new atlas tool to setup the dependency.
I'm not introducing a new stdlib module, std/private/shlexutils
is private (at least for now); the API clearly belongs in stdlib because splitting a shell cmdline into arguments is a common operation as evidenced by the callers of parseCmdLine
in stdlib, testament, tools, and 3rd party packages. Not to mention that SolitudeSF/shlex
has its own bugs; I'm following the behavior of python shlex in this PR.
This pull request has been automatically marked as stale because it has not had recent activity. If you think it is still a valid PR, please rebase it on the latest devel; otherwise it will be closed. Thank you for your contributions.