Nim icon indicating copy to clipboard operation
Nim copied to clipboard

fix #14343, parseCmdLine

Open timotheecour opened this issue 3 years ago • 27 comments

fix #14343, parseCmdLine

backward compatibility

  • parseCmdLine now returns correct results on posix and is such that: parseCmdLine(quoteShellCommand(a)) == a
  • parseCmdLine now raises on posix for invalid inputs (e.g. quotes not properly closed).

future work

  • [ ] expose iterator shlex
  • [ ] add tests for windows and consider whether CommandLineToArgvW should be used there

links

  • https://docs.microsoft.com/en-us/cpp/cpp/main-function-command-line-args?redirectedfrom=MSDN&view=msvc-160

timotheecour avatar Jul 31 '21 18:07 timotheecour

My question is, why are we using custom logic, rather than CommandLineToArgvW? Using any algorithm other than the one that function uses is a good way to have differences between the output parseCmdLine and the arguments a program would be passed if it were invoked via powershell or the command prompt.

Varriount avatar Jul 31 '21 22:07 Varriount

My question is, why are we using custom logic, rather than CommandLineToArgvW?

that API is only for windows, and I'm not changing the behavior on windows in this PR; feel free to open another PR to use CommandLineToArgvW for windows; but note it should also keep working at CT, with nimscript and for js

timotheecour avatar Jul 31 '21 23:07 timotheecour

Change in behavior plus it changes logic we lived with for years: Bugfix needs to be opt-in, later opt-out.

Araq avatar Aug 08 '21 17:08 Araq

Change in behavior plus it changes logic we lived with for years: Bugfix needs to be opt-in, later opt-out.

done, see -d:nimPreviewParseCmdLine

I used the same logic + rationale as in https://github.com/nim-lang/Nim/pull/18486:

  • -d:nimPreviewParseCmdLine is enabled in devel so the bugfix is immediately available by default for users tracking devel
  • -d:nimPreviewParseCmdLine is opt-in in 1.6 branch, and nothing needs to be done in version-1-6 branch

timotheecour avatar Aug 08 '21 18:08 timotheecour

With this PR and -d:nimPreviewParseCmdLine

import os
echo parseCmdLine("\"\"\"takes quoted -s \"\"\"\"\"")
echo parseCmdLine("\"\"\"takes quoted -s \" \"\"\"\"")
echo "AIEEEEE - Below RAISES with -d:nimPreviewParseCmdLine"
echo parseCmdLine("\"\"\"takes quoted -s \"\"\"\"")

produces

@["takes quoted -s "]
@["takes quoted -s ", ""]

I.e., a quoted space acts quite differently than a quoted empty string when inside triple quotes.

I actually use this call in cligen config files in two places - environment variable interpretation and inline alias interpretation. End command-line users are intended to edit both. The latter is obscure, but the former is literally recommended in the README since forever.

There are like 40 reverse deps on cligen in just the nimbleverse. Impacted end users may not even have written their own config files or env.var settings for shell init files, but rather copied them. This is even more likely if weird quoting rules are involved. A tool author/packager could easily recompile/test against a "simple" config, not running into trouble. They might even share that "tested" binary via /usr/local/bin|wherever.

In short, it is very imaginable that this could lead to end users being at a near total loss if someone recompiles against a new Nim and exceptions are now raised when none before were possible. And this is only for 2 use cases in cligen. While I have not done a survey, I expect many or even most use cases of parseCmdLine will be on "end user input" of some kind.

Even following the proposed opt-in/opt-out procedure could mean frustrated end users getting crashes at the opt-out stage and needing to ask someone for a re-compile -- if they even know to ask, if they even know the now-crashing tool is written in Nim.

Tagging @pb-cdunn because I know he has tons of dark/non-nimbleverse code that might use include cligen/mergeCfgEnv and I think he has end users besides himself. Honestly, there are a dozen people I could/should tag, many more I would not even know to tag and possibly some one (gasp) not even on github.

I know @Araq does not need to hear all this. He basically gave a more brief argument. I personally dislike the current rules, but this change would seem ill-advised even before we began being more careful about frivolous breaking changes.

If you have a burning need to "fix this", it would be better to add whole new API call like parsePosixCmd. Then changing or not can be a package author's negotiation with their users at whatever pace makes sense. In that context, rather than "beginner's guides", it would be best to follow some "real" POSIX spec or talk to the guy working on https://www.oilshell.org/ or adapt a real parser from Dash/something. In fact, I think this is a very isolated project that would be great starting life as a package outside the stdlib.

c-blake avatar Aug 10 '21 19:08 c-blake

If you have a burning need to "fix this", it would be better to add whole new API call like parsePosixCmd. Then changing or not can be a package author's negotiation with their users at whatever pace makes sense.

Amen. Let's have parsePosixCmd instead and document the non-standard behavior of the old proc.

Araq avatar Aug 10 '21 20:08 Araq

In fact, https://github.com/SolitudeSF/shlex already exists. I doubt from a glance that it is fully compliant with the POSIX spec. @SolitudeSF might know. It would be best to make one of those in a package and really test it thoroughly before pushing some half-correct new thing on everyone. There are shell conformance test suites that could be adapted. Then Nim core won't have to worry about staying backwards compatible with any non-conformities in early impls.

c-blake avatar Aug 10 '21 20:08 c-blake

With this PR and -d:nimPreviewParseCmdLine produces [...]

in other words, my PR works exactly as intended, and parses the cmdline like the shell (or any other tool parsing cmdlines, eg python3 shlex, or C wordexp, etc), unlike before this PR which produces something useless that has no relation to how any posix shell work.

# fun.nim:
import os
echo "shell output: ", commandLineParams()

# main.nim:
when true:
  import os
  template fn(cmd) =
    echo()
    let ret = execShellCmd("/tmp/t12703b" & " " & cmd)
    echo (ret: ret)
    echo "parse output: ", parseCmdLine(cmd)
  fn "\"\"\"takes quoted -s \"\"\"\"\""
  fn "\"\"\"takes quoted -s \" \"\"\"\""
  fn "\"\"\"takes quoted -s \"\"\"\""

nim c -o:/tmp/t12703b fun.nim nim r -d:nimPreviewParseCmdLine main


shell output: @["takes quoted -s "]
(ret: 0)
parse output: @["takes quoted -s "]

shell output: @["takes quoted -s ", ""]
(ret: 0)
parse output: @["takes quoted -s ", ""]

sh: -c: line 0: unexpected EOF while looking for matching `"'
sh: -c: line 1: syntax error: unexpected end of file
(ret: 2)
/Users/timothee/git_clone/nim/timn/tests/nim/all/t12703.nim(22) t12703
/Users/timothee/git_clone/nim/Nim_prs/lib/std/private/shlexutils.nim(104) parseCmdLine
Error: unhandled exception: error: seUnclosedDoubleQuote a: """takes quoted -s """" [ValueError]

I.e., a quoted space acts quite differently than a quoted empty string when inside triple quotes.

there's no such thing as triple quotes, the shell parses this:

echo parseCmdLine("\"\"\"takes quoted -s \"\"\"\"\"")

as this:

"""takes quoted -s """""
=>
("")("takes quoted -s ")("")("")
=>
@["takes quoted -s "]

it works as intended, as the shell, and as wordexp, shlex etc

likewise with this:

"\"\"\"takes quoted -s \" \"\"\"\""
=>
"""takes quoted -s " """"
=>
("")("takes quoted -s ") ("")("")
=>
@["takes quoted -s ", ""]

also working as intended, and as all other posix shells/cmdline parsing tools

likewise with the last case, which as you can see produces an error in the shell as well as in wordexp or python shlex or any other cmdline parsing tool

timotheecour avatar Aug 10 '21 20:08 timotheecour

Sorry, but "what no shell uses" != "useless." The doc for parseCmdLine never said "it works like a shell" whatever your personal expectations might be. Also, shells are far from a trusted authority on good quoting rules. In fact, they are notoriously bad. Copying bad design is not good design. Bell Labs guys I know (where Bourne came from) prefer Plan 9 RC Shell quoting rules.

c-blake avatar Aug 10 '21 20:08 c-blake

The new exception raising is the real problem, as immediately identified by @Araq and already explained in detail, not any qualms you have with my example that produces such raising.

c-blake avatar Aug 10 '21 21:08 c-blake

Sorry, but "what no shell uses" != "useless." The doc for parseCmdLine never said "it works like a shell" whatever your personal expectations might be.

Well the documentation says:


  ## On Posix systems, it uses the following parsing rules:
  ## Components are separated by whitespace unless the whitespace
  ## occurs within ``"`` or ``'`` quotes.

I wrote this code 11 years ago, as we have seen by now people depend on the documented behavior. You cannot "fix" 11 year old code that works according to its own documentation. You can only introduce new procs.

Araq avatar Aug 10 '21 21:08 Araq

I wrote this code 11 years ago, as we have seen by now people depend on the documented behavior. You cannot "fix" 11 year old code that works according to its own documentation. You can only introduce new procs.

@Araq done, I've reverted the change to parseCmdLine and introduced parseShellCommand which is such that parseShellCommand(quoteShellCommand(a)) == a and matches quoting rules for "'\ on posix. ${|`! etc are intentionally treated as normal characters to keep the shell specific custom logic out of this API and make parseShellCommand a drop-in (cross platform) replacement for parseCmdLine without the bugs from https://github.com/nim-lang/Nim/issues/14343; in particular interpretation of $ENVVAR or `pwd` is out of scope for parseShellCommand.

A future PR should deprecate parseCmdLine because its behavior doesn't match how shells work on posix, but this can be debated elsewhere.

timotheecour avatar Aug 11 '21 07:08 timotheecour

It has been amply explained why parseCmdLine need never be deprecated.

The inverse relation you seem fond of will never be true in reverse { quote(parse())=identity } since parsing erases info intentionally. You may want to mention that since you mention the round-trip in one direction.

Your doc comment says "verifies" this one-direction-inverse relation. I believe "satisifies" better expresses your meaning or you could re-arrange to say the expression "holds". "Verifies" suggests an active validation running the transformation which does not/should not happen, IIUC.

"Future work" comments that "can" but may never happen seem like distracting commentary. This commentary is especially out of place when "handle special chars (implicitly how various shells might)" still sounds very much like an external package to me - it can rapidly become a whole shell-written-in-Nim project which I doubt @Araq wants in the stdlib anytime soon.

I say all this because you seem agitated that the not-advertised-as-a-shell-parser parseCmdLine is somehow false advertising/needs to go simply because quoteShell also exists. So, better to accurately advertise the new thing both to users & Nim hackers. (Beats me why they cannot both coexist forever.)

c-blake avatar Aug 11 '21 14:08 c-blake

@araq PTAL

It has been amply explained why parseCmdLine need never be deprecated.

I'm not deprecating it in this PR so there's no point in debating this here

The inverse relation you seem fond of will never be true in reverse { quote(parse())=identity } since parsing erases info intentionally. You may want to mention that since you mention the round-trip in one direction.

ok, I added a comment reminding this.

you seem fond of

that's hardly an odd requirement. Yes, I am fond of sane APIs.

Your doc comment says "verifies" this one-direction-inverse relation. I believe "satisifies" better expresses your meaning or you could re-arrange to say the expression "holds". "Verifies" suggests an active validation running the transformation which does not/should not happen, IIUC.

changed to satisfies

"Future work" comments that "can" but may never happen seem like distracting commentary.

I've removed that comment

timotheecour avatar Aug 11 '21 18:08 timotheecour

Interesting thread. @timotheecour, nice work. But I'm glad that folks are discouraging you from altering existing behavior.

"Bugfix needs to be opt-in, later opt-out."

That makes sense to me. I appreciate the concern for backward compatibility shown by araq and c-blake. I can see value in your new functions (and they could be func, I think) as additions to the standard library.

I also agree heartily with araq (in another thread) that shell-parsing is the domain of the shell. Anything that tries to mimic shell behavior will always fall a bit short. It's better to simplify and make assumptions. And in that case, I like the idea of throwing an exception when an assumption is violated. But again, we don't want to modify the API.

So as a compromise, I would add an optional argument to parseCmdLine() that would cause it to throw on many violations. (I would clarify that we reserve the right to throw on even more violations in the future, and that we only agree never to throw when the input satisfies clear and obvious assumptions. But maybe that over-complicates.) Such an option might also serve as the "opt-in"

proc parseCmdLine(c: string, strict=false): seq[string]

## strict=true may throw an exception if fairly obvious assumptions are violated.
## Strict-mode also fixes some small bugs in the non-strict parser.

pb-cdunn avatar Aug 11 '21 21:08 pb-cdunn

and they could be func, I think

done

the other comments are not relevant anymore because this PR now introduces a new API parseShellCommand instead of modifying the existing one parseCmdLine.

timotheecour avatar Aug 11 '21 22:08 timotheecour

Yeah, but I actually liked the fact that you were fixing, or "clarifying" the existing API. I hate to lose all your discoveries on the idiosyncrasies of the legacy API.

pb-cdunn avatar Aug 13 '21 15:08 pb-cdunn

But I'm glad that folks are discouraging you from altering existing behavior But again, we don't want to modify the API. [..] Yeah, but I actually liked the fact that you were fixing, or "clarifying" the existing API. I hate to lose all your discoveries on the idiosyncrasies of the legacy API.

your 2 suggestions (preserving behavior and fixing bug in existing API) are mutually incompatible.

@araq can we please move forward with this PR or what else is blocking it?

timotheecour avatar Aug 13 '21 20:08 timotheecour

@Araq can we please move forward with this PR or what else is blocking it?

Well the old code should not be "fixed" and a new shell command parser can start its life as a Nimble package...

Araq avatar Aug 14 '21 08:08 Araq

no, this should not be moved to a nimble package. The feature itself (parsing shell cmdline) is stdlib territory and is in fact already being used in many places including stdlib, tools and compiler, except it doesn't work when the input falls into cases mentioned in https://github.com/nim-lang/Nim/issues/14343, causing one to use workarounds

eg, in testament.nim:

  # we avoid using `parseCmdLine` which is buggy, refs bug #14343
  result = cmdTemplate % ["target", targetToCmd[target],
                      "options", options, "file", filename.quoteShell,
                      "filedir", filename.getFileDir(), "nim", compilerPrefix]

or workarounds like these: ("--nimCache:" & nimcache).quoteShell because "--nimCache:$#" % nimcache.quoteShell was not working before this PR. (see also https://github.com/nim-lang/Nim/pull/18086). These are fixed by this PR. (there's also allowWhitespaceAfterColon but that's a separate topic, refs https://github.com/nim-lang/Nim/issues/9619, which causes --foo:"" bar to be treated by parseOpt in a very non-intuitive way in which bar is an argument of --foo instead of empty string being argument of foo; separate topic from this PR, but which should be addressed eventually by changing default for allowWhitespaceAfterColon)

All uses of parseCmdLine should in fact be replaced by parseShellCommand in followup PRs and I honestly don't know of any use case of parseCmdLine with the pre-existing behavior but I'm not touching nor deprecating the behavior of parseCmdLine as requested (even though I disagree with that); even the "counter example" given in https://github.com/nim-lang/Nim/pull/18622#issuecomment-896242203 turns out to work as it should after this PR, as shown in https://github.com/nim-lang/Nim/pull/18622#issuecomment-896307562 which shows it does exactly what the shell does.

Just because some code that was introduced 11 years ago had a bug doesn't mean we should have to live with it forever or prevent us from introducing a replacement API that doesn't have this bug.

The pre-existing code had exactly 0 tests, whereas this API has extensive tests, and match both python shlex and shell behavior.

timotheecour avatar Aug 14 '21 08:08 timotheecour

a new shell command parser can start its life as a Nimble package...

As mentioned there is one already: https://github.com/SolitudeSF/shlex even named similarly and which already does "future work" listed here (exposing the iterator) and just gives a bool if there was an error rather than raising exceptions.

Also, parse(quote())=identity is just a made up requirement if quote is for other parsers (real shells that would execute). No real contact with "sanity". splitQuoted might be a name less building of false hope for a full POSIX shell parser (which is what is expected on the other side of quoteShellCommand, AFAIU). "split" in there more suggests the well known combined iterator/func API from strutils. That naming may also track with @pb-cdunn's idea of harmoniously coexisting splits. Another syntax variation is preserving delimiting text so that join(splitQ(), "")=identity - another made up requirement (probably with more output from splitQ). splitPlan9rcQ is another good one. splitJustEscaped maybe, too. The output of "parse*" routines often includes token types and so on.

There are a lot of choices here and it all really "just depends" upon what both users & client code want/expect. I mean, you could force users to type JSON ["these", "are", "my", "strings"] with no PR. You even mentioned that on some cligen issue. When there are not active conflicts, I personally like to support of "all the above since circumstances vary with docs to address confusion", but I am also not on the hook to maintain the stdlib.

Since I cannot find a single nimble dependency upon shlex, I'm not sure how critical any of this really is. I do think stdlib/stdtools being able to depend upon packages would be a good ability and allow @Araq not have to oversee so much stuff he doesn't care much about. Maybe atlas is the beginning of that solution.

c-blake avatar Aug 14 '21 12:08 c-blake

renamed to splitShellCmd.

timotheecour avatar Aug 15 '21 06:08 timotheecour

your 2 suggestions (preserving behavior and fixing bug in existing API) are mutually incompatible.

Strictly speaking, yes. But I would add an argument to the existing API to control whether it throws exceptions on unexpected behavior. That would provide a very easy way to continue to use the old API but suddenly restrict it.

You found real problems with the existing API, but most users will never be aware of them unless they read this thread. The old API is not terrible as long as people understand what it doesn't do.

I do like your new code though.

pb-cdunn avatar Aug 21 '21 18:08 pb-cdunn

But I would add an argument to the existing API to control whether it throws exceptions on unexpected behavior.

adding a throws=false argument wouldn't help with the other problems that are fixed by splitShellCmd, e.g.

echo parseCmdLine("foo --nimcache:'hi world'") # @["foo", "--nimcache:\'hi", "world\'"]
echo splitShellCmd("foo --nimcache:'hi world'") # @["foo", "--nimcache:hi world"]

since introducing a breaking change was deemed unacceptable to some users, this leaves us with introducing a separate API splitShellCmd which is what this PR does.

@araq is there anything else blocking this PR?

timotheecour avatar Aug 21 '21 22:08 timotheecour

@Araq is there anything else blocking this PR?

Read the diff yourself: The changelog still mentions the breaking change, you still "patched" os.nim instead of only introducing yet another stdlib module. But even then I don't know why we need a new stdlib module, if you want to use SolitudeSF/shlex in testament, use SolitudeSF/shlex in testament. We should try to use the new atlas tool to setup the dependency.

Araq avatar Aug 26 '21 07:08 Araq

Read the diff yourself: The changelog still mentions the breaking change

artifact of /changelog.md merge=union; latest commit fixes the changelog

you still "patched" os.nim instead of only introducing yet another stdlib module.

it clearly belongs in std/os because std/os contains also quoteShell and quoteShell command which relates to this API via the relation splitShellCmd(quoteShellCommand(a)) == a and both APIs are commonly used together in a given program; basic tight cohesion principle. Furthermore it sits next to parseCmdLine which it intends to replace for almost all client code (still unaware of any valid use case justifying the current behavior of parseCmdLine in aforementioned edge cases, but I've not deprecated it nevertheless).

But even then I don't know why we need a new stdlib module, if you want to use SolitudeSF/shlex in testament, use SolitudeSF/shlex in testament. We should try to use the new atlas tool to setup the dependency.

I'm not introducing a new stdlib module, std/private/shlexutils is private (at least for now); the API clearly belongs in stdlib because splitting a shell cmdline into arguments is a common operation as evidenced by the callers of parseCmdLine in stdlib, testament, tools, and 3rd party packages. Not to mention that SolitudeSF/shlex has its own bugs; I'm following the behavior of python shlex in this PR.

timotheecour avatar Aug 26 '21 07:08 timotheecour

This pull request has been automatically marked as stale because it has not had recent activity. If you think it is still a valid PR, please rebase it on the latest devel; otherwise it will be closed. Thank you for your contributions.

stale[bot] avatar Sep 08 '22 23:09 stale[bot]