jq
jq copied to clipboard
Allow, and strip, comments (#, // or /*)
Possibly this failure has nothing to do with the actual code changes. Here is what the log says:
C:\msys64\usr\bin\bash -lc "pacman-key --verify msys2-keyring-r21.b39fb11-1-any.pkg.tar.xz.sig"
==> Checking msys2-keyring-r21.b39fb11-1-any.pkg.tar.xz.sig...
gpg: no valid OpenPGP data found.
gpg: the signature could not be verified.
Please remember that the signature file (.sig or .asc)
should be the first file given on the command line.
==> ERROR: The signature identified by msys2-keyring-r21.b[39](https://ci.appveyor.com/project/stedolan/jq/builds/46391082/job/ngd77yg97ie5ls1f#L39)fb11-1-any.pkg.tar.xz.sig could not be verified.
Command exited with code 1
cat config.log
cat: config.log: No such file or directory
Ref: #1785
Oof, I'm not sure that C-style comments in jq code are a good idea, but I understand that if we were to accept them in JSON inputs then maybe we should accept them in jq.
The other thing is that we really do need an option for JSON input strictness.
I find JSON comments to be useless. We couldn't preserve them in jq because... what do we attach them to? I guess maybe we could have has_comment/0/get_comment/0, but there's problems with that: to what values do comments attach? do we want the bloat hit on jv? do we want to output comments attached to values?
So what I do when I need to write comments in JSON texts is this:
{
"comment0": "This string is a comment",
"a":"whatever",
"comment1": "This is another comment, and the only reason I change the comment name is in case there's any processors out there that reject duplicate keys"
}
which does not allow me to have comments inside arrays, but hey, it works.
@nicowilliams there is a proposal by @pkoppstein about a strictness option https://github.com/jqlang/jq/issues/2643
I'm very worried about the conflict against the alternative operator.
I'm very worried about the conflict against the alternative operator.
Yes, we really can't have // comment, that much is clear, but I don't think I want even /* comment */ in the jq language, though I'd be ok with stripping those from input JSON texts.
@nicowilliams @wader @itchyny I think it's worth further dividing the target use cases / proposals into sub-scopes, given that the arguments for and against may be different for each.
- Stripping (as opposed to parsing) only:
- 1a. Strip from JSON input. All we are doing is allowing JSON input to have comments, and ignoring those comments. Everything downstream from there is the same as if the input JSON never had comments to begin with. Example:
echo '{ "hi": "there" /* this is a comment */ }' | jq '.' -cyields{"hi":"there"}- Note 1: the concern re alternative operator conflict is N/A here
- Note 2: the original intended scope of this issue #2548 is limited to this sub-scope, and was not intended to extend any further
- 1b. Strip from comments from
jqfilters. Same as above but for filters rather than JSON input. Example:echo '{ "hi": "there" }' | jq '. /* this is a comment */ ' -cyields{"hi":"there"}
- Parsing comments, which then become part of the parse tree and are available to be included in the output
- 2a. Parse from JSON input. Example (with extraneous spaces in the JSON input intentionally added):
echo '{ "hi": "there" /* this is a comment */ }' | jq '.' -cyields{"hi":"there" /* this is a comment */ }- Raises several issues, including where the comment attaches to and whether the comment output style should reflect the original style
- 2b. Parse from jq filter. - Raises several issues, including where the comment attaches to, whether the comment output style should reflect the original style, and C++-style comments potentially conflicting with the alternative operator
Can we all agree that the limited scope of simply stripping comments from JSON input (1a above) is cut & dry? The comments can just be ignored and we are done. It seems to me that all the various concerns that have been raised are only relevant in the other 3 scopes (1b, 2a, 2b)
Yes, ignoring JSON "comments" on input is fine, but I would like a command-line option for strictness, and we'll probably want strict and non-strict versions of fromjson so that jq code can also benefit from strictness. The next question is: which must come first, options for strictness of comment stripping?
I understand that some users have an urgent need to strip out JSON comments. I also imagine that some users like to validate JSON with jq and don't want jq to become more permissive than it is right now. That's the tension here for me.
All the other concerns have to do with the jq language and/or with how to represent comments internally. We can certainly punt on those and not attempt to add new types of comments to the jq language, and we can also not attempt to preserve comments in JSON parsing. (I suspect we'll never want to preserve comments in JSON parses for the simple reason that it's way too bloating for jv.)
Instead of extending the current lenient parser, and instead of just skipping the JavaScript-like comments, I would like to make the parser conforming to well-defined specifications. My preference is to make the default parser to follow RFC 8259, and support --format=json5 following https://spec.json5.org/ (if we really want to support JSON with comments), and maybe --format=yaml for https://yaml.org/spec/, and --format=jq-json for transition from the current lenient parser (if someone really want).
@itchyny I think it is a great idea to target well-defined specifications. Re json5, the immediate impact with of that, with respect to this particular issue, is that # would no longer be a comment delimiter, and both C- and C++-style comments would remain in scope.
Beyond that, json5 would also support single- or multi-line strings strings that are unquoted, or that are enveloped in single quotes, and while I agree that those would be useful features, I would also propose treating them as a separate issue / PR for practical purposes. In particular, looking at jv_parse.c it seems to me that the changes necessary to accommodate these string parsing features will be significantly more invasive, and even more so if parsing yaml (so much so that if I were to take on those tasks-- which I probably cannot do, alas-- I might find it easier to swap out the JSON parser to use a parser generator rather than a handwritten one-- in which case that parser change might itself be a standalone issue/PR).
Note that JSON5 is neither an ECMA standard nor an IETF standard. Not that jq only implements standards, mind you, just that JSON5 isn't one. That shouldn't stop us adopting optional support for JSON5, but we probably don't want it to be the default. Indeed, I would like to have JSON5 support if for no other reason that object keys that are identifiers can be left unquoted, that and the trailing comma feature -- these going to be nice to have.
BTW, one thing that @stedolan told me long ago is that all these command-line options are kinda ugly, that we should aim to make it possible to handle all of them in jq code. So for example, -s is the same as -n + starting the jq program with [inputs] | , and not using -s is the same as using -n and starting the program with inputs | . Now, to really make that possible for most / all jq command-line options will take a lot of work, perhaps including finishing the co-routines features.
Now, for jq code itself one of the things we'll need is a form of fromjson/tojson for each supported format. And I agree that it'd be very nice to have not just spec conformance, but also YAML support, and maybe even XML. So maybe we need fromjson5/tojson5.
Also, I think we need to separate all the parsers for all flavors of JSON. I'd rather have lots of code duplication than to have lots of branches complicating and slowing down parsing. Ditto printing. The price to pay in bloat might be annoying, so we might need ./configure options for disabling various features.
OK this PR has been updated and now covers the following:
- C- and C++-style comments (but not
#comments) are stripped - this behavior is only enabled when a
--strip-commentsoption is enabled - added a couple tests in tests/commentstest
I understand there may not yet be consensus around the use of a --strip-comments option, but figured I'd include it for now and you can decide when you merge if you don't want it, or want to change it (or lmk and I can update the PR).
@liquidaty ah, do please get rid of the merge commit. Neither @stedolan nor I like them -- we prefer linear history upstream. If you don't know how to get rid of it I could walk you through it, or I could force push to this PR.
Will this PR be going to 1.7? We need more discussions on this topic. As a user, it does not mention the formal definition about the comment syntax, so it's not clear which form of comments are stripped; e.x. multiline comments can be nested or not. Dealing with comments in JSON will soon raise a feature request to re-format a JSON file with keeping comments, but I'm afraid such request is hard to implement within the current parser.
Will this PR be going to 1.7? We need more discussions on this topic. As a user, it does not mention the formal definition about the comment syntax, so it's not clear which form of comments are stripped; e.x. multiline comments can be nested or not. Dealing with comments in JSON will soon raise a feature request to re-format a JSON file with keeping comments, but I'm afraid such request is hard to implement within the current parser.
Great question! I think yeah, it's probably best to wait till after 1.7 to integrate this.
Because the jv representation of JSON values really can't represent comments without a lot of bloat, I think we'll never agree to preserve comments.
@liquidaty ah, do please get rid of the merge commit [...] If you don't know how to get rid of it I could walk you through it, or I could force push to this PR
@nicowilliams I unfortunately do not know (for sure) how to get rid of it and would be happy to try, but I think at least as likely to make it worse than better. Please feel free to walk me thru it or force push yourself, whatever is more convenient for you
Will this PR be going to 1.7? We need more discussions on this topic
probably best to wait till after 1.7 to integrate this
@itchyny @nicowilliams: regarding the limited scope of just stripping C- and C++-style behavior is (i.e. 1a from https://github.com/jqlang/jq/pull/2548#issuecomment-1622664788), then given:
- how uncontroversial that limited scope is
- how many users want this (this has got to be a top-10 most asked-for feature, maybe top 3, maybe top 1: see e.g. #1571, #402, #695, #1398), and
- easy it would be to include in 1.7 (since we already have here a fully working PR that passes all checks, with the only remaining difference of opinion being over what description is used in help and/or man pages-- and I'm fairly confident that the majority of users looking for this solution won't mind whatever outcome you decide on that)
then, I hope you will consider, for your own sakes as well as that of others waiting on this feature: just accept it in 1.7, thereby putting the whole bunch of users looking for this at ease, and buying yourself plenty of time to later decide on all the other issues in the broader comment-related scope
@liquidaty here's what I did:
git fetch origin pull/2548/head:pr2548git checkout pr2548git rebase -i origin/master- fix conflicts
- the merge commit was dropped automatically by the rebase
- reviewed and noticed that one commit was mostly the reversal of the other
git rebase -i origin/master- mark 669cac5 for
edit, save, let the rebase run git checkout -f HEAD^ -- src/main.cgit commit --amendgit checkout 669cac5 -- src/main.cgit commit --fixup HEAD^git rebase --continue
- mark 669cac5 for
git rebase -i origin/masteragain- move the
update --strip-comments description; remove lexer changescommit to right afteradd c-style comment stripping from jq program input, then save and let the rebase run - notice that now there are no diffs between
allow, and strip, comments (#, // or /*)andupdate --strip-comments description; remove lexer changes
- move the
git rebase -i origin/masteragain and mark those to commits to bedroped, let it run- notice that the last commit should be part of the previous commit, so
git rebase -i origin/masteronce again andsquashthe last commit into the previous (I could just have donegit checkout HEAD^ && git commit --amendinstead), but also mark the first commit forrewordand capitalize the first letter of its synopsis, then do the same for the now last commit git remote add liquidaty https://github.com/liquidaty/jqgit remote set-url --push liquidaty [email protected]:liquidaty/jqgit push -f liquidaty HEAD:strip-comments
@liquidaty you'll want to review the new PR HEAD, and you'll want to fetch it as well and compare to your local branch so you can use git diff to see the differences. The only differences should be: a) all the things in jqlang/jq's master that you didn't already have, b) the lexer changes that got undone.
@nicowilliams awesome! super helpful and definitely not something I would have done correctly, so thank you.
I do see some minor changes that were dropped, that I probably should add back in-- they are minor in that they will not show up in the current checks where input is piped directly into jq, but without these changes added back in, they would show up if, like in jq_test.c, the input was first parsed using jv_parse(), which relies on the return code being 0 and not OK.
Just to confirm, if I now just make those changes (touching a few lines to jv_parse.c to return 0 instead of OK) to origin/strip-comments and push them as a new commit, it will retain the history the way you prefer, correct?
Just to confirm, if I now just make those changes (touching a few lines to jv_parse.c to return 0 instead of OK) to origin/strip-comments and push them as a new commit, it will retain the history the way you prefer, correct?
You'll want to apply those changes on top of the commit I pushed. Something like:
git fetch origin pull/2548/head:pr2548git checkout pr2458-
git add ...git commit -m ...git push $your_repos_name pr2548:strip-comments
and then whatever your old branch was in your local clone, you can now abandon it, delete it, or make its HEAD be the same as pr2548's (git branch -f $that_local_branch pr2548).
@nicowilliams good to go now, thanks for your help!
One reason to wait till after 1.7 for this is that I've been fuzzing the current state of the parser, so I'd rather not touch the parser until after 1.7 :)
@nicowilliams I see. Totally off-topic: out of curiosity, what fuzzing tools are you using? am new to the art of fuzzing and looking to learn more about how it's being done in practice. Also have been considering suggesting some simd-related improvements to jq's parser, so might come in handy to understand the fuzzing context in which such suggestions might be considered
@nicowilliams I see. Totally off-topic: out of curiosity, what fuzzing tools are you using? am new to the art of fuzzing and looking to learn more about how it's being done in practice. Also have been considering suggesting some simd-related improvements to
jq's parser, so might come in handy to understand the fuzzing context in which such suggestions might be considered
I'm using AFL, because I know how to use it and I've used it before. But see #2255 and #2688.
Some fuzzers like LibFuzzer and HongFuzz require that you provide a function that takes a pointer to data and a length, and then you link with their object files to produce an executable that fuzzes that function, and when you run it you have to provide an initial test corpus. AFL requires (mostly) that you build with a version of gcc or clang that instruments the code, and then you you run the executable under afl-fuzz with options that specify an initial test corpus.
@nicowilliams thank you, much appreciated
@liquidaty can you rebase this?
If it's ok i can rebase this but not sure if i can push to the PR branch?
If it's ok i can rebase this but not sure if i can push to the PR branch?
What do you mean? Just rebase it on a fork and submit it as a PR, it'll create a "branch" automatically. You don't need to ask for permission. Maybe I'm misunderstanding something, though.