jq Allow, and strip, comments (#, // or /*)

Mar 02 '23 19:03 liquidaty

Possibly this failure has nothing to do with the actual code changes. Here is what the log says:

C:\msys64\usr\bin\bash -lc "pacman-key --verify msys2-keyring-r21.b39fb11-1-any.pkg.tar.xz.sig"
==> Checking msys2-keyring-r21.b39fb11-1-any.pkg.tar.xz.sig...
gpg: no valid OpenPGP data found.
gpg: the signature could not be verified.
Please remember that the signature file (.sig or .asc)
should be the first file given on the command line.
==> ERROR: The signature identified by msys2-keyring-r21.b[39](https://ci.appveyor.com/project/stedolan/jq/builds/46391082/job/ngd77yg97ie5ls1f#L39)fb11-1-any.pkg.tar.xz.sig could not be verified.
Command exited with code 1
cat config.log
cat: config.log: No such file or directory

Mar 02 '23 19:03 liquidaty

Ref: #1785

Jun 03 '23 03:06 itchyny

Oof, I'm not sure that C-style comments in jq code are a good idea, but I understand that if we were to accept them in JSON inputs then maybe we should accept them in jq.

The other thing is that we really do need an option for JSON input strictness.

I find JSON comments to be useless. We couldn't preserve them in jq because... what do we attach them to? I guess maybe we could have has_comment/0/get_comment/0, but there's problems with that: to what values do comments attach? do we want the bloat hit on jv? do we want to output comments attached to values?

So what I do when I need to write comments in JSON texts is this:

{
   "comment0": "This string is a comment",
   "a":"whatever",
   "comment1": "This is another comment, and the only reason I change the comment name is in case there's any processors out there that reject duplicate keys"
}

which does not allow me to have comments inside arrays, but hey, it works.

Jul 05 '23 22:07 nicowilliams

@nicowilliams there is a proposal by @pkoppstein about a strictness option https://github.com/jqlang/jq/issues/2643

Jul 05 '23 22:07 wader

I'm very worried about the conflict against the alternative operator.

Jul 05 '23 22:07 itchyny

I'm very worried about the conflict against the alternative operator.

Yes, we really can't have // comment, that much is clear, but I don't think I want even /* comment */ in the jq language, though I'd be ok with stripping those from input JSON texts.

Jul 05 '23 23:07 nicowilliams

@nicowilliams @wader @itchyny I think it's worth further dividing the target use cases / proposals into sub-scopes, given that the arguments for and against may be different for each.

Stripping (as opposed to parsing) only:

1a. Strip from JSON input. All we are doing is allowing JSON input to have comments, and ignoring those comments. Everything downstream from there is the same as if the input JSON never had comments to begin with. Example: echo '{ "hi": "there" /* this is a comment */ }' | jq '.' -c yields {"hi":"there"}
- Note 1: the concern re alternative operator conflict is N/A here
- Note 2: the original intended scope of this issue #2548 is limited to this sub-scope, and was not intended to extend any further
1b. Strip from comments from jq filters. Same as above but for filters rather than JSON input. Example: echo '{ "hi": "there" }' | jq '. /* this is a comment */ ' -c yields {"hi":"there"}

Parsing comments, which then become part of the parse tree and are available to be included in the output

2a. Parse from JSON input. Example (with extraneous spaces in the JSON input intentionally added): echo '{ "hi": "there" /* this is a comment */ }' | jq '.' -c yields {"hi":"there" /* this is a comment */ }
- Raises several issues, including where the comment attaches to and whether the comment output style should reflect the original style
2b. Parse from jq filter. - Raises several issues, including where the comment attaches to, whether the comment output style should reflect the original style, and C++-style comments potentially conflicting with the alternative operator

Can we all agree that the limited scope of simply stripping comments from JSON input (1a above) is cut & dry? The comments can just be ignored and we are done. It seems to me that all the various concerns that have been raised are only relevant in the other 3 scopes (1b, 2a, 2b)

Jul 05 '23 23:07 liquidaty

Yes, ignoring JSON "comments" on input is fine, but I would like a command-line option for strictness, and we'll probably want strict and non-strict versions of fromjson so that jq code can also benefit from strictness. The next question is: which must come first, options for strictness of comment stripping?

I understand that some users have an urgent need to strip out JSON comments. I also imagine that some users like to validate JSON with jq and don't want jq to become more permissive than it is right now. That's the tension here for me.

All the other concerns have to do with the jq language and/or with how to represent comments internally. We can certainly punt on those and not attempt to add new types of comments to the jq language, and we can also not attempt to preserve comments in JSON parsing. (I suspect we'll never want to preserve comments in JSON parses for the simple reason that it's way too bloating for jv.)

Jul 05 '23 23:07 nicowilliams

Instead of extending the current lenient parser, and instead of just skipping the JavaScript-like comments, I would like to make the parser conforming to well-defined specifications. My preference is to make the default parser to follow RFC 8259, and support --format=json5 following https://spec.json5.org/ (if we really want to support JSON with comments), and maybe --format=yaml for https://yaml.org/spec/, and --format=jq-json for transition from the current lenient parser (if someone really want).

Jul 06 '23 03:07 itchyny

@itchyny I think it is a great idea to target well-defined specifications. Re json5, the immediate impact with of that, with respect to this particular issue, is that # would no longer be a comment delimiter, and both C- and C++-style comments would remain in scope.

Beyond that, json5 would also support single- or multi-line strings strings that are unquoted, or that are enveloped in single quotes, and while I agree that those would be useful features, I would also propose treating them as a separate issue / PR for practical purposes. In particular, looking at jv_parse.c it seems to me that the changes necessary to accommodate these string parsing features will be significantly more invasive, and even more so if parsing yaml (so much so that if I were to take on those tasks-- which I probably cannot do, alas-- I might find it easier to swap out the JSON parser to use a parser generator rather than a handwritten one-- in which case that parser change might itself be a standalone issue/PR).

Jul 06 '23 17:07 liquidaty

Note that JSON5 is neither an ECMA standard nor an IETF standard. Not that jq only implements standards, mind you, just that JSON5 isn't one. That shouldn't stop us adopting optional support for JSON5, but we probably don't want it to be the default. Indeed, I would like to have JSON5 support if for no other reason that object keys that are identifiers can be left unquoted, that and the trailing comma feature -- these going to be nice to have.

BTW, one thing that @stedolan told me long ago is that all these command-line options are kinda ugly, that we should aim to make it possible to handle all of them in jq code. So for example, -s is the same as -n + starting the jq program with [inputs] | , and not using -s is the same as using -n and starting the program with inputs | . Now, to really make that possible for most / all jq command-line options will take a lot of work, perhaps including finishing the co-routines features.

Now, for jq code itself one of the things we'll need is a form of fromjson/tojson for each supported format. And I agree that it'd be very nice to have not just spec conformance, but also YAML support, and maybe even XML. So maybe we need fromjson5/tojson5.

Jul 06 '23 19:07 nicowilliams

Also, I think we need to separate all the parsers for all flavors of JSON. I'd rather have lots of code duplication than to have lots of branches complicating and slowing down parsing. Ditto printing. The price to pay in bloat might be annoying, so we might need ./configure options for disabling various features.

Jul 06 '23 19:07 nicowilliams

OK this PR has been updated and now covers the following:

C- and C++-style comments (but not # comments) are stripped
this behavior is only enabled when a --strip-comments option is enabled
added a couple tests in tests/commentstest

I understand there may not yet be consensus around the use of a --strip-comments option, but figured I'd include it for now and you can decide when you merge if you don't want it, or want to change it (or lmk and I can update the PR).

Jul 07 '23 13:07 liquidaty

@liquidaty ah, do please get rid of the merge commit. Neither @stedolan nor I like them -- we prefer linear history upstream. If you don't know how to get rid of it I could walk you through it, or I could force push to this PR.

Jul 09 '23 15:07 nicowilliams

Will this PR be going to 1.7? We need more discussions on this topic. As a user, it does not mention the formal definition about the comment syntax, so it's not clear which form of comments are stripped; e.x. multiline comments can be nested or not. Dealing with comments in JSON will soon raise a feature request to re-format a JSON file with keeping comments, but I'm afraid such request is hard to implement within the current parser.

Jul 09 '23 15:07 itchyny

Will this PR be going to 1.7? We need more discussions on this topic. As a user, it does not mention the formal definition about the comment syntax, so it's not clear which form of comments are stripped; e.x. multiline comments can be nested or not. Dealing with comments in JSON will soon raise a feature request to re-format a JSON file with keeping comments, but I'm afraid such request is hard to implement within the current parser.

Great question! I think yeah, it's probably best to wait till after 1.7 to integrate this.

Because the jv representation of JSON values really can't represent comments without a lot of bloat, I think we'll never agree to preserve comments.

Jul 09 '23 15:07 nicowilliams

@liquidaty ah, do please get rid of the merge commit [...] If you don't know how to get rid of it I could walk you through it, or I could force push to this PR

@nicowilliams I unfortunately do not know (for sure) how to get rid of it and would be happy to try, but I think at least as likely to make it worse than better. Please feel free to walk me thru it or force push yourself, whatever is more convenient for you

Jul 10 '23 01:07 liquidaty

Will this PR be going to 1.7? We need more discussions on this topic

probably best to wait till after 1.7 to integrate this

@itchyny @nicowilliams: regarding the limited scope of just stripping C- and C++-style behavior is (i.e. 1a from https://github.com/jqlang/jq/pull/2548#issuecomment-1622664788), then given:

how uncontroversial that limited scope is
how many users want this (this has got to be a top-10 most asked-for feature, maybe top 3, maybe top 1: see e.g. #1571, #402, #695, #1398), and
easy it would be to include in 1.7 (since we already have here a fully working PR that passes all checks, with the only remaining difference of opinion being over what description is used in help and/or man pages-- and I'm fairly confident that the majority of users looking for this solution won't mind whatever outcome you decide on that)

then, I hope you will consider, for your own sakes as well as that of others waiting on this feature: just accept it in 1.7, thereby putting the whole bunch of users looking for this at ease, and buying yourself plenty of time to later decide on all the other issues in the broader comment-related scope

Jul 10 '23 02:07 liquidaty

@liquidaty here's what I did:

git fetch origin pull/2548/head:pr2548
git checkout pr2548
git rebase -i origin/master
- fix conflicts
the merge commit was dropped automatically by the rebase
reviewed and noticed that one commit was mostly the reversal of the other
git rebase -i origin/master
- mark 669cac5 for edit, save, let the rebase run
- git checkout -f HEAD^ -- src/main.c
- git commit --amend
- git checkout 669cac5 -- src/main.c
- git commit --fixup HEAD^
- git rebase --continue
git rebase -i origin/master again
- move the update --strip-comments description; remove lexer changes commit to right after add c-style comment stripping from jq program input, then save and let the rebase run
- notice that now there are no diffs between allow, and strip, comments (#, // or /*) and update --strip-comments description; remove lexer changes
git rebase -i origin/master again and mark those to commits to be droped, let it run
notice that the last commit should be part of the previous commit, so git rebase -i origin/master once again and squash the last commit into the previous (I could just have done git checkout HEAD^ && git commit --amend instead), but also mark the first commit for reword and capitalize the first letter of its synopsis, then do the same for the now last commit
git remote add liquidaty https://github.com/liquidaty/jq
git remote set-url --push liquidaty [email protected]:liquidaty/jq
git push -f liquidaty HEAD:strip-comments

Jul 10 '23 02:07 nicowilliams

@liquidaty you'll want to review the new PR HEAD, and you'll want to fetch it as well and compare to your local branch so you can use git diff to see the differences. The only differences should be: a) all the things in jqlang/jq's master that you didn't already have, b) the lexer changes that got undone.

Jul 10 '23 02:07 nicowilliams

@nicowilliams awesome! super helpful and definitely not something I would have done correctly, so thank you.

I do see some minor changes that were dropped, that I probably should add back in-- they are minor in that they will not show up in the current checks where input is piped directly into jq, but without these changes added back in, they would show up if, like in jq_test.c, the input was first parsed using jv_parse(), which relies on the return code being 0 and not OK.

Just to confirm, if I now just make those changes (touching a few lines to jv_parse.c to return 0 instead of OK) to origin/strip-comments and push them as a new commit, it will retain the history the way you prefer, correct?

Jul 10 '23 03:07 liquidaty

Just to confirm, if I now just make those changes (touching a few lines to jv_parse.c to return 0 instead of OK) to origin/strip-comments and push them as a new commit, it will retain the history the way you prefer, correct?

You'll want to apply those changes on top of the commit I pushed. Something like:

git fetch origin pull/2548/head:pr2548
git checkout pr2458
git add ...
git commit -m ...
git push $your_repos_name pr2548:strip-comments

and then whatever your old branch was in your local clone, you can now abandon it, delete it, or make its HEAD be the same as pr2548's (git branch -f $that_local_branch pr2548).

Jul 10 '23 03:07 nicowilliams

@nicowilliams good to go now, thanks for your help!

Jul 10 '23 05:07 liquidaty

One reason to wait till after 1.7 for this is that I've been fuzzing the current state of the parser, so I'd rather not touch the parser until after 1.7 :)

Jul 10 '23 16:07 nicowilliams

@nicowilliams I see. Totally off-topic: out of curiosity, what fuzzing tools are you using? am new to the art of fuzzing and looking to learn more about how it's being done in practice. Also have been considering suggesting some simd-related improvements to jq's parser, so might come in handy to understand the fuzzing context in which such suggestions might be considered

Jul 10 '23 17:07 liquidaty

@nicowilliams I see. Totally off-topic: out of curiosity, what fuzzing tools are you using? am new to the art of fuzzing and looking to learn more about how it's being done in practice. Also have been considering suggesting some simd-related improvements to jq's parser, so might come in handy to understand the fuzzing context in which such suggestions might be considered

I'm using AFL, because I know how to use it and I've used it before. But see #2255 and #2688.

Some fuzzers like LibFuzzer and HongFuzz require that you provide a function that takes a pointer to data and a length, and then you link with their object files to produce an executable that fuzzes that function, and when you run it you have to provide an initial test corpus. AFL requires (mostly) that you build with a version of gcc or clang that instruments the code, and then you you run the executable under afl-fuzz with options that specify an initial test corpus.

Jul 10 '23 17:07 nicowilliams

@nicowilliams thank you, much appreciated

Jul 12 '23 20:07 liquidaty

@liquidaty can you rebase this?

Jan 16 '24 22:01 nicowilliams

If it's ok i can rebase this but not sure if i can push to the PR branch?

Feb 17 '24 08:02 wader

If it's ok i can rebase this but not sure if i can push to the PR branch?

What do you mean? Just rebase it on a fork and submit it as a PR, it'll create a "branch" automatically. You don't need to ask for permission. Maybe I'm misunderstanding something, though.

Feb 24 '24 16:02 God-damnit-all

jq jq copied to clipboard

Allow, and strip, comments (#, // or /*)

jq
jq copied to clipboard