dune icon indicating copy to clipboard operation
dune copied to clipboard

Add `.messages` support for Menhir

Open nojb opened this issue 8 months ago • 12 comments

An attempt to add support for .messages file to the Menhir stanza, following the suggestion made in https://discuss.ocaml.org/t/dune-wish-list-for-2023/11083/60 by @Gbury (look at the documentation in the manual and the source for the details of which rules are used to regenerate the .messages file).

Note that the best workflow to work with .messages files is a bit of an open question, so the implementation here may not be the best one. It has the upside of being completely automated and using promotion. Ideally actual users of this facility would try it out and report back with any issues or saying that it works well for them. A full test (adapted from https://gitlab.inria.fr/fpottier/menhir/-/blob/master/demos/calc-syntax-errors) is included.

cc @Gbury @MisterDA @fpottier

Fixes #3284

nojb avatar May 04 '25 14:05 nojb

I haven't had a chance to test this, but I have a question: what's the expected behavior when there are multiple modules passed to Menhir?

Currently a dune stanza like (menhir (modules parser1 parser2)) will produce two parsers, parser1.ml(i) and parser2.ml(i). Will (menhir (modules parser1 parser2) (messages parser.messages)) try to compile the .messages file twice, once for each module?

And I assume that multiple (messages) stanzas will need to be split across multiple (menhir) stanzas? E.g.: (menhir (modules parser1) (messages parser1.messages)) (mehhir (modules parser2) (messages parser2.messages)).

johnridesabike avatar May 06 '25 13:05 johnridesabike

Thanks for the feedback @johnridesabike!

what's the expected behavior when there are multiple modules passed to Menhir?

I admit I did not think about this case. I guess that it doesn't make sense to provide a single .messages file for all parsers, so either:

  1. only allow (messages) field when a single parser is being generated, or
  2. we tweak the syntax of the stanza to associate a (messages) field to each individual parser, ie something like:
(menhir
 (modules
  ((name parser1) (messages messages1))
  ((name parser2) (messages messages2))))

Do you have any opinion on the question?

nojb avatar May 06 '25 14:05 nojb

I don't think I have a strong opinion on either of those options. Option 1 feels like it would be simpler. Option 2 feels like it would make more sense if all of the other Menhir options (flags, etc.) were also available per-module as well. With that being said, I'm in favor of whichever option is easiest.

johnridesabike avatar May 06 '25 15:05 johnridesabike

I don't think I have a strong opinion on either of those options. Option 1 feels like it would be simpler. Option 2 feels like it would make more sense if all of the other Menhir options (flags, etc.) were also available per-module as well. With that being said, I'm in favor of whichever option is easiest.

I went with 1.

nojb avatar May 08 '25 09:05 nojb

I had a chance to play with this a little bit. A few quick thoughts:

  1. I think that --compare-errors should also be used somewhere, since it can report redundant sentences. ("Error: these sentences both cause an error in state ...") When I've hand-written Dune rules for messages, that's one of the commands I've used. This PR currently just keeps the redundant sentences.
  2. Merging the generated errors also merges all of the generated comments, which IMO are very verbose and make the file less readable. (Without changing anything else, it turned one of my .messages files from 250 lines to 1100 lines.) I wonder if it's possible to opt-out of printing those?
  3. Diffing and merging errors does provide a nicer workflow for adding missed sentences though, since --compare-errors just raises an error if one is missing.

Besides that, it seems like a solid start for this feature!

johnridesabike avatar May 12 '25 00:05 johnridesabike

I had a chance to play with this a little bit. A few quick thoughts:

Thanks for taking the time to provide feedback!

  1. I think that --compare-errors should also be used somewhere, since it can report redundant sentences. ("Error: these sentences both cause an error in state ...") When I've hand-written Dune rules for messages, that's one of the commands I've used. This PR currently just keeps the redundant sentences.

Yes, this is in point which I was not sure about. I will look into it and report back here.

  1. Merging the generated errors also merges all of the generated comments, which IMO are very verbose and make the file less readable. (Without changing anything else, it turned one of my .messages files from 250 lines to 1100 lines.) I wonder if it's possible to opt-out of printing those?

Sounds reasonable. What do you think about removing them unconditionally? They can always be found in the "new" .messages file (inside the _build directory).

nojb avatar May 18 '25 06:05 nojb

  1. I think that --compare-errors should also be used somewhere, since it can report redundant sentences. ("Error: these sentences both cause an error in state ...") When I've hand-written Dune rules for messages, that's one of the commands I've used. This PR currently just keeps the redundant sentences.

Yes, this is in point which I was not sure about. I will look into it and report back here.

Well spotted. I pushed a commit that now runs --compare-errors before generating the .ml file.

  1. Merging the generated errors also merges all of the generated comments, which IMO are very verbose and make the file less readable. (Without changing anything else, it turned one of my .messages files from 250 lines to 1100 lines.) I wonder if it's possible to opt-out of printing those?

Sounds reasonable. What do you think about removing them unconditionally? They can always be found in the "new" .messages file (inside the _build directory).

I implemented this. It is not optimal because since the rule is sandboxed there is no easy way to get at the file with the comments in case they are needed for any reason. I am still thinking if there is a better way.

If you could give it a spin again and report any issues, that would be great. Thanks again @johnridesabike!

nojb avatar May 18 '25 10:05 nojb

I just gave it another quick test-drive, and the new behavior feels quite good now. It fits very comfortably with my workflow. Some other thoughts:

Maybe the rule could generate a second .messages file, say parser_messages_comments.messages to accompany parser_messages.messages? Then that could exist in the _build directory just in case it's needed.

One more suggestion, and this purely my personal taste: I've always disliked how Menhir prints messages diagnostics to stderr, e.g. Read 100 sample input sentences and 100 error messages.. Even this PR's run.t file shows each build printing that several times. Since Dune otherwise follows the philosophy of not printing anything on success states (unless requested by a flag like --verbose), perhaps this rule could suppress that?

Thanks for working on this!

johnridesabike avatar May 18 '25 20:05 johnridesabike

Maybe the rule could generate a second .messages file, say parser_messages_comments.messages to accompany parser_messages.messages? Then that could exist in the _build directory just in case it's needed.

In fact, the file with the comments is already generated by the rule in the _build directory. The issue is that the whole rule is "sandboxed" by Dune which means that one cannot easily access the file (since the "sandbox" is removed by Dune after the rule runs). This is more of a general issue with Dune though.

Since Dune otherwise follows the philosophy of not printing anything on success states (unless requested by a flag like --verbose), perhaps this rule could suppress that?

The problem is that the way to supress it would be to redirect stderr to /dev/null, but then we will also supress any eventual error message. I'm afraid that the right fix here is for menhir itself not to output these diagnostic messages unless a verbose flag is passed.

nojb avatar May 19 '25 07:05 nojb

I haven't (yet) had time to look at this too deeply, but I have one question/remark: if I understand correctly, the error messages files generated by the new rules are processed to remove the comments generated by menhir ? If that is correct, I'd argue for at the very least an option to keep them, and I'd argue that keeping them would be the reasonable default.

In practice these comments are very informative and helpful in writing correct error messages: they mention the actual parsing rules from the menhir file, and explain in more directly human-understandable terms the error state, what the parser was trying to do, and what were the acceptable tokens/reductions, which is often times more helpful than the sequence of tokens leading to the error (including the last token which triggered the error). Indeed the sequence of tokens used to reach an error state is often misinterpreted, especially (but not only) by beginners / non-experts, who may very well misunderstand all of the ways a particular error state will be reached, and thus tend towards over-specific error messages that do not adequately match all of the ways to reach the error state. In practice, writing good error messages thus need the developer to look at the parsing rules to determine where the input token sequence leads to, identity the point where the parser would be, and see what went wrong / what was expected by the parser, and all of that work is done automatically and explained in the comment.

For all of these reasons, I'd at least want a way to keep these comments, and I'd argue that for non-experts (me included, and I maintain a library of parsers for different languages, so I'm a bit familiar with it) it's better to have these comments than not, and therefore the default should be to have them. People that have the sufficient knowledge, and who care about the size of the file, can then turn these off.

Gbury avatar May 23 '25 19:05 Gbury

Thanks @Gbury for chiming in, and your valuable feedback!

I haven't (yet) had time to look at this too deeply, but I have one question/remark: if I understand correctly, the error messages files generated by the new rules are processed to remove the comments generated by menhir ? If that is correct, I'd argue for at the very least an option to keep them, and I'd argue that keeping them would be the reasonable default.

Got it. The original version of this patch did indeed keep the comments, but after some discussion I currently remove the comments before generating the final .messages file. It would be easy to add an option to the (menhir) stanza to specify whether one wants to keep the comments or not. It would not be ideal from a usability standpoint (need to edit the Dune file if one just wants to peek at the comments but not commit them), but would at least allow to opt-in/-out of keeping the comments.

Does having an option in the (menhir) stanza sound reasonable to you?

nojb avatar May 25 '25 19:05 nojb

I think having an option to keep or remove the comments sounds reasonable. Perhaps messages could use the default behavior (keep the comments) and something like messages_minimal could strip them?

johnridesabike avatar May 26 '25 16:05 johnridesabike

Apologies, I completely forgot this.

Any of the options mentionned by @johnridesabike or @nojb works for me: basically, I mainly want at least a way to keep the comments in the messages file. Additionally, I'd also argue that keeping the comments should be the default, because it's more friendly (and unlikely to really be a problem in most cases).

Gbury avatar Aug 04 '25 12:08 Gbury

Gentle ping; is there anything blocking this change?

Thanks for the ping! This dropped off my radar with the holidays :) As far as I remember the "blocking" point is that it was not completely clear that the workflow implemented was the right one. But since this is tied to the version of the Menhir stanza, perhaps we should go ahead, and we can always bump the version if we need to improve it.

I will try to get back to it soon.

nojb avatar Sep 19 '25 14:09 nojb