dhall-haskell Strict comment parsing

trafficstars

This PR adds a way to control if comments should be parsed as whitespace.

I believe many of the issues we are seeing where we lose comments when formatting Dhall (#2172, #2109, and probably others) stem from the fact that at the core of the parser comments are treated as whitespace. We have since added some logic to recover comments from whitespace source spans and keep track of them in the AST with some success. Unfortunately it is very easy to leave out places in the AST where comments could be and silently lose them. So I think the way forward is instead to parse whitespace and comments separately when doing formatting (or anything that pretty prints the original source code). This is not necessary and probably redundant work for the parser when typechecking/normalizing or doing anything that doesn't need to preserve comments for pretty printing (i.e. it is okay to lose comments when normalising Dhall).

My approach is to make what we consider whitespace configurable. When we care about comments we should not consider them to be whitespace. I have set the default for this configuration to parsing whitespace/comments the same as it was before so that everything works as usual. But now dhall has an extra flag --strict-comment-parsing used by format, freeze, lint, and schema which explicitly doesn't consider comments to be whitespace. Currently it basically errors out (while parsing) on any comment, which is not immediately useful. My intention is to slowly move the logic we have for parsing comments via whitespace to explicitly parsing comments instead.

@Profpatsch on twitter:

I’d much rather it errored out than throwing away comments tbh.

I agree with him in that until we are certain that we don't lose any comments, no matter their location, we should provide a way to make dhall format error out loudly to make it obvious that a comment is being lost. I believe this --strict-comment-parsing flag could be a solution to this (at least once I get all the comments we currently do keep to not trigger an error :sweat_smile: ). This will help us keep track of which comment locations we do support in the meantime (both manually using the flag and in test suites).

The main reason I am opening this PR as a draft in the such an early state of progress is because I suspect others have thought about this issue as well and maybe have better ideas on how to tackle it. Are there any issues I missed discussing something like this?

I don't know if this approach is great, I expect there might be some performance regression when parsing since I have added an extra level of ReaderT to the parsing Monad. I would need to run some of the benchmarks to make sure either way. The environment provided as part of ReaderT is rather static (a comment is considered whitespace or it is not), so maybe a different solution would be more appropriate (ImplicitParams, manually threading the config).

What do you think? Should I carry on with this approach? :smile:

Apr 19 '21 21:04 basile-henry

@basile-henry: Yeah, I think this is a good idea. My only suggestion is to make the strict behavior the default, with a flag to opt out and recover the old behavior

Apr 20 '21 02:04 Gabriella439

My only suggestion is to make the strict behavior the default, with a flag to opt out and recover the old behavior

Should the strict behaviour only be used for format, freeze, and similar commands that modify the source code? I am not sure we want strict parsing for evaluation for example, dropping comments when evaluating is not really an issue :sweat_smile:

Apr 25 '21 11:04 basile-henry

@basile-henry: Yeah, I meant to say just for the modification commands like dhall {format,freeze,lint}

Apr 30 '21 07:04 Gabriella439

I believe I have managed to convert all the existing source span based comment parsing to explicit comment parsing. There are lots of places that still need explicit comment parsing but I think this can be done in separate PRs.

I have added a QuickCheck based test (disabled by default) to check more systematically where we don't support comments yet, so hopefully it should help us finding all the comments we are dropping.

It would be great if others could review this PR and try it to see if this new strict formatting makes sense to them. They are a few things I am not sure about that might be worth discussing:

The name of the flag: --loose-comment-parsing is a bit long and easy to forget, any suggestions? Would lax instead of loose be better?
Currently this flag doesn't appear in dhall format --help but only in dhall --help. Should I move it (and also add it to the relevant freeze, lint...)?
This PR changes the pretty printing of comments to try to normalize them somewhat (merging block comments) but this currently affects some of the leading whitespace in comments. Maybe there's better logic I could use there to format where the comments are placed without changing their whitespace content. I think in particular the current logic could be an issue for ASCII arts. Any suggestions about what the logic should be instead?

May 20 '21 15:05 basile-henry

Thanks for the review! 😄

It looks like my latest changes to the comment rendering logic broke the idempotence tests. I will have another go at fixing this, maybe by going back to stripping leading whitespace in comments. But I won't have the time today or tomorrow.

May 22 '21 12:05 basile-henry

It looks like the way I am rendering the MultiComments (by merging them) is incompatible with some of the assumptions around how comments are kept after formatting in dhall-docs.

I will remove this merging behaviour. Hopefully that should be enough to get dhall-docs to keep on working as expected. In a future PR it might make sense to unify the comment parsing logic between dhall the library and dhall-docs. I believe a bit more info will need to be kept in order to keep groups of single line comments as groups, something I haven't considered in this PR.

May 24 '21 17:05 basile-henry

@basile-henry: Another option might be to change dhall-docs to permit those comments. It might actually be a feature to support multiple block comments for documenting something and merging them in the same way as line comments

May 30 '21 17:05 Gabriella439

@Gabriel439

You're right, it might be nice to merge consecutive doc comments as well.

I think the issue is merging doc comments with plain comments. That is probably unexpected behaviour as it adds documentation that was intended as a plain comment. I will need to handle that specifically.

Slow progress, I thought this PR was done 😅 I should have some time tomorrow to finish this new bit.

May 30 '21 18:05 basile-henry

Chiming in to give some cheers :1st_place_medal:

Jun 14 '21 09:06 Profpatsch

Thanks! Cheers appreciated :smile:

I will fix the remaining dhall-docs test failures ASAP so we can merge this soon. :crossed_fingers:

Once this PR goes in, the main difference to the users of dhall format will be that fewer programs will parse by default. This doesn't sound like a great feature (:sweat_smile:) but hopefully this should enable us to go and explicitly parse comments everywhere they can be.

Jun 14 '21 10:06 basile-henry

dhall-haskell dhall-haskell copied to clipboard

Strict comment parsing

dhall-haskell
dhall-haskell copied to clipboard