meson icon indicating copy to clipboard operation
meson copied to clipboard

Draft: Add unstable formatter

Open JCWasmx86 opened this issue 2 years ago • 31 comments

Opening this for eventual discussion.

This is a draft.

Things that are at least done partially:

  • Test cases (Albeit they will be extended over the time)
  • Release note snippet
  • Added to commands.md
  • Add ParenthesizedNode in order to be able to reconstruct the AST. The other tests pass, so I assume there will be no fallout for other components.
  • Formatting seems to be idempotent, e.g. format(format(code)) seems to give the same output as format(code) in around 90%
  • The formatting attempts to approximate the formatting of muon, but does not match 100% (More like 80%)
  • 94% code coverage for formatter.py

Things that aren't good at the moment:

  • Readding comments is especially in weird locations quite spotty.
  • The maximum line length is not a hard limit, e.g. if you have a limit of 80, some lines can sometimes reach around 85-87 chars.

Important things before merging:

  • Remove all unrelated output like the comments that couldn't be readded or the warning that it may eat your comments

Important non-code things before merging:

  • Cleanup git history

And maybe setup a CI-workflow that downloads some big meson files (Like from mesa, GNOME-Builder) to check for these things:

  • That all comments are readded
  • Idempotency
  • No invalid code is generated

JCWasmx86 avatar Oct 31 '22 15:10 JCWasmx86

It might help to split some of these commits into their own PRs. For instance, the parentesized node stuff

tristan957 avatar Oct 31 '22 15:10 tristan957

Probably also logically squashing commits would help too

tristan957 avatar Oct 31 '22 15:10 tristan957

This pull request introduces 2 alerts when merging 6a0d258c6ee735ffbce4491da3cf9e7afd940419 into 0ddca4d0d0513e9ce1721902393019d4a3903bda - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Implicit string concatenation in a list

lgtm-com[bot] avatar Oct 31 '22 15:10 lgtm-com[bot]

This pull request introduces 1 alert when merging 1a6b24d8c9ea9bc56a09643ec02853bc13671987 into 0ddca4d0d0513e9ce1721902393019d4a3903bda - view on LGTM.com

new alerts:

  • 1 for Unused import

lgtm-com[bot] avatar Oct 31 '22 16:10 lgtm-com[bot]

What is missing to help put comments in the right places?

tristan957 avatar Nov 01 '22 16:11 tristan957

It's complicated, I can pinpoint it to exactly one thing. A bit how the comment-readding heuristics work:

Basically everything are CodeBlockNodes. At the start of each CodeBlock, I check, whether there are some comments before, e.g.

# Foo
# Bar
x = 5
y = 5

Then for every line, I check, whether there is a comment block above

x = 5
# FooBar
y = 6

and whether a comment is inline and after a line

y = 6 # FooBar

And after the codeblock I check if there is a trailing code block.

y = 6
# Foo
# Bar
<EOF>

This works as long as you do normal stuff and don't put comments in weird locations like this one: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/meson.build#L540 as the comment is in the middle of an expression and I currently don't break up many expressions. The comments below are readded, because visit_ArgumentsCall is probably adding those

But at least it is the only comment remove in this big meson file

JCWasmx86 avatar Nov 01 '22 16:11 JCWasmx86

So basically just more checks for comments is all that is missing?

tristan957 avatar Nov 01 '22 16:11 tristan957

A bit simplified, but yes, that's correct.

JCWasmx86 avatar Nov 01 '22 16:11 JCWasmx86

This pull request introduces 1 alert when merging 5522fe3bbfd38d0919c294479c6488d7dcb1642f into 97ec20e90142c229f62a6d20371f44df0b8dd41e - view on LGTM.com

new alerts:

  • 1 for Unused import

lgtm-com[bot] avatar Nov 05 '22 20:11 lgtm-com[bot]

Does it make sense to add a CI pass that does the following:

  1. Clone a few big repos, because more code means more coverage of existing constructs (E.g. systemd, mesa)
  2. Reformat all meson files + All comments must be readded
  3. Check that all files are parseable
  4. Check that it configures (Optional)
  5. Reformat all meson files and assert that they didn't change

As the formatter is platform-independent, it just would have to run on e.g. linux and e.g. only if the file formatter.py was changed

JCWasmx86 avatar Nov 12 '22 19:11 JCWasmx86

This pull request introduces 1 alert when merging 4be7fae872631990be7d4ee8609014358ccb5c95 into 611bd4e6dbbf4c88321bebe3b35e4bd3d9a0e5a2 - view on LGTM.com

new alerts:

  • 1 for Unused import

lgtm-com[bot] avatar Nov 12 '22 19:11 lgtm-com[bot]

@annacrombie what are your thoughts here?

tristan957 avatar Nov 13 '22 22:11 tristan957

I just checked and muon also deletes comments after a opening paren, e.g.

a = ( # comment
1
)

This is impossible handle with the strategy both muon and this PR use because you only have so many nodes to attach comments to, but you can comment anywhere inside a parenthesized expression. For example, this expression:

(a = (1+2))

corresponds roughly to the following AST:

  paren
    |
assignment
  /   \
id     paren
         |
        plus        
        / \
       1   2

Which gives us 7 nodes to attach comments to.

But it is entirely possible to write:

( # comment 1
# comment 2
# comment 3
a # comment 4
= # comment 5
# comment 6
( # comment 7
1 # comment 8
+ # comment 9
2 # comment 10
) # comment 11
# comment 12
) # comment 13

Which already greatly exceeds the amount of slots we have, and this example could technically have an unlimited amount of comments inside the () anyway.

muon partially solves this problem for lists, by adding another special formatting node fmt_eol, which keeps newlines inside of [], {}, and () expressions (the standard parser throws them away), and special code in the list parser to handle them. Extending this handling to the general parser so that newlines could be handled anywhere inside a statement could be difficult, I'm not sure.


Unrelated, but @JCWasmx86, how would you feel about renaming some of the formatting options, e.g. kwa_ml? I find it isn't a very intuitive name and would be happy to change it in muon.

annacrombie avatar Nov 15 '22 13:11 annacrombie

This pull request introduces 2 alerts when merging a3be21867d69ada18f31dde399076d8234a091e4 into 8dfa55005e4f6386e54ca53909b66933e5930858 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Nested loops with same variable

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine :gear: that powers LGTM.com. For more information, please check out our post on the GitHub blog.

lgtm-com[bot] avatar Nov 15 '22 18:11 lgtm-com[bot]

Thanks a lot for your thoughts. With those new commits (I will clean it up, as soon as #10972 is reviewed+merged, I was able to find those numbers:

Project name Number of comments Number of lost comments Percentage of lost comments
systemd 340 0 0.00%
mesa 3931 1 0.025%
gnome-builder 335 0 0.00%
gstreamer 1045 6 (*) 0.57%
gtk 273 1 0.36%
glib 619 3 0.48 %

As you can see from these huge meson-code bases (Assuming a lot of syntax is used), we only have a quite low error rate. For gstreamer, 36 of those are from one bug, the other 6 are more difficult to solve

(*) Those were 42, but I fixed a bug, so only 6 comments are lost now.

muon partially solves this problem for lists, by adding another special formatting node fmt_eol, which keeps newlines inside of [], {}, and () expressions (the standard parser throws them away), and special code in the list parser to handle them. Extending this handling to the general parser so that newlines could be handled anywhere inside a statement could be difficult, I'm not sure.

I have a lot of special cases, too in order to improve the formatting output, but I use more some heuristics and a bit of praying. But interesting approach, I will look into it.

But it is entirely possible to write:

I would say we shouldn't optimize for the 0.001% of insane code. Sure you could consider it a bug, but at the end you probably invest a few hours/days to have a similar formatting for this one insane case, while using this time would have benefited a huge majority. I think those really-really-edge cases aren't important right now.

Unrelated, but @JCWasmx86, how would you feel about renaming some of the formatting options, e.g. kwa_ml? I find it isn't a very intuitive name and would be happy to change it in muon.

Yes, just go ahead, I will follow you :)

JCWasmx86 avatar Nov 15 '22 18:11 JCWasmx86

And in addition to that, I did a few experiments regarding idempotency. After maximum 3 iterations every project i have tested did converge to one formatting, so it would be probably wise to do something like:

lines = ...
for i in range(0, MAX_ITERATIONS):
    new_lines = format(lines)
    if equal(lines, new_lines):
       break
    lines = new_lines

Sure it would cost a bit of performance, but I think that it is worth it

JCWasmx86 avatar Nov 15 '22 18:11 JCWasmx86

This pull request introduces 1 alert when merging 4fa74cb8905adb320d39002f6f3f7a26ada3d7e8 into 8dfa55005e4f6386e54ca53909b66933e5930858 - view on LGTM.com

new alerts:

  • 1 for Unused import

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine :gear: that powers LGTM.com. For more information, please check out our post on the GitHub blog.

lgtm-com[bot] avatar Nov 15 '22 19:11 lgtm-com[bot]

This pull request introduces 1 alert when merging b81eb654a0d6738daa59b6d93ce78b19fedf3f3d into 8ee4660788df678b64bb2986b5f2f8474c127ae2 - view on LGTM.com

new alerts:

  • 1 for Unused import

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine :gear: that powers LGTM.com. For more information, please check out our post on the GitHub blog.

lgtm-com[bot] avatar Nov 16 '22 07:11 lgtm-com[bot]

This pull request introduces 1 alert when merging 910c7c2206fc18ca441d07b8d559a5164a5f62da into 8ee4660788df678b64bb2986b5f2f8474c127ae2 - view on LGTM.com

new alerts:

  • 1 for Unused import

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine :gear: that powers LGTM.com. For more information, please check out our post on the GitHub blog.

lgtm-com[bot] avatar Nov 16 '22 08:11 lgtm-com[bot]

This pull request introduces 1 alert when merging f0ed332ed8bf2e68d659b5173b8b38e4770967b0 into bfc813200c8c4401985151216390f860246e0d71 - view on LGTM.com

new alerts:

  • 1 for Unused import

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine :gear: that powers LGTM.com. For more information, please check out our post on the GitHub blog.

lgtm-com[bot] avatar Nov 16 '22 08:11 lgtm-com[bot]

Codecov Report

Merging #10971 (bfc8132) into master (8ee4660) will not change coverage. The diff coverage is 100.00%.

:exclamation: Current head bfc8132 differs from pull request most recent head df4d72a. Consider uploading reports for the commit df4d72a to get more accurate results

@@           Coverage Diff           @@
##           master   #10971   +/-   ##
=======================================
  Coverage   68.58%   68.58%           
=======================================
  Files         412      412           
  Lines       87861    87861           
  Branches    20728    20728           
=======================================
  Hits        60261    60261           
  Misses      23093    23093           
  Partials     4507     4507           
Impacted Files Coverage Δ
mesonbuild/interpreter/interpreter.py 84.58% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov[bot] avatar Nov 16 '22 08:11 codecov[bot]

This pull request introduces 1 alert when merging df4d72a5b6febe0c22107f0c79425de9a192db96 into bfc813200c8c4401985151216390f860246e0d71 - view on LGTM.com

new alerts:

  • 1 for Unused import

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine :gear: that powers LGTM.com. For more information, please check out our post on the GitHub blog.

lgtm-com[bot] avatar Nov 16 '22 09:11 lgtm-com[bot]

Add a script for automatically formatting a project, counting the number of lost comments and how long it takes to converge to one formatting, example results:

systemd has 340, lost 0 during formatting (0.0%)
systemd needed 2 formatting passes to be stable
mesa has 3931, lost 1 during formatting (0.025%)
mesa needed 3 formatting passes to be stable
gnome-builder has 335, lost 0 during formatting (0.0%)
gnome-builder needed 2 formatting passes to be stable
gstreamer has 1338, lost 6 during formatting (0.448%)
gstreamer needed 2 formatting passes to be stable
gtk has 273, lost 1 during formatting (0.366%)
gtk needed 2 formatting passes to be stable
glib has 619, lost 3 during formatting (0.485%)
glib needed 2 formatting passes to be stable
Found 6836 comments, but lost 11 during formatting (0.161%)

This shows that probably after five passes the vast majority of files won't change

JCWasmx86 avatar Nov 16 '22 09:11 JCWasmx86

I think that a formatter that has the potential to loose the content of the source files is a terrible idea. It would be acceptable if it would misplace the comments slightly but loosing them is not acceptable IMO. But then, I don't share the excitement for automatic code formatters that some people seem to be subject to.

dnicolodi avatar Nov 16 '22 09:11 dnicolodi

@dnicolodi I agree with you. But as you can set comments everywhere, as demonstrated above, readding all at the right place would be probably near impossible.

At the moment, only 0.161% of all comments are lost, or in other words 1 comment in 621 comments is lost.

One thing this formatter could do, is to maintain some mapping of line_in_input -> line_in_output, e.g. the statement that was in line 54 is now in line 57. And if there is e.g. some comment left, it could attempt to sprinkle it around. E.g. there was a comment on line 54, too, so simply place it at the end of line 57 in the output.

What do you think of that?

JCWasmx86 avatar Nov 16 '22 09:11 JCWasmx86

After thinking a little bit harder about the problem, I think I came up with a decent solution for comment preservation in muon, and maybe you can use it too:

  1. Any time a comment token is encountered in the token stream, append it to an array of comments attached to the most recently parsed node.
  2. During formatting, after outputting every node, check if there were comments attached to it and print them all out one by one.

This means that at least no comments are lost (test case).


About convergance, I really think you should strive to have the formatter converge after only 1 pass, but I suppose that could be left for future work.

annacrombie avatar Nov 16 '22 17:11 annacrombie

It might also be a good idea to go ahead and format all the meson source files in this repo, and then add some CI to enforce that they stay formatted, but that should probably be a separate PR.

annacrombie avatar Nov 16 '22 17:11 annacrombie

Oh yeah I also updated the name of kwa_ml to kwargs_force_multiline.

annacrombie avatar Nov 16 '22 17:11 annacrombie

Any time a comment token is encountered in the token string, append it to an array of comments attached to the most recently parsed node.

That's a quite good idea. I will try it out. But if the changes in the parser are too big I probably must split it up in yet another PR that is a dependency of this PR then, as this PR is already an insane amount of changes to review (~1300LOC)

About convergance, I really think you should strive to have the formatter converge after only 1 pass

As long as we do multiple passes only internally you can't observe from the outside that we needed multiple passes, so it's not that horrible IMO.

Oh yeah I also updated the name of kwa_ml to kwargs_force_multiline.

Ok, cool I will change it here, too, thanks for the notice

JCWasmx86 avatar Nov 16 '22 18:11 JCWasmx86

Any time a comment token is encountered in the token stream, append it to an array of comments attached to the most recently parsed node.

Regarding this I did experiment a little bit, but that would require rewriting the majority of the parser, as the comments are already lost when the parser is called, as these are stripped out by the lexer.

And doing parsers/lexers is far outside of my skillset and that's something I can't spend my time on learning it.

JCWasmx86 avatar Nov 18 '22 15:11 JCWasmx86