Draft: Add unstable formatter
Opening this for eventual discussion.
This is a draft.
Things that are at least partially done:
- Test cases (although they will be extended over time)
- Release note snippet
- Added to commands.md
- Added ParenthesizedNode in order to be able to reconstruct the AST. The other tests pass, so I assume there will be no fallout for other components.
- Formatting seems to be idempotent, i.e. format(format(code)) gives the same output as format(code) in around 90% of cases
- The formatting attempts to approximate the formatting of muon, but does not match it 100% (more like 80%)
- 94% code coverage for formatter.py
Things that aren't good at the moment:
- Re-adding comments is quite spotty, especially when they are in weird locations.
- The maximum line length is not a hard limit; e.g. with a limit of 80, some lines can reach around 85-87 characters.
Important things before merging:
- Remove all unrelated output, like the list of comments that couldn't be re-added or the warning that the formatter may eat your comments
Important non-code things before merging:
- Cleanup git history
And maybe set up a CI workflow that downloads some big meson files (like from mesa or GNOME-Builder) to check for these things:
- That all comments are re-added
- Idempotency
- No invalid code is generated
It might help to split some of these commits into their own PRs, for instance the parenthesized-node work.
Logically squashing commits would probably help, too.
This pull request introduces 2 alerts when merging 6a0d258c6ee735ffbce4491da3cf9e7afd940419 into 0ddca4d0d0513e9ce1721902393019d4a3903bda - view on LGTM.com
new alerts:
- 1 for Unused import
- 1 for Implicit string concatenation in a list
This pull request introduces 1 alert when merging 1a6b24d8c9ea9bc56a09643ec02853bc13671987 into 0ddca4d0d0513e9ce1721902393019d4a3903bda - view on LGTM.com
new alerts:
- 1 for Unused import
What is missing to help put comments in the right places?
It's complicated; I can't pinpoint it to exactly one thing. Here is roughly how the comment re-adding heuristics work: basically everything is a CodeBlockNode. At the start of each code block, I check whether there are comments directly before it, e.g.
# Foo
# Bar
x = 5
y = 5
Then, for every line, I check whether there is a comment block above it:
x = 5
# FooBar
y = 6
and whether there is an inline comment after a line:
y = 6 # FooBar
And after the code block, I check whether there is a trailing comment block:
y = 6
# Foo
# Bar
<EOF>
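Under the hood, those four checks could look roughly like this. This is a minimal sketch with invented names and data shapes (`reattach_comments`, the `(line, text)` statement tuples), not the actual code in formatter.py:

```python
# Hypothetical sketch of the comment re-adding heuristic described
# above; the function name and data shapes are invented for
# illustration and do not match formatter.py.

def reattach_comments(statements, comments, last_line):
    """statements: list of (line_no, formatted_text) for one code block.
    comments: dict mapping source line number -> comment text.
    last_line: last source line of the block (to find trailing comments)."""
    out = []
    prev = 0
    for line_no, text in statements:
        # Checks 1 and 2: a comment block above the statement
        # (including comments before the start of the code block).
        for ln in range(prev + 1, line_no):
            if ln in comments:
                out.append(comments[ln])
        # Check 3: an inline comment on the same line.
        if line_no in comments:
            text += '  ' + comments[line_no]
        out.append(text)
        prev = line_no
    # Check 4: a trailing comment block after the last statement.
    for ln in range(prev + 1, last_line + 1):
        if ln in comments:
            out.append(comments[ln])
    return out

# Mirrors the first example above: two comment lines, then x = 5, y = 5.
print(reattach_comments([(3, 'x = 5'), (4, 'y = 5')],
                        {1: '# Foo', 2: '# Bar'}, 4))
# -> ['# Foo', '# Bar', 'x = 5', 'y = 5']
```

Anything outside these four positions (e.g. a comment in the middle of an expression) has no slot to land in, which is exactly the failure mode discussed below.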
This works as long as you do normal stuff and don't put comments in weird locations like this one: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/meson.build#L540, where the comment is in the middle of an expression and I currently don't break up many expressions. The comments below it are re-added, probably because visit_ArgumentsCall adds those.
But at least it is the only comment removed in this big meson file.
So basically just more checks for comments is all that is missing?
A bit simplified, but yes, that's correct.
This pull request introduces 1 alert when merging 5522fe3bbfd38d0919c294479c6488d7dcb1642f into 97ec20e90142c229f62a6d20371f44df0b8dd41e - view on LGTM.com
new alerts:
- 1 for Unused import
Does it make sense to add a CI pass that does the following:
- Clone a few big repos (e.g. systemd, mesa), because more code means more coverage of existing constructs
- Reformat all meson files and check that all comments are re-added
- Check that all files are still parseable
- Check that the project still configures (optional)
- Reformat all meson files and assert that they didn't change
As the formatter is platform-independent, it would only have to run on e.g. Linux, and only when formatter.py changes.
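Such a CI pass could be sketched roughly like this. This is a hypothetical helper, with `fmt` standing in for the real formatter function and a naive comment counter; the parseability and configure checks from the list above are omitted:

```python
# Sketch of the proposed CI check for one meson.build's contents:
# format it, verify no comments were dropped, and verify a second
# pass changes nothing. `fmt` is a stand-in for the real formatter.
import re

def count_comments(text):
    # Meson comments run from '#' to end of line; this ignores '#'
    # inside string literals, which is good enough for a smoke test.
    return len(re.findall(r'#', text))

def check_file(text, fmt):
    errors = []
    formatted = fmt(text)
    if count_comments(formatted) < count_comments(text):
        errors.append('comments were lost')
    if fmt(formatted) != formatted:
        errors.append('formatting is not idempotent')
    return errors
```

A CI job would then run `check_file` over every meson.build in the cloned repos and fail if any file reports errors.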
This pull request introduces 1 alert when merging 4be7fae872631990be7d4ee8609014358ccb5c95 into 611bd4e6dbbf4c88321bebe3b35e4bd3d9a0e5a2 - view on LGTM.com
new alerts:
- 1 for Unused import
@annacrombie what are your thoughts here?
I just checked, and muon also deletes comments after an opening paren, e.g.
a = ( # comment
1
)
This is impossible to handle with the strategy both muon and this PR use, because you only have so many nodes to attach comments to, but you can comment anywhere inside a parenthesized expression. For example, this expression:
(a = (1+2))
corresponds roughly to the following AST:
paren
|
assignment
/ \
id paren
|
plus
/ \
1 2
Which gives us 7 nodes to attach comments to.
But it is entirely possible to write:
( # comment 1
# comment 2
# comment 3
a # comment 4
= # comment 5
# comment 6
( # comment 7
1 # comment 8
+ # comment 9
2 # comment 10
) # comment 11
# comment 12
) # comment 13
Which already greatly exceeds the number of slots we have, and this example could technically have an unlimited number of comments inside the () anyway.
muon partially solves this problem for lists by adding another special formatting node, fmt_eol, which keeps newlines inside of [], {}, and () expressions (the standard parser throws them away), plus special code in the list parser to handle them. Extending this handling to the general parser so that newlines could be handled anywhere inside a statement could be difficult, I'm not sure.
Unrelated, but @JCWasmx86, how would you feel about renaming some of the formatting options, e.g. kwa_ml? I find it isn't a very intuitive name and would be happy to change it in muon.
This pull request introduces 2 alerts when merging a3be21867d69ada18f31dde399076d8234a091e4 into 8dfa55005e4f6386e54ca53909b66933e5930858 - view on LGTM.com
new alerts:
- 1 for Unused import
- 1 for Nested loops with same variable
Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine :gear: that powers LGTM.com. For more information, please check out our post on the GitHub blog.
Thanks a lot for your thoughts. With those new commits (I will clean them up as soon as #10972 is reviewed and merged), I was able to collect these numbers:
Project name | Number of comments | Number of lost comments | Percentage of lost comments
---|---|---|---
systemd | 340 | 0 | 0.00%
mesa | 3931 | 1 | 0.025%
gnome-builder | 335 | 0 | 0.00%
gstreamer | 1045 | 6 (*) | 0.57%
gtk | 273 | 1 | 0.36%
glib | 619 | 3 | 0.48%
As you can see, on these huge meson code bases (which presumably exercise a lot of syntax), the error rate is quite low. For gstreamer, 36 of the originally lost comments came from one bug; the remaining 6 are more difficult to solve.
(*) This was originally 42, but I fixed a bug, so only 6 comments are lost now.
muon partially solves this problem for lists, by adding another special formatting node fmt_eol, which keeps newlines inside of [], {}, and () expressions (the standard parser throws them away), and special code in the list parser to handle them. Extending this handling to the general parser so that newlines could be handled anywhere inside a statement could be difficult, I'm not sure.
I have a lot of special cases too, in order to improve the formatting output, but I rely more on heuristics and a bit of praying. Interesting approach, though; I will look into it.
But it is entirely possible to write:
I would say we shouldn't optimize for the 0.001% of insane code. Sure, you could consider it a bug, but in the end you would probably invest a few hours or days to get similar formatting for this one insane case, while that time could have benefited the huge majority. I think those really extreme edge cases aren't important right now.
Unrelated, but @JCWasmx86, how would you feel about renaming some of the formatting options, e.g. kwa_ml? I find it isn't a very intuitive name and would be happy to change it in muon.
Yes, just go ahead, I will follow you :)
And in addition to that, I did a few experiments regarding idempotency. After at most 3 iterations, every project I have tested converged to one formatting, so it would probably be wise to do something like:
lines = ...
for _ in range(MAX_ITERATIONS):
    new_lines = format(lines)
    if new_lines == lines:
        break
    lines = new_lines
Sure, it would cost a bit of performance, but I think it is worth it.
This pull request introduces 1 alert when merging 4fa74cb8905adb320d39002f6f3f7a26ada3d7e8 into 8dfa55005e4f6386e54ca53909b66933e5930858 - view on LGTM.com
new alerts:
- 1 for Unused import
This pull request introduces 1 alert when merging b81eb654a0d6738daa59b6d93ce78b19fedf3f3d into 8ee4660788df678b64bb2986b5f2f8474c127ae2 - view on LGTM.com
new alerts:
- 1 for Unused import
This pull request introduces 1 alert when merging 910c7c2206fc18ca441d07b8d559a5164a5f62da into 8ee4660788df678b64bb2986b5f2f8474c127ae2 - view on LGTM.com
new alerts:
- 1 for Unused import
This pull request introduces 1 alert when merging f0ed332ed8bf2e68d659b5173b8b38e4770967b0 into bfc813200c8c4401985151216390f860246e0d71 - view on LGTM.com
new alerts:
- 1 for Unused import
Codecov Report
Merging #10971 (bfc8132) into master (8ee4660) will not change coverage. The diff coverage is 100.00%.
:exclamation: Current head bfc8132 differs from pull request most recent head df4d72a. Consider uploading reports for the commit df4d72a to get more accurate results
@@ Coverage Diff @@
## master #10971 +/- ##
=======================================
Coverage 68.58% 68.58%
=======================================
Files 412 412
Lines 87861 87861
Branches 20728 20728
=======================================
Hits 60261 60261
Misses 23093 23093
Partials 4507 4507
Impacted Files | Coverage Δ |
---|---|
mesonbuild/interpreter/interpreter.py | 84.58% <100.00%> (ø) |
This pull request introduces 1 alert when merging df4d72a5b6febe0c22107f0c79425de9a192db96 into bfc813200c8c4401985151216390f860246e0d71 - view on LGTM.com
new alerts:
- 1 for Unused import
Added a script for automatically formatting a project, counting the number of lost comments, and measuring how many passes it takes to converge to one formatting. Example results:
systemd has 340, lost 0 during formatting (0.0%)
systemd needed 2 formatting passes to be stable
mesa has 3931, lost 1 during formatting (0.025%)
mesa needed 3 formatting passes to be stable
gnome-builder has 335, lost 0 during formatting (0.0%)
gnome-builder needed 2 formatting passes to be stable
gstreamer has 1338, lost 6 during formatting (0.448%)
gstreamer needed 2 formatting passes to be stable
gtk has 273, lost 1 during formatting (0.366%)
gtk needed 2 formatting passes to be stable
glib has 619, lost 3 during formatting (0.485%)
glib needed 2 formatting passes to be stable
Found 6836 comments, but lost 11 during formatting (0.161%)
This suggests that after at most five passes the vast majority of files won't change.
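The pass-counting part of such a script could look like this sketch. The helper name is invented and `fmt` stands in for the real formatter; note that a file which changes once and then stabilizes reports 2 passes, matching the numbers above (the final pass is the one that confirms no further change):

```python
# Hypothetical sketch: count how many formatting passes it takes
# until the output stops changing. `fmt` stands in for the formatter.

def passes_until_stable(text, fmt, max_passes=10):
    for i in range(1, max_passes + 1):
        new = fmt(text)
        if new == text:
            return i  # pass i produced no further change
        text = new
    raise RuntimeError('did not converge in %d passes' % max_passes)

# Example with a toy "formatter" that strips trailing whitespace:
# one pass changes the text, the second confirms stability.
print(passes_until_stable('x = 5  ', lambda t: t.rstrip()))  # -> 2
```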
I think that a formatter that has the potential to lose content from the source files is a terrible idea. It would be acceptable if it misplaced comments slightly, but losing them is not acceptable IMO. But then, I don't share the excitement for automatic code formatters that some people seem to have.
@dnicolodi I agree with you. But since you can put comments anywhere, as demonstrated above, re-adding all of them in the right place is probably near impossible.
At the moment, only 0.161% of all comments are lost; in other words, 1 in 621 comments is lost.
One thing this formatter could do is maintain a mapping of line_in_input -> line_in_output, e.g. the statement that was on line 54 is now on line 57. If some comment is left over, it could attempt to sprinkle it around: e.g. there was also a comment on line 54, so simply place it at the end of line 57 in the output.
What do you think of that?
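A minimal sketch of that mapping idea, with all names hypothetical (`sprinkle_leftovers`, the dict shapes) and anchoring each leftover comment to the nearest mapped input line at or before it:

```python
# Hypothetical sketch of the line-mapping fallback described above:
# any comment the normal heuristics failed to re-add gets appended
# near where its original statement ended up in the output.

def sprinkle_leftovers(out_lines, line_map, leftovers):
    """out_lines: formatted output as a list of lines.
    line_map: input line number -> index into out_lines.
    leftovers: input line number -> comment text that could not
    be placed by the normal heuristics."""
    for in_line, comment in sorted(leftovers.items()):
        # Anchor on the nearest mapped input line at or before the
        # line the comment originally lived on.
        anchors = [l for l in line_map if l <= in_line]
        if anchors:
            idx = line_map[max(anchors)]
            out_lines[idx] = out_lines[idx].rstrip() + '  ' + comment
        else:
            out_lines.insert(0, comment)  # no anchor: hoist to the top
    return out_lines

# A comment from input line 11 lands on the output line of the
# statement that was on input line 10.
print(sprinkle_leftovers(['x = 5', 'y = 6'],
                         {10: 0, 12: 1}, {11: '# lost'}))
# -> ['x = 5  # lost', 'y = 6']
```

This would misplace some comments, but at least nothing would be silently dropped.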
After thinking a little bit harder about the problem, I think I came up with a decent solution for comment preservation in muon, and maybe you can use it too:
- Any time a comment token is encountered in the token stream, append it to an array of comments attached to the most recently parsed node.
- During formatting, after outputting every node, check if there were comments attached to it and print them all out one by one.
This means that at least no comments are lost (test case).
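The two steps could be sketched like this. All names here are invented for illustration and this is not muon's actual code; a real lexer would also need a place for comments that appear before the first node:

```python
# Sketch of the comment-preservation scheme described above: attach
# each comment token to the most recently parsed node, then flush a
# node's comments right after printing it, so none are ever dropped.

class Node:
    def __init__(self, text):
        self.text = text
        self.comments = []  # comment tokens seen right after this node

def parse(tokens):
    # Step 1: attach each comment token to the most recent node.
    nodes = []
    for kind, value in tokens:
        if kind == 'comment':
            if nodes:
                nodes[-1].comments.append(value)
        else:
            nodes.append(Node(value))
    return nodes

def render(nodes):
    # Step 2: after outputting every node, print its attached comments.
    out = []
    for node in nodes:
        out.append(node.text)
        out.extend(node.comments)
    return out
```

The guarantee is lossless rather than position-perfect: a comment may move next to the node it was attached to, but it always survives the round trip.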
About convergence, I really think you should strive to have the formatter converge after only 1 pass, but I suppose that could be left for future work.
It might also be a good idea to go ahead and format all the meson source files in this repo, and then add some CI to enforce that they stay formatted, but that should probably be a separate PR.
Oh yeah, I also updated the name of kwa_ml to kwargs_force_multiline.
Any time a comment token is encountered in the token stream, append it to an array of comments attached to the most recently parsed node.
That's quite a good idea; I will try it out. But if the changes in the parser are too big, I will probably have to split them out into yet another PR that this PR depends on, as this PR is already an insane amount of changes to review (~1300 LOC).
About convergence, I really think you should strive to have the formatter converge after only 1 pass
As long as we only do multiple passes internally, you can't observe from the outside that multiple passes were needed, so it's not that horrible IMO.
Oh yeah I also updated the name of kwa_ml to kwargs_force_multiline.
Ok, cool, I will change it here too. Thanks for the notice.
Any time a comment token is encountered in the token stream, append it to an array of comments attached to the most recently parsed node.
Regarding this, I experimented a little, but it would require rewriting the majority of the parser: the comments are already gone by the time the parser runs, because the lexer strips them out.
And writing parsers/lexers is far outside my skill set, and not something I can spend my time on learning right now.