This implements AST comments and updates the formatter to look for comments in the AST instead of a separate comments list.

This adds a new :include_comments to Code.string_to_quoted/2 and Code.string_to_quoted!/2.

Comments are attached to AST nodes in three variants:

:leading_comments: Comments that are located before a node, or in the same line.

Examples:

# This is a leading comment
foo # This one too

:trailing_comments: Comments that are located after a node, and before the end of the parent enclosing the node(or the root document).

Examples:

foo
# This is a trailing comment
# This one too

:inner_comments: Comments that are located inside an empty node.

Examples:

foo do
  # This is an inner comment
end

[
  # This is an inner comment
]

%{
  # This is an inner comment
}

A comment may be considered inner or trailing depending on wether the enclosing node is empty or not. For example, in the following code:

foo do
  # This is an inner comment
end

The comment is considered inner because the do block is empty. However, in the following code:

foo do
  bar
  # This is a trailing comment
end

The comment is considered trailing to bar because the do block is not empty.

In the case no nodes are present in the AST but there are comments, they are inserted into the root :__block__ node as :inner_comments.

Many of the ideas here come from Sourceror, which takes inspiration for the recast parser. In recast, comments are attached to nodes with the leading and trailing boolean flags. If a comment has both attributes set to false, it represents an "inner" comment. For the sake of clarity I decided to define it explicitly for the Elixir AST. An additional difference is that recast stores all the comments(leading, trailing, inner) mixed together under a single comments field. I find that separating them to be cleaner and easier to work with, since you don't have to split or filter the list to get a certain kind of comment from a node.

Different from Sourceror which supports only :leading_comments and :trailing_comments, it introduces :inner_comments to remove the ambiguity between comments before the end keyword, and comments after an expression. This also becomes important when working on the formatter code, as it is very awkward to inject inner comments after the call args were already converted to algebra docs. The Sourceror implementation relies on the formatter behavior to produce ast and comments that the formatter would interpret correctly, meaning it can get away with a sub-optimal comment merging strategy; it only needs the formatter to work. This PR on the contrary tries to attach comments where they make the most sense, and updates the formatter accordingly instead of wrangling with comment line numbers.

The current implementation in this PR adds an extra traversal step after the code is initially parsed. Due to this, and due to the logic to merge comments and AST requiring precise line information in every AST nodes, it requires to also set up the following options:

[
  literal_encoder: &{:ok, {:__block__, &2, [&1]}},
  token_metadata: true
]

TODO

[x] Add public documentation
[x] Handle formatting of binary operator forms and binary operator chains
[ ] Fix cases from the formatter integration tests
[ ] Run the formatter on the Elixir codebase itself to ensure nothing changes

Mar 21 '25 18:03 doorgan

There's a few issues I still need to tackle and a few more cleanups are pending(mostly around the code for formatting interpolations and the repetition of comment docs generation in general), but I think this is generally ready for an initial round of reviews

There are 3 tests currently failing that are related to binary operators:

  1) test case with when and clause comment (Code.Formatter.IntegrationTest)
     test/elixir/code_formatter/integration_test.exs:298
     Assertion with == failed
     code:  assert IO.iodata_to_binary(Code.format_string!(good, opts)) == String.trim(good)
     left:  "case decomposition do\n  <<h, _::binary>> when h != ?< ->\n    decomposition =\n      decomposition\n      |> :binary.split(\" \", [:global])\n      |> Enum.map(&String.to_integer(&1, 16))\n\n    Map.put(dacc, String.to_integer(codepoint, 16), decomposition)\n\n  _ ->\n    dacc\nend"
     right: "case decomposition do\n  # Decomposition\n  <<h, _::binary>> when h != ?< ->\n    decomposition =\n      decomposition\n      |> :binary.split(\" \", [:global])\n      |> Enum.map(&String.to_integer(&1, 16))\n\n    Map.put(dacc, String.to_integer(codepoint, 16), decomposition)\n\n  _ ->\n    dacc\nend"
     stacktrace:
       test/elixir/code_formatter/integration_test.exs:299: (test)

  2) test comment inside operator with when (Code.Formatter.IntegrationTest)
     test/elixir/code_formatter/integration_test.exs:590
     Assertion with == failed
     code:  assert IO.iodata_to_binary(Code.format_string!(bad, opts)) == result
     left:  "raise # Comment\n      function(x) ::\n        any"
     right: "# Comment\nraise function(x) ::\n        any"
     stacktrace:
       test/elixir/code_formatter/integration_test.exs:597: (test)

  3) test first argument in a call without parens with comments (Code.Formatter.IntegrationTest)
     test/elixir/code_formatter/integration_test.exs:468
     Assertion with == failed
     code:  assert IO.iodata_to_binary(Code.format_string!(good, opts)) == String.trim(good)
     left:  "with bar ::\n       :ok\n       | :invalid\n       | :other"
     right: "with bar ::\n       :ok\n       | :invalid\n       # | :unknown\n       | :other"
     stacktrace:
       test/elixir/code_formatter/integration_test.exs:469: (test)

Once those are fixed, running make format should produce the same results as the formatter would before this change. I found a few scenarios where newlines may be added/removed for some trailing comments but that issue should be relatively easy to fix once I figure it out.

Apr 01 '25 02:04 doorgan

Fantastiac!

Btw, number #2, the code you are generating is semantically wrong.

raise # Comment
         function(x) :: any

is equivalent to:

raise()
function(x) :: any

Just in case it was not clear :) for functions without parens, it is probably safer to move it outside

Apr 01 '25 09:04 josevalim

Yep, that case is definitely wrong. One subtlety of dealing with binary_op formatting is that in general we want to hoist the operand comments to be above the operator, but if the operator is nested in a function call(in this case raise) then it's harder to produce the comment above the parent doc without also changing the code that handles the parent node. That's what is causing that broken formatting in 2.

Apr 01 '25 13:04 doorgan

elixir
elixir copied to clipboard

Add comments into the AST

TODO

elixir elixir copied to clipboard

Add comments into the AST

TODO

elixir
elixir copied to clipboard