# Add comments to the AST
## Background
Tools such as documentation generators and code formatters also need to care about comments:

- Documentation generators need to associate documentation comments with the right syntactic elements.
- Code formatters need to preserve comments and keep their association with the original syntactic element.

To achieve this, the parser shall also extract comments into the AST.
## Terminology
There are two types of comments: leading and trailing. An element may have several leading comments and at most one trailing comment.
```vhdl
-- A leading comment associated with foo
-- Another leading comment associated with foo
signal foo : natural; -- A trailing comment associated with foo

-- A leading comment associated with bar
signal bar : natural;
```
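As a point of reference, here is a minimal Rust sketch of how this distinction could be carried on an AST node; `Comment` and `CommentedNode` are hypothetical names chosen for illustration, not types from the crate.

```rust
/// A single source comment (hypothetical, simplified representation).
#[derive(Debug, Clone)]
pub struct Comment {
    /// The comment text without the leading "--".
    pub value: String,
    /// Byte offset of the comment in the source file.
    pub start: usize,
}

/// How an AST element could carry its comments: any number of
/// leading comments and at most one trailing comment.
#[derive(Debug, Default)]
pub struct CommentedNode {
    pub leading: Vec<Comment>,
    pub trailing: Option<Comment>,
}
```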
## Unnatural comments
There are plenty of opportunities for placing comments in unnatural places:
```vhdl
if foo
-- Unnatural comment but legal
then
end
-- Unnatural comment but legal
if
-- Unnatural comment but legal
;
```
I think the acceptable solutions for such comments are:

- Moving them with a warning
- Ignoring them with a warning

Not acceptable solutions would be:

- Moving them without a warning
- Ignoring them without a warning
## Implementation concerns
There are several possible solutions:
1. It would be possible to add a token kind for comments; leading and trailing comments must be distinguishable. These tokens would need to be handled differently than other tokens, otherwise the cost would be huge, as all parsing functions would have to handle comments everywhere. For example, if a parsing function today expects a `SemiColon`, it must also be able to ignore a comment. Thus the `TokenStream` must have dedicated methods to optionally expect a leading or a trailing comment.

2. A better approach might be to add a method to the `TokenStream` to extract the leading and trailing comments for the last token. The `TokenStream` could keep track of comments which had not been extracted and were thus ignored. For each of these ignored comments a warning could be emitted after parsing is completed. Thus support for comments can be added incrementally, and comments not handled will be explicitly flagged as ignored.

3. A variant of 2. is to attach the comments as metadata to the `Token` structs. This is very clean as the comments are directly associated with the token. There would still need to be a mechanism to detect comments which were not handled. I think this can be handled by storing a reference in the `Token` struct to an external storage where the comment remains if it was not extracted. Thus ignored comments will remain in the external structure, which can be traversed after parsing to print warnings for ignored comments (see the sketch below).
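To make option 3 more concrete, here is a rough sketch under simplified assumptions; `CommentStore`, the comment-id fields on `Token`, and the other names are invented for illustration and are not the crate's actual types.

```rust
#[derive(Debug, Clone)]
pub struct Comment {
    pub value: String,
}

/// External storage for comments; a comment stays here until a
/// parsing function explicitly extracts it.
#[derive(Default)]
pub struct CommentStore {
    slots: Vec<Option<Comment>>,
}

impl CommentStore {
    /// Store a comment and return its id, to be kept in the Token.
    pub fn add(&mut self, comment: Comment) -> usize {
        self.slots.push(Some(comment));
        self.slots.len() - 1
    }

    /// Extract a comment; it will no longer be reported as ignored.
    pub fn take(&mut self, id: usize) -> Option<Comment> {
        self.slots[id].take()
    }

    /// Comments that were never extracted by the parser.
    pub fn ignored(&self) -> impl Iterator<Item = &Comment> + '_ {
        self.slots.iter().filter_map(|slot| slot.as_ref())
    }
}

/// A token referring to its comments by id rather than owning them.
pub struct Token {
    // ... kind, value, position ...
    pub leading_comment_ids: Vec<usize>,
    pub trailing_comment_id: Option<usize>,
}
```

The key property is that any comment still present in the store after parsing was never extracted, so warnings can be produced by iterating over `ignored()`.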
I had naively started to implement (1) but I like your solution (3) much better. I'll let you know once I've got something worth looking at.
Thinking about how to deal with comments at the end of the file. There won't be a token that it makes sense to associate them with. I'll just create a `FinalComment` token. Since this can only occur in one place, the parsing change will be minor.
I've been thinking about this some more (and trying to take the next step of moving the comments from the tokens to the AST) and I'm not happy with how I have implemented it so far. In the current implementation I assigned all comments to a token. Most of those positions don't make any sense, and we need to gather up all those comments to either move them or emit warnings for them. This gathering gets messy.
I would like to remove the comments from the tokens and implement the following methods on the `Tokenizer`:
```rust
// Gets the leading comments for the next token.
pub fn get_leading_comments(&mut self) -> Vec<Comment>;

// Gets the trailing comment for the previous token.
pub fn get_trailing_comment(&mut self) -> Option<Comment>;

// Returns a vector of all the comments that were parsed since they were
// last handled. Does not include comments that were retrieved using
// get_leading_comments or get_trailing_comment. This is useful if we wish
// to move all these comments to a new location or to generate warning
// messages from them.
pub fn handle_new_comments(&mut self) -> Vec<Comment>;
```
So if we were parsing a `type_definition` (see the sketch after this list):
- Assert that there aren't any already parsed comments that haven't been handled.
- Get the leading comments for the definition.
- Parse the definition.
- Get the trailing comment.
- Get the comments that weren't handled. Either combine them into the other comments, or ignore them. Generate warning messages either way.
- Add the combined comment (probably a `latin_1_str`) into the `type_definition` object.
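A rough sketch of how a parsing function could follow these steps, assuming the three `Tokenizer` methods proposed above; `TypeDeclaration`, the stubbed method bodies, and the string-based warnings are placeholders rather than the crate's real API:

```rust
struct Comment {
    value: String,
}

struct TypeDeclaration {
    // ... identifier, definition, etc. would live here ...
    comment: Option<String>,
}

// Stand-in for the real tokenizer, exposing only the proposed methods.
struct Tokenizer;

impl Tokenizer {
    fn get_leading_comments(&mut self) -> Vec<Comment> { Vec::new() }
    fn get_trailing_comment(&mut self) -> Option<Comment> { None }
    fn handle_new_comments(&mut self) -> Vec<Comment> { Vec::new() }
}

fn parse_type_declaration(
    tokenizer: &mut Tokenizer,
    warnings: &mut Vec<String>,
) -> TypeDeclaration {
    // 1. Assert that no comments from earlier parsing are still pending.
    assert!(tokenizer.handle_new_comments().is_empty());

    // 2. Comments on the lines directly above the declaration.
    let leading = tokenizer.get_leading_comments();

    // 3. Parse the declaration itself (details elided in this sketch).
    let mut decl = TypeDeclaration { comment: None };

    // 4. The comment on the same line, after the closing ';'.
    let trailing = tokenizer.get_trailing_comment();

    // 5. Anything left over sat in an unnatural position: warn about it.
    for ignored in tokenizer.handle_new_comments() {
        warnings.push(format!("ignored comment: {}", ignored.value));
    }

    // 6. Combine the comment text and store it on the AST node.
    let mut text: Vec<String> = leading.into_iter().map(|c| c.value).collect();
    text.extend(trailing.map(|c| c.value));
    if !text.is_empty() {
        decl.comment = Some(text.join("\n"));
    }
    decl
}
```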
The main complexity is dealing with `set_state`, since that should modify the vector of unhandled comments. We could replace `set_state` with `move_backwards`, which just drops any unhandled comments from the list if their position is ahead of the new state. The other main use of `set_state` is in `move_after(token)`, which seems to be exclusively used for moving after a peeked token. This could be replaced with `move_after_peek()`, which just moves after the previously peeked token. The state and unhandled comments could be stored in the `Tokenizer` whenever a peek is performed.
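A minimal sketch of what that could look like, with `move_backwards` and `move_after_peek` as proposed above, and with `State`, `Token`, and the comment positions simplified for illustration:

```rust
#[derive(Clone)]
struct Token; // stand-in for the real token type

#[derive(Clone)]
struct State {
    index: usize, // position in the token stream
}

struct Comment {
    pos: usize, // position of the comment in the token stream
}

struct Tokenizer {
    state: State,
    // State to move to (and the token seen), recorded at the last peek.
    peeked: Option<(State, Token)>,
    unhandled_comments: Vec<Comment>,
}

impl Tokenizer {
    /// Instead of restoring an arbitrary saved state, only move backwards,
    /// dropping unhandled comments whose position is ahead of the new state.
    fn move_backwards(&mut self, new_state: State) {
        self.unhandled_comments
            .retain(|comment| comment.pos <= new_state.index);
        self.state = new_state;
    }

    /// Instead of move_after(token), move after the previously peeked token,
    /// using the state that was stored when the peek was performed.
    fn move_after_peek(&mut self) {
        if let Some((state_after_peek, _token)) = self.peeked.take() {
            self.state = state_after_peek;
        }
    }
}
```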
Does this seem like a reasonable approach?
I agree with the approach.
I've reworked the comment parsing along these lines in #30. I'm still working on improving the tests and getting the comments into the AST.
@benreynwar what has happened with the PR? What is the current status?
It's abandoned. I'm trying to focus my 'fun work' time on ghdl now.