# Add comments to the AST
## Background
Tools such as documentation generators and code formatters also need to care about comments:

- Documentation generators need to associate documentation comments with the right syntactic elements.
- Code formatters need to preserve comments and keep their association with the original syntactic element.

To achieve this, the parser shall also extract comments into the AST.
## Terminology
There are two types of comments: leading and trailing. An element may have several leading comments and at most one trailing comment.
```vhdl
-- A leading comment associated with foo
-- Another leading comment associated with foo
signal foo : natural; -- A trailing comment associated with foo

-- A leading comment associated with bar
signal bar : natural;
```
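As a point of reference, here is a minimal Rust sketch of how this distinction could be carried on an AST node; `Comment` and `CommentedNode` are hypothetical names chosen for illustration, not types from the crate.

```rust
/// A single source comment (hypothetical, simplified representation).
#[derive(Debug, Clone)]
pub struct Comment {
    /// The comment text without the leading "--".
    pub value: String,
    /// Byte offset of the comment in the source file.
    pub start: usize,
}

/// How an AST element could carry its comments: any number of
/// leading comments and at most one trailing comment.
#[derive(Debug, Default)]
pub struct CommentedNode {
    pub leading: Vec<Comment>,
    pub trailing: Option<Comment>,
}
```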
## Unnatural comments
There are plenty of opportunities for placing comments in unnatural places:
```vhdl
if foo
-- Unnatural comment but legal
then
end
-- Unnatural comment but legal
if
-- Unnatural comment but legal
;
```
I think the acceptable solutions for such comments are:

- Moving them with a warning
- Ignoring them with a warning

Not acceptable solutions would be:

- Moving them without a warning
- Ignoring them without a warning
## Implementation concerns
There are several possible solutions:
1. It would be possible to add a token kind for comments; leading and trailing comments must be distinguishable. These tokens would need to be handled differently than other tokens, otherwise the cost would be huge, as all parsing functions would have to handle comments everywhere. For example, if a parsing function today expects a `SemiColon`, it must also be able to ignore a comment. Thus the `TokenStream` must have dedicated methods to optionally expect a leading or a trailing comment.

2. A better approach might be to add a method to the `TokenStream` to extract the leading and trailing comments for the last token. The `TokenStream` could keep track of comments which had not been extracted and were thus ignored. For each of these ignored comments a warning could be emitted after parsing is completed. Thus support for comments can be added incrementally, and comments not handled will be explicitly flagged as ignored.

3. A variant of 2. is to attach the comments as metadata to the `Token` structs. This is very clean as the comments are directly associated with the token. There would still need to be a mechanism to detect comments which were not handled. I think this can be handled by storing a reference in the `Token` struct to an external storage where the comment remains if it was not extracted. Thus ignored comments will remain in the external structure, which can be traversed after parsing to print warnings for ignored comments (see the sketch below).
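To make option 3 more concrete, here is a rough sketch under simplified assumptions; `CommentStore`, the comment-id fields on `Token`, and the other names are invented for illustration and are not the crate's actual types.

```rust
#[derive(Debug, Clone)]
pub struct Comment {
    pub value: String,
}

/// External storage for comments; a comment stays here until a
/// parsing function explicitly extracts it.
#[derive(Default)]
pub struct CommentStore {
    slots: Vec<Option<Comment>>,
}

impl CommentStore {
    /// Store a comment and return its id, to be kept in the Token.
    pub fn add(&mut self, comment: Comment) -> usize {
        self.slots.push(Some(comment));
        self.slots.len() - 1
    }

    /// Extract a comment; it will no longer be reported as ignored.
    pub fn take(&mut self, id: usize) -> Option<Comment> {
        self.slots[id].take()
    }

    /// Comments that were never extracted by the parser.
    pub fn ignored(&self) -> impl Iterator<Item = &Comment> + '_ {
        self.slots.iter().filter_map(|slot| slot.as_ref())
    }
}

/// A token referring to its comments by id rather than owning them.
pub struct Token {
    // ... kind, value, position ...
    pub leading_comment_ids: Vec<usize>,
    pub trailing_comment_id: Option<usize>,
}
```

The key property is that any comment still present in the store after parsing was never extracted, so warnings can be produced by iterating over `ignored()`.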
I had naively started to implement (1) but I like your solution (3) much better. I'll let you know once I've got something worth looking at.
Thinking about how to deal with comments at the end of the file. There won't be a token that it makes sense to associate them with. I'll just create a `FinalComment` token. Since this can only occur in one place, the parsing change will be minor.
I've been thinking about this some more (and trying to take the next step of moving the comments from the tokens to the AST) and I'm not happy with how I have implemented it so far. In the current implementation I assigned all comments to a token. Most of those positions don't make any sense, and we need to gather up all those comments to either move them or emit warnings for them. This gathering gets messy.
I would like to remove the comments from the tokens and implement the following methods on the `Tokenizer`:
```rust
// Gets the leading comments for the next token.
pub fn get_leading_comments(&mut self) -> Vec<Comment>;

// Gets the trailing comment for the previous token.
pub fn get_trailing_comment(&mut self) -> Option<Comment>;

// Returns a vector of all the comments that were parsed since they were
// last handled. Does not include comments that were retrieved using
// get_leading_comments or get_trailing_comment. This is useful if we wish
// to move all these comments to a new location or to generate warning
// messages from them.
pub fn handle_new_comments(&mut self) -> Vec<Comment>;
```
So if we were parsing a `type_definition` (see the sketch after this list):
- Assert that there aren't any already parsed comments that haven't been handled.
- Get the leading comments for the definition.
- Parse the definition.
- Get the trailing comment.
- Get the comments that weren't handled. Either combine them into the other comments, or ignore them. Generate warning messages either way.
- Add the combined comment (probably a `latin_1_str`) into the `type_definition` object.
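A rough sketch of how a parsing function could follow these steps, assuming the three `Tokenizer` methods proposed above; `TypeDeclaration`, the stubbed method bodies, and the string-based warnings are placeholders rather than the crate's real API:

```rust
struct Comment {
    value: String,
}

struct TypeDeclaration {
    // ... identifier, definition, etc. would live here ...
    comment: Option<String>,
}

// Stand-in for the real tokenizer, exposing only the proposed methods.
struct Tokenizer;

impl Tokenizer {
    fn get_leading_comments(&mut self) -> Vec<Comment> { Vec::new() }
    fn get_trailing_comment(&mut self) -> Option<Comment> { None }
    fn handle_new_comments(&mut self) -> Vec<Comment> { Vec::new() }
}

fn parse_type_declaration(
    tokenizer: &mut Tokenizer,
    warnings: &mut Vec<String>,
) -> TypeDeclaration {
    // 1. Assert that no comments from earlier parsing are still pending.
    assert!(tokenizer.handle_new_comments().is_empty());

    // 2. Comments on the lines directly above the declaration.
    let leading = tokenizer.get_leading_comments();

    // 3. Parse the declaration itself (details elided in this sketch).
    let mut decl = TypeDeclaration { comment: None };

    // 4. The comment on the same line, after the closing ';'.
    let trailing = tokenizer.get_trailing_comment();

    // 5. Anything left over sat in an unnatural position: warn about it.
    for ignored in tokenizer.handle_new_comments() {
        warnings.push(format!("ignored comment: {}", ignored.value));
    }

    // 6. Combine the comment text and store it on the AST node.
    let mut text: Vec<String> = leading.into_iter().map(|c| c.value).collect();
    text.extend(trailing.map(|c| c.value));
    if !text.is_empty() {
        decl.comment = Some(text.join("\n"));
    }
    decl
}
```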
The main complexity is dealing with `set_state`, since that should modify the vector of unhandled comments. We could replace `set_state` with `move_backwards`, which just drops any unhandled comments from the list if their position is ahead of the new state. The other main use of `set_state` is in `move_after(token)`, which seems to be exclusively used for moving after a peeked token. This could be replaced with `move_after_peek()`, which just moves after the previously peeked token. The state and unhandled comments could be stored in the `Tokenizer` whenever a peek is performed.
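A minimal sketch of what that could look like, with `move_backwards` and `move_after_peek` as proposed above, and with `State`, `Token`, and the comment positions simplified for illustration:

```rust
#[derive(Clone)]
struct Token; // stand-in for the real token type

#[derive(Clone)]
struct State {
    index: usize, // position in the token stream
}

struct Comment {
    pos: usize, // position of the comment in the token stream
}

struct Tokenizer {
    state: State,
    // State to move to (and the token seen), recorded at the last peek.
    peeked: Option<(State, Token)>,
    unhandled_comments: Vec<Comment>,
}

impl Tokenizer {
    /// Instead of restoring an arbitrary saved state, only move backwards,
    /// dropping unhandled comments whose position is ahead of the new state.
    fn move_backwards(&mut self, new_state: State) {
        self.unhandled_comments
            .retain(|comment| comment.pos <= new_state.index);
        self.state = new_state;
    }

    /// Instead of move_after(token), move after the previously peeked token,
    /// using the state that was stored when the peek was performed.
    fn move_after_peek(&mut self) {
        if let Some((state_after_peek, _token)) = self.peeked.take() {
            self.state = state_after_peek;
        }
    }
}
```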
Does this seem like a reasonable approach?
I agree with the approach.
I've reworked the comment parsing along these lines in #30. I'm still working on improving the tests and getting the comments into the AST.
@benreynwar what has happened with the PR? What is the current status?
It's abandoned. I'm trying to focus my 'fun work' time on ghdl now.