[feature] Include the information about full tokens

Open vmarkovtsev opened this issue 7 years ago • 6 comments

Strings, comments, etc. have their Token set to the inner value of the token. E.g. in Python "hello" has Token hello (no quotes). This is all good and logical.

However, we discard the information about the real token: quote characters, comment characters, etc. That information is needed to reproduce the original source code from a UAST. I have two proposals:

  1. Add "FullToken" for those nodes which need it.
  2. Add "TokenPrefix" and "TokenSuffix".

vmarkovtsev avatar Aug 14 '18 09:08 vmarkovtsev

Comments should have the character used, prefix and suffix in the semantic UAST "Comment" object. For strings, unfortunately, the native AST doesn't provide the string type (at least in the Python and Ruby drivers), so this won't be possible for all drivers unless we parse the source code ourselves.

I'll leave this open just in case we find a workable solution in the future.

juanjux avatar Aug 14 '18 09:08 juanjux

The current workaround is simple: I look at the difference between file_contents[start_position.offset:end_position.offset] and Token and record prefixes and suffixes.
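A minimal sketch of that workaround, for illustration only. The helper name `token_affixes` and the `TokenAffixes` type are hypothetical, and the sketch assumes that the offsets index into `file_contents` as a Python `str` with an exclusive end offset; if the driver reports byte offsets, slice the encoded bytes instead.

```python
from typing import NamedTuple, Optional


class TokenAffixes(NamedTuple):
    """Text surrounding the Token inside the node's source span."""
    prefix: str
    suffix: str


def token_affixes(file_contents: str, start_offset: int, end_offset: int,
                  token: str) -> Optional[TokenAffixes]:
    """Recover what was stripped from the Token (hypothetical helper).

    Assumes start_offset/end_offset are character indices into
    file_contents and that end_offset is exclusive.
    """
    full = file_contents[start_offset:end_offset]
    if not token:
        # An empty Token (e.g. an empty string literal) cannot be located.
        return None
    idx = full.find(token)
    if idx < 0:
        # Token is not a verbatim substring, e.g. escapes were decoded.
        return None
    return TokenAffixes(prefix=full[:idx], suffix=full[idx + len(token):])


src = 'x = "hello"\n'
affixes = token_affixes(src, 4, 11, "hello")
print(affixes)
```

The `None` cases matter in practice: when the driver decodes escape sequences, the Token no longer appears verbatim in the source span, so the prefix and suffix cannot be recovered this way.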

vmarkovtsev avatar Aug 14 '18 10:08 vmarkovtsev

Token as a concept won't work in the long run, so I think we should provide a helper that selects the source file content based on the positions of nodes, as @vmarkovtsev mentioned.

For example, what is the token of do ... while? This will get more and more complex once we start working with semantic concepts for classes.

dennwc avatar Aug 14 '18 10:08 dennwc

They work pretty well... for identifiers and literals. For statements and reserved words, as you proved, they're problematic (the same happens with "from x import y" in Python, which is a single node with children).

Maybe we should make a distinction between a token and a representation.

juanjux avatar Aug 14 '18 10:08 juanjux

The token is something that exists in the source code. Egor mentioned a few times that he expects tokens to be valid for all node types, which cannot be the case with the current model.

I would rather go with semantic concepts, so that Comments have text, prefix, etc., and String (literal) has a value and quotes. Tokens can be provided with positional info. Since UAST v2 allows more than 2 positional fields, we can define a few more to represent the start/end positions of different keywords in the statement.
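A rough sketch of what such semantic objects could look like. All class and field names here (`Comment`, `StringLiteral`, `DoWhile`, `keyword_spans`) are hypothetical illustrations, not the actual UAST v2 schema, and the offsets are assumed to be character indices with an exclusive end.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Comment:
    """Hypothetical semantic comment: inner text plus its delimiters."""
    text: str
    prefix: str = ""
    suffix: str = ""

    def source(self) -> str:
        return f"{self.prefix}{self.text}{self.suffix}"


@dataclass
class StringLiteral:
    """Hypothetical semantic string: value plus the quote characters."""
    value: str
    quote: str = '"'

    def source(self) -> str:
        # Simplification: ignores re-escaping of special characters.
        return f"{self.quote}{self.value}{self.quote}"


@dataclass
class DoWhile:
    """Hypothetical statement node with one span per keyword."""
    # keyword name -> (start_offset, end_offset), end exclusive
    keyword_spans: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    def keyword_text(self, file_contents: str, name: str) -> str:
        start, end = self.keyword_spans[name]
        return file_contents[start:end]


src = "do { x++; } while (x < 3);"
w = src.index("while")
node = DoWhile(keyword_spans={"do": (0, 2), "while": (w, w + len("while"))})
print(Comment("note", prefix="# ").source(), node.keyword_text(src, "while"))
```

With this shape, a statement such as do ... while does not need a single Token at all: each keyword is recovered from the source by its own positional field.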

dennwc avatar Aug 14 '18 10:08 dennwc

Even with semantic objects, it would be nice to keep the concept, either as a single unified name or as some kind of field metadata, so that XPath queries don't have to match every semantic object to retrieve a different field in each one. That is what happens now, as @smacker said the other day.

juanjux avatar Aug 14 '18 10:08 juanjux