sdk icon indicating copy to clipboard operation
sdk copied to clipboard

Parent covers children invariant is not respected

Open m09 opened this issue 6 years ago • 7 comments

With the concept of leading comments in the UAST, the simple invariant of parent covering their children is not respected.

I think this might be non-intuitive to many users.

To reproduce:

// This is a comment

var a = 1;

The initial comment is a child of the expression assignment node (on current bblfsh web for example).

m09 avatar Mar 28 '19 17:03 m09

It depends on the parsing mode you use.

For Native and Annotated, the AST is not normalized, thus it preserves any form that the JS parser emits. According to your comment, it looks like JS parser does not preserve this invariant.

For Semantic mode this invariant usually is not respected either, but for a different reason. All languages have different AST shapes and the (normalized) UAST structure may break the invariant by moving nodes around.

But for your specific case, this issue is caused by https://github.com/bblfsh/javascript-driver/issues/74 - the UAST pipeline does not process comments correctly, and they are preserved in the same places as in Native mode.

dennwc avatar Mar 28 '19 17:03 dennwc

Thanks for the clarifications.

m09 avatar Mar 28 '19 18:03 m09

@dennwc @creachadair I wouldn't really close this one. Although I do understand the implemented logic,

  1. Somebody has to calculate the "correct" node spans and currently, it is us, the ML team.
  2. This behavior is confusing a common user. Tested on all 7 of us, and additionally on Martin's research group in KTH Stockholm. Everybody wants the "correct" positions in Semantic mode.
  3. Nothing prevents us from fixing the positions at the end of the normalization, e.g. by doing an extra pass over all the nodes. Scales linearly.

vmarkovtsev avatar Apr 02 '19 12:04 vmarkovtsev

@vmarkovtsev You have to realize that nodes have a totally different order in Semantic. You just can't have a single universal representation and still have the "correct" node hierarchy in regards to positions for all languages. At least with the tree structures.

It is possible though if we switch to graph representation (https://github.com/bblfsh/sdk/issues/339) because you will be able to jump from Semantic nodes to Native and get positions and the "correct" hierarchy.

@creachadair @bzz As I mentioned in last couple of months, the issues with the current representation (tree structure) start to actually matter. I think we should bump the priority for the transition to the new representation.

dennwc avatar Apr 02 '19 12:04 dennwc

Actually, we are likely to use the Annotated mode in our current analyzers, because we need to reconstruct the original token stream byte-to-byte.

vmarkovtsev avatar Apr 02 '19 13:04 vmarkovtsev

Everybody wants the "correct" positions in Semantic mode.

You mentioned Semantic, so I focused the answer on it.

Actually, we are likely to use the Annotated mode in our current analyzers, because we need to reconstruct the original token stream byte-to-byte.

Right, Annotated will work better for this use case, but again, it has a similar issue - some AST does not provide a "correct" hierarchy. This time we cannot fix it because we cannot modify the structure in this mode by definition.

We really need a way to link those trees (+ token stream) in an arbitrary way. I will dedicate some time this week to outline the proposal in regards to the new representation (graphs). It will solve all those issues.

dennwc avatar Apr 02 '19 13:04 dennwc

Reopening and moving to SDK.

dennwc avatar Apr 02 '19 13:04 dennwc