Handle nested scopes better when parsing code for syntax highlighting

Open DanTup opened this issue 3 years ago • 1 comments

@bkonyi I'm not ready to land this yet (I want to do some more manual testing, and compare the results to an npm tool that produces the same), but it ended up much larger than I'd planned so I thought it made sense to raise open for some feedback.

The goal here was to resolve some issues with overlapping/nested tokens. Previously we would produce nested tokens like this:

>  print('the value of \$i is $i');
#  ^^^^^ entity.name.function.dart
#        ^^^^^^^^^^^^^^^^^^^^^^^^ string.interpolated.single.dart
#                      ^^ string.interpolated.single.dart constant.character.escape.dart
#                              ^ string.interpolated.single.dart variable.parameter.dart
#                                 ^ punctuation.terminator.dart

The goal is to produce a simple flat list (which is consistent with another npm tool I use, allowing us to compare the output as an extra test). The new output for the above is:

>  print('the value of \$i is $i');
#  ^^^^^ entity.name.function.dart
#        ^^^^^^^^^^^^^^ string.interpolated.single.dart
#                      ^^ string.interpolated.single.dart constant.character.escape.dart
#                        ^^^^^^ string.interpolated.single.dart
#                              ^ string.interpolated.single.dart variable.parameter.dart
#                               ^ string.interpolated.single.dart
#                                 ^ punctuation.terminator.dart

Previously we would use the stack only to store the string scope names which made it difficult to produce the spans "in between" when popping the nested tokens. So I change this so the stack now drives all of the produced tokens. Instead of producing tokens directly, we always push/pop from the stack on boundaries (although there's an add() helper that just pushes and pops for when we want to produce an exact token). This allows us to produce those "in between" tokens (because any time we push a new scope, we can produce a token for the gap between the last token and the new token).

I added a new class to make it easier to pass around a "location" (an offset position, line number, and column number) because we now need to know the line/col for the ends of tokens to support the functionality above (when we pop a token, we need to produce the next token starting at the previous tokens end).

I also moved the two remaining tests to use the "golden" format for convenience (since the results of one changed slightly with this change because it had nested tokens). I kept the source for them in the test file because our bots run dart analyze and the code is invalid, but the goldens now live in the goldens folder along with the rest.

I'll aim to test this more (and compare to the npm tool) tomorrow. If you can think of any cases this might not handle correctly, let me know and I can add additional tests (first in master so we can get the previous results for them, and then rebase this so we can see the impact of the changes).

Aug 09 '22 16:08 DanTup

I've pushed some additional minor changes, and done some more manual tested in DevTools and comparisons to the NPM tool. With a little cleanup (because the NPM tool doesn't always do some things that we now do, such as avoiding adjacent identical tokens) the results for a some large folders from the analysis server are the same for both.

We're still on a slightly older version of the grammar here though, so I'm going to work on updating to the latest one (which may require some further changes) in a separate PR.

Aug 10 '22 11:08 DanTup