tree-sitter-nu icon indicating copy to clipboard operation
tree-sitter-nu copied to clipboard

fix: empty _command_list_body and list_body

Open blindFS opened this issue 1 month ago • 6 comments

Make it easy for https://github.com/blindFS/topiary-nushell/issues/39#issuecomment-3568844310

Also fixes a minor issue of empty command_list

Although this PR introduces 6 more conflicts, it hardly hurts lex state counts and WASM comp time, should be safe to merge.

blindFS avatar Nov 25 '25 04:11 blindFS

@mkatychev, sorry to ping you on an irrelevant PR. I'm confused at the failed CI pipeline, do you happen to know what's going on?

blindFS avatar Nov 25 '25 05:11 blindFS

I'll look more into it tomorrow but I've encountered similar messages and it was an issue with the package-lock.json

mkatychev avatar Nov 25 '25 05:11 mkatychev

@blindFS I finally narrowed down the problem, until https://github.com/tree-sitter/node-tree-sitter/pull/258 is merged, node isn't compatible with 0.25 basically since there's breaking ABI changes that haven't been addressed.

So either temporarily drop node support or temporarily downgrade the ABI.

mkatychev avatar Nov 25 '25 18:11 mkatychev

@blindFS I finally narrowed down the problem, until tree-sitter/node-tree-sitter#258 is merged, node isn't compatible with 0.25 basically since there's breaking ABI changes that haven't been addressed.

So either temporarily drop node support or temporarily downgrade the ABI.

Thanks a lot, I'll disable test-node for now.

BTW, it will be super helpful if you review this PR in your spare time.

A lot of conflicts for this minor edge case is kinda silly, but I really run out of ideas here.

The basic idea behind this PR is that, previously, there's a list_body node in empty lists like

[
,
]

which covers the commas and newlines between brackets, it's somewhat annoying for topiary, so I make this kind of empty body contents anonymous in this PR.

The general_body_rule function helps to generate such kind of repeated patterns with separators in between, especially helpful when extra newlines are allowed.

blindFS avatar Nov 26 '25 00:11 blindFS

ping me when you're ready to land it.

fdncred avatar Nov 26 '25 00:11 fdncred

ping me when you're ready to land it.

Sure

blindFS avatar Nov 26 '25 01:11 blindFS

@blindFS I'll post my general comments initially:

  • getting rid of the general_body_rules function would help a lot with indirection
  • adding newlines to extras would remove the need for most of this PR (ignoring regressions in other rules ofc)[^1].

I managed to greatly simplify the $._collection_body node while adding additional support for commas inside of collection types without regressions in https://github.com/nushell/tree-sitter-nu/pull/235:

  • https://github.com/nushell/tree-sitter-nu/blob/32cd5db1b3092636249b2df4a9369201c8b13174/grammar.js#L316-L316

In general handling whitespace explicitly (the choice(punc().comma, /\s/) below) does not need to be done as tree sitter nodes[^0] consider whitespace delimitation between nodes as opt-out [^1]:

  • https://github.com/nushell/tree-sitter-nu/blob/47d4b4f5369c0cae866724758ae88ef07e10e4f1/grammar.js#L1106-L1106

With the whitespace mention above general_body_rules does not feel justified and removing it would greatly simplify a lot of the rulesets (in my opinion).

Ideally a list node should have a simple definition: list: ($) => seq('[', repeat1(choice( ',', $._list_entry, '\n'))),

...this could further be reduced by keeping newlines as part of extras and having their presence define rules[^2] (such as end of function or statement) be explicitly opt in: extras: ($) => [/s/, $.comment],

[^0]: named or anonymous (like ,) [^1]: https://tree-sitter.github.io/tree-sitter/creating-parsers/3-writing-the-grammar.html#using-extras [^2]: token.immediate(rule)

mkatychev avatar Nov 26 '25 19:11 mkatychev

@blindFS I finally narrowed down the problem, until tree-sitter/node-tree-sitter#258 is merged, node isn't compatible with 0.25 basically since there's breaking ABI changes that haven't been addressed. So either temporarily drop node support or temporarily downgrade the ABI.

Thanks a lot, I'll disable test-node for now.

BTW, it will be super helpful if you review this PR in your spare time.

A lot of conflicts for this minor edge case is kinda silly, but I really run out of ideas here.

The basic idea behind this PR is that, previously, there's a list_body node in empty lists like

[
,
]

which covers the commas and newlines between brackets, it's somewhat annoying for topiary, so I make this kind of empty body contents anonymous in this PR.

The general_body_rule function helps to generate such kind of repeated patterns with separators in between, especially helpful when extra newlines are allowed.

In topiary's case I think turning an [ , ] into [] ( through @delete) would be reasonable (I'm assuming that's what you're trying to do) and should not conflict with some changes.

mkatychev avatar Nov 26 '25 19:11 mkatychev

@mkatychev Thanks a lot for the reviewing. The removing of newline in extras #139 does seem to cause a lot of trouble.

As I recall, it mainly solves the issue of multiline binary-op.

That issue might be solvable with some careful tuning of precedence. I'll try it on the old grammar.js before #139 but I feel pessimistic about it. I'll let you know if I run into troubles.

blindFS avatar Nov 27 '25 01:11 blindFS

Please let me know if you have issues, you can generally get away with this kind of thing without resorting to scanner.c so long as you don't maintain state (unlike something like python where indentation is "stateful").

mkatychev avatar Nov 27 '25 01:11 mkatychev

Ideally a list node should have a simple definition: list: ($) => seq('[', repeat1(choice( ',', $._list_entry, '\n'))),

You mean getting rid of list_body? That will be a breaking change affecting many downstream projects. And we should do the same for all xxx_body nodes.

blindFS avatar Nov 27 '25 01:11 blindFS

@mkatychev Oh, I think I remember what was the problem:

There's a fundamental difference between

(foo
bar # bar is the argument of command foo
)

foo
bar # a different command

It seems the only way to differentiate them is to specify newlines in an explicit way.

Two possibilities:

Explicit \n in parenthesized rules to force continuation of the matching of current pipe-element

That's what it looked like before #139, It makes everything with higher precedence difficult, for example:

(
  ls | where $it.name != "foo"
  and $it.name != "bar"
)

Explicit \n in non-parenthesized blocks as a terminator between piplelines

This probably is what you would expect. However I have trouble making this work as expected

1
# comment
| $in

The best bet I'm aware of is moving the $._pipe_separator rule to scanner.c so it can have a higher precedence than the \n in $._terminator, but still the comments in-between are pretty annoying. Any idea?

blindFS avatar Nov 27 '25 03:11 blindFS

I think moving terminator to the external scanner is a simple and feasible way. Closing this for now.

blindFS avatar Nov 28 '25 13:11 blindFS

I see your case, I think the terminator case is most similar to python/justfile indentation:

justfile scanner.c python scanner.c

mkatychev avatar Dec 01 '25 21:12 mkatychev

Hi @mkatychev, I've managed to fix most of the _terminator related problems mentioned above in this branch

Then I found another trickier problem:

To prevent echo 192.168.0.1 from being parsed as a command with 4 numerical arguments. I think I don't have much options but to put explicit spaces as separators between command arguments. (Moving unquoted rules to ex scanner might be another option, but that's huge amount of work, bc all conflicted rules must be moved too)

However the spaces introduced in explicit JS rules always have higher precedence than those of extras, even with token(prec(lowest_num, /[ \t]+/)). (I think tree-sitter should allow finer-grained control over that kind of stuff.)

Then the following code will be parsed incorrectly, as a block with single command of 3 string arguments:

{
  print foo
  print foo
}

BC the spaces before the 2nd print trick the parser to continue with the command_rule before trying to match the newline with _terminator.

blindFS avatar Dec 08 '25 03:12 blindFS

Would you mind elaborating on echo 192.168.0.1? I'm confused as to how full stops are involved

mkatychev avatar Dec 08 '25 15:12 mkatychev

Would you mind elaborating on echo 192.168.0.1? I'm confused as to how full stops are involved

Actually I'm not quite sure how it ended up like that (after conflict resolving), but the parser thought 192, .168, .0 and .1 are 4 number arguments.

blindFS avatar Dec 09 '25 00:12 blindFS