
Hidden Rules, Index, and Field Names

Open tpeacock19 opened this issue 3 years ago • 4 comments

Hello,

I've run into an issue that I cannot seem to find a good workaround for. I'm trying to build a grammar for a Lisp, Emacs Lisp specifically, and when constructing lists I would like to keep the delimiters out of the syntax tree. However, if I do so, there is no reliable way to obtain field name data, as the child index seems to include hidden nodes. I believe this was touched on in https://github.com/tree-sitter/tree-sitter/issues/1526.

Here's a simplified version of my grammar:

const WHITESPACE_CHAR = /[\n\s]/;

const WHITESPACE = token(repeat1(WHITESPACE_CHAR));

module.exports = grammar({
  name: "elisptest",

  rules: {
    source_file: ($) => repeat(choice($._form, WHITESPACE)),

    _form: ($) => choice($.list, $.vector, $.symbol),
    symbol: ($) => /[a-zA-Z-]+/,

    _paren_open: ($) => token.immediate("("),
    _paren_close: ($) => token.immediate(")"),
    _bracket_open: ($) => token.immediate("["),
    _bracket_close: ($) => token.immediate("]"),

    list: ($) =>
      seq(
        field("open", $._paren_open),
        field("value", repeat($._form)),
        field("close", $._paren_close)
      ),
    vector: ($) =>
      seq(
        field("open", $._bracket_open),
        field("one", $.symbol),
        field("two", $.symbol),
        field("three", $.symbol),
        field("close", $._bracket_close)
      ),
  },
});

Issues:

Hidden Index

With the vector grammar above, the string [foo bar world] produces the output below. Because I cannot use the C API to identify the correct index for each node, the fields are off by one. I know that I can just add 1 to the index for vectors/lists, but that does not seem like a real solution.

Syntax tree:
(vector one: (symbol) two: (symbol) three: (symbol))

first node: (symbol)
	field: open
second node: (symbol)
	field: one
third node: (symbol)
	field: two

Repeat rule:

With the list grammar above, (foo bar world star) produces the output below. It does not seem like the parser identifies the repeat rule and gives the subsequent nodes an index. However, the syntax tree does reflect this correctly.

Syntax tree:
(list value: (symbol) value: (symbol) value: (symbol) value: (symbol))

first node: (symbol)
	field: open
second node: (symbol)
	field: value
third node: (symbol)
	field: close
fourth node: (symbol)
	field: (null)

As you can see the syntax tree does reflect the correct values. Ideally for the first issue there would be a named version of ts_node_field_name_for_child similar to ts_node_child & ts_node_named_child. However, I'm not sure of how to solve for the second. Please let me know if there is any better or more appropriate way to build my grammar. Thanks!
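The mismatch reported above can be pictured with a small toy model. This is only a sketch under an assumption: that field names are recorded per slot of the rule's seq (hidden delimiters included), while the lookup index counts only the children that appear in the tree. The arrays and the helper names fieldBySlot and fieldByVisibleChild are hypothetical and are not part of tree-sitter.

```javascript
// Toy model only: plain arrays stand in for the slots of a rule's seq().
// `hidden: true` marks a slot that does not appear in the syntax tree.
const vectorRule = [
  { field: "open", hidden: true }, // $._bracket_open
  { field: "one", hidden: false },
  { field: "two", hidden: false },
  { field: "three", hidden: false },
  { field: "close", hidden: true }, // $._bracket_close
];

// In the list rule the middle slot is the repeat; it expands to many children.
const listRule = [
  { field: "open", hidden: true }, // $._paren_open
  { field: "value", hidden: false }, // repeat($._form)
  { field: "close", hidden: true }, // $._paren_close
];

// Mismatched lookup: the index counts visible children, the table counts all slots.
function fieldBySlot(rule, visibleIndex) {
  const slot = rule[visibleIndex];
  return slot ? slot.field : null;
}

// Aligned lookup: drop hidden slots first, then index.
function fieldByVisibleChild(rule, visibleIndex) {
  const visible = rule.filter((slot) => !slot.hidden);
  const slot = visible[visibleIndex];
  return slot ? slot.field : null;
}

const vectorMismatched = [0, 1, 2].map((i) => fieldBySlot(vectorRule, i));
const vectorAligned = [0, 1, 2].map((i) => fieldByVisibleChild(vectorRule, i));
const listMismatched = [0, 1, 2, 3].map((i) => fieldBySlot(listRule, i));

console.log(vectorMismatched); // [ 'open', 'one', 'two' ]
console.log(vectorAligned); // [ 'one', 'two', 'three' ]
console.log(listMismatched); // [ 'open', 'value', 'close', null ]
```

Under that assumption the mismatched lookup reproduces both listings above, which would explain why adding 1 to the index appears to work for vectors but cannot help with the repeat.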

tpeacock19 avatar Sep 10 '22 23:09 tpeacock19

Unrelated to the question, but are you aware of this? https://github.com/Wilfred/tree-sitter-elisp

sogaiu avatar Sep 11 '22 08:09 sogaiu

Yes, I actually started by forking that and making modifications here: https://github.com/tpeacock19/tree-sitter-elisp. But I believe it has now reached the point where it has changed substantially. The good news is that it can successfully parse all files in the standard Emacs library without error, save for a couple of Ethiopian language files with Unicode issues.

tpeacock19 avatar Sep 11 '22 15:09 tpeacock19

Yeah, it’s a bug that the field shows up at all in that case. The node is hidden, and doesn’t contain any visible nodes, so the field should do nothing, or give an error when generating the parser. Thanks for the report.

You should make the nodes visible if you want to attach a field to them. The usual approach is to make them anonymous.

maxbrunsfeld avatar Sep 11 '22 16:09 maxbrunsfeld

Using this:

    list: ($) =>
      seq(
        field("open", "("),
        field("value", repeat($._form)),
        field("close", ")")
      ),
    vector: ($) =>
      seq(
        "[",
        field("one", $.symbol),
        field("two", $.symbol),
        field("three", $.symbol),
        "]"
      ),

This does work for the vector example, but not for the list with the repeat rule.

Syntax tree:
(list open: (paren_open) value: (symbol) value: (symbol) value: (symbol) value: (symbol) close: (paren_close))

first node: (paren_open)
	field: open
second node: (symbol)
	field: value
third node: (symbol)
	field: close
fourth node: (symbol)
	field: (null)

Syntax tree:
(vector (bracket_open) one: (symbol) two: (symbol) three: (symbol) (bracket_close))

first node: (bracket_open)
	field: (null)
second node: (symbol)
	field: one
third node: (symbol)
	field: two

Additionally, I still would not have a way to identify what index a specific node is. Currently the emacs workaround of determining the index involves counting prior siblings. But ts_node_prev_sibling does not seem to include either anonymous or hidden nodes.
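The sibling-counting workaround mentioned above could look roughly like the sketch below. It assumes node objects expose previousSibling and previousNamedSibling links, as the JavaScript bindings' syntax nodes do; the mock nodes and the helper linkSiblings are hypothetical stand-ins so the example runs without a parser. Whether a particular binding's sibling accessor skips anonymous nodes is worth checking separately; the sketch only shows that the two counts differ.

```javascript
// Index among all siblings: follow previousSibling until it runs out.
function childIndex(node) {
  let index = 0;
  for (let sib = node.previousSibling; sib; sib = sib.previousSibling) {
    index += 1;
  }
  return index;
}

// Index among named siblings only: follow previousNamedSibling instead.
function namedChildIndex(node) {
  let index = 0;
  for (let sib = node.previousNamedSibling; sib; sib = sib.previousNamedSibling) {
    index += 1;
  }
  return index;
}

// Hypothetical mock: wires up sibling links for a flat list of children.
function linkSiblings(specs) {
  let prev = null;
  let prevNamed = null;
  return specs.map((spec) => {
    const node = { ...spec, previousSibling: prev, previousNamedSibling: prevNamed };
    prev = node;
    if (node.isNamed) prevNamed = node;
    return node;
  });
}

// Children of the list "(foo bar)" with anonymous delimiters.
const [lparen, foo, bar, rparen] = linkSiblings([
  { type: "(", isNamed: false },
  { type: "symbol", isNamed: true },
  { type: "symbol", isNamed: true },
  { type: ")", isNamed: false },
]);

console.log(childIndex(bar)); // 2
console.log(namedChildIndex(bar)); // 1
```

The index that a field lookup expects has to match the kind of sibling being counted, so mixing the two accessors gives the same kind of off-by-one as in the listings above.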

tpeacock19 avatar Sep 11 '22 17:09 tpeacock19

I think this problem is the same as #1642. @tpeacock19, can you explain what the Emacs workaround you mentioned is? I suspect this workaround is present in #emacs-tree-sitter/elisp-tree-sitter, but not in emacs-29 and its native treesit package.

ptroja avatar Jan 05 '23 10:01 ptroja

Yeah, it’s a bug that the field shows up at all in that case. The node is hidden, and doesn’t contain any visible nodes, so the field should do nothing, or give an error when generating the parser. Thanks for the report.

You should make the nodes visible if you want to attach a field to them. The usual approach is to make them anonymous.

If I understand everything here correctly, a field attached to a hidden node is inherited by all of that node's child nodes. I think that is good standard behavior, because there is an API for retrieving the subset of nodes that share the same field name.
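As a rough illustration of that reading (a toy model, not tree-sitter's implementation), the sketch below lets a hidden node pass its field down to the visible nodes it wraps and then collects children by field name. The tree literal and the helpers visibleChildren and childrenForField are hypothetical.

```javascript
// Toy children of a list node; the hidden "_repeat" node carries field "value".
const listChildren = [
  { type: "(", hidden: false, field: "open", children: [] },
  {
    type: "_repeat",
    hidden: true,
    field: "value",
    children: [
      { type: "symbol", hidden: false, field: null, children: [] },
      { type: "symbol", hidden: false, field: null, children: [] },
      { type: "symbol", hidden: false, field: null, children: [] },
      { type: "symbol", hidden: false, field: null, children: [] },
    ],
  },
  { type: ")", hidden: false, field: "close", children: [] },
];

// Flatten hidden nodes away; their children inherit the hidden node's field.
function visibleChildren(children, inheritedField = null) {
  const out = [];
  for (const child of children) {
    const field = child.field || inheritedField;
    if (child.hidden) {
      out.push(...visibleChildren(child.children, field));
    } else {
      out.push({ type: child.type, field });
    }
  }
  return out;
}

// Collect every visible child that ended up with the given field name.
function childrenForField(children, name) {
  return visibleChildren(children).filter((child) => child.field === name);
}

const flattened = visibleChildren(listChildren);
const values = childrenForField(listChildren, "value");

console.log(flattened.map((c) => c.field));
console.log(values.length); // 4
```

In this model all four symbols end up with the field value, matching the syntax tree printed earlier in the thread for (foo bar world star).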

In another representation, the grammar from the issue looks like this:

Screenshot from 2023-02-27 05-28-58

It shows that the tree doesn't contain any anonymous nodes at all.

If the grammar is redefined as below:

const WHITESPACE_CHAR = /[\n\s]/;

const WHITESPACE = token(repeat1(WHITESPACE_CHAR));

module.exports = grammar({
  name: "elisptest",

  inline: $ => [
    $._paren_open,
    $._paren_close,
    $._bracket_open,
    $._bracket_close,
  ],

  rules: {
    source_file: $ => repeat(choice($._form, WHITESPACE)),

    _form: $ => choice($.list, $.vector, $.symbol),
    symbol: $ => /[a-zA-Z-]+/,

    _paren_open: $ => alias(token.immediate(/\(/), '('),
    _paren_close: $ => alias(token.immediate(/\)/), ')'),
    _bracket_open: $ => alias(token.immediate(/\[/), '['),
    _bracket_close: $ => alias(token.immediate(/\]/), ']'),

    list: $ =>
      seq(
        field("open", $._paren_open),
        field("value", repeat($._form)),
        field("close", $._paren_close)
      ),
    vector: $ =>
      seq(
        field("open", $._bracket_open),
        field("one", $.symbol),
        field("two", $.symbol),
        field("three", $.symbol),
        field("close", $._bracket_close)
      ),
  },
});

Then it would look like this:

Screenshot from 2023-02-27 05-39-48

The anonymous terminal nodes would be visible and even addressable with fields.

ahlinc avatar Feb 27 '23 03:02 ahlinc