Hidden Rules, Index, and Field Names
Hello,
I've run into an issue that I cannot seem to find a good workaround for. I'm trying to build a grammar for a Lisp, Emacs Lisp specifically, and when constructing lists I would like to keep the delimiters out of the syntax tree. However, if I do so, there is no reliable way to obtain field name data, as the child index seems to include hidden nodes. I believe this was touched on in https://github.com/tree-sitter/tree-sitter/issues/1526.
Here's a simplified version of my grammar:
const WHITESPACE_CHAR = /[\n\s]/;
const WHITESPACE = token(repeat1(WHITESPACE_CHAR));
module.exports = grammar({
  name: "elisptest",
  rules: {
    source_file: ($) => repeat(choice($._form, WHITESPACE)),
    _form: ($) => choice($.list, $.vector, $.symbol),
    symbol: ($) => /[a-zA-Z-]+/,
    _paren_open: ($) => token.immediate("("),
    _paren_close: ($) => token.immediate(")"),
    _bracket_open: ($) => token.immediate("["),
    _bracket_close: ($) => token.immediate("]"),
    list: ($) =>
      seq(
        field("open", $._paren_open),
        field("value", repeat($._form)),
        field("close", $._paren_close)
      ),
    vector: ($) =>
      seq(
        field("open", $._bracket_open),
        field("one", $.symbol),
        field("two", $.symbol),
        field("three", $.symbol),
        field("close", $._bracket_close)
      ),
  },
});
Issues:
Hidden Index
For the string [foo bar world], the vector grammar above produces the output below. Since I cannot use the C API to identify the correct index for the nodes, the fields are off by one. I know that I can just add 1 to the index for vectors/lists, but that does not seem like a real solution.
Syntax tree:
(vector one: (symbol) two: (symbol) three: (symbol))
first node: (symbol)
field: open
second node: (symbol)
field: one
third node: (symbol)
field: two
Repeat rule:
With the list grammar above, (foo bar world star) produces the output below. The parser does not seem to take the repeat rule into account when assigning field names by index to the subsequent nodes. However, the syntax tree does reflect this correctly.
Syntax tree:
(list value: (symbol) value: (symbol) value: (symbol) value: (symbol))
first node: (symbol)
field: open
second node: (symbol)
field: value
third node: (symbol)
field: close
fourth node: (symbol)
field: (null)
As you can see, the syntax tree does reflect the correct values. Ideally, for the first issue, there would be a named version of ts_node_field_name_for_child, similar to how ts_node_child has ts_node_named_child. However, I'm not sure how to solve the second. Please let me know if there is a better or more appropriate way to build my grammar. Thanks!
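To make the request concrete, here is a minimal sketch of what such a named variant could do. It is written in JavaScript against a plain-object model of a node, not against the real C API. The `children`, `named`, and `field` properties and the `fieldNameForNamedChild` helper are hypothetical names used only for illustration. The idea is to walk every child, skip the unnamed ones, and report the field of the n-th named child.

```javascript
// Hypothetical model of a parsed `[foo bar world]` vector in which the
// brackets occupy child slots but are not named nodes. Not tree-sitter API.
const vector = {
  type: "vector",
  children: [
    { type: "[", named: false, field: "open" },
    { type: "symbol", named: true, field: "one" },
    { type: "symbol", named: true, field: "two" },
    { type: "symbol", named: true, field: "three" },
    { type: "]", named: false, field: "close" },
  ],
};

// Return the field name of the n-th named child, or null if there is none.
function fieldNameForNamedChild(node, namedIndex) {
  let seen = 0; // number of named children visited so far
  for (const child of node.children) {
    if (!child.named) continue; // unnamed children do not count
    if (seen === namedIndex) return child.field ?? null;
    seen += 1;
  }
  return null; // namedIndex is out of range
}

console.log(fieldNameForNamedChild(vector, 0)); // prints "one"
```

With a helper like this, the first symbol maps to "one" rather than "open", without the caller having to add 1 by hand.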
Unrelated to the question, but are you aware of this? https://github.com/Wilfred/tree-sitter-elisp
Yes, I actually started by just forking that and making modifications here: https://github.com/tpeacock19/tree-sitter-elisp. But I believe it has now gotten to the point where it has changed substantially. The good news is that it can successfully parse all files in the standard Emacs library without error, save a couple of Ethiopian language files with Unicode issues.
Yeah, it’s a bug that the field shows up at all in that case. The node is hidden, and doesn’t contain any visible nodes, so the field should do nothing, or give an error when generating the parser. Thanks for the report.
You should make the nodes visible if you want to attach a field to them. The usual approach is to make them anonymous.
Using this:
list: ($) =>
  seq(
    field("open", "("),
    field("value", repeat($._form)),
    field("close", ")")
  ),
vector: ($) =>
  seq(
    "[",
    field("one", $.symbol),
    field("two", $.symbol),
    field("three", $.symbol),
    "]"
  ),
Does work for the vector example but not for the list with the repeat rule.
Syntax tree:
(list open: (paren_open) value: (symbol) value: (symbol) value: (symbol) value: (symbol) close: (paren_close))
first node: (paren_open)
field: open
second node: (symbol)
field: value
third node: (symbol)
field: close
fourth node: (symbol)
field: (null)
Syntax tree:
(vector (bracket_open) one: (symbol) two: (symbol) three: (symbol) (bracket_close))
first node: (bracket_open)
field: (null)
second node: (symbol)
field: one
third node: (symbol)
field: two
Additionally, I still would not have a way to identify the index of a specific node. Currently, the Emacs workaround for determining the index involves counting prior siblings, but ts_node_prev_sibling does not seem to include either anonymous or hidden nodes.
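As a rough illustration of why the sibling walk matters, the sketch below counts prior siblings on a plain-object model of a `(foo bar)` list. Nothing here is tree-sitter API: the `prevSibling` and `named` properties and the `countPriorSiblings` helper are hypothetical, and the model assumes the parentheses are present as unnamed children. The count only matches the child index used for field lookup when the walk visits the same children that index counts.

```javascript
// Hypothetical model, not tree-sitter API: children of a `(foo bar)` list
// in which the parentheses are unnamed nodes.
const children = [
  { type: "(", named: false },
  { type: "symbol", named: true },
  { type: "symbol", named: true },
  { type: ")", named: false },
];

// Link each child to its previous sibling, as a sibling-walking API would.
children.forEach((child, i) => {
  child.prevSibling = i > 0 ? children[i - 1] : null;
});

// Count prior siblings; `namedOnly` mimics a walk that skips unnamed nodes.
function countPriorSiblings(node, namedOnly) {
  let count = 0;
  for (let s = node.prevSibling; s !== null; s = s.prevSibling) {
    if (!namedOnly || s.named) count += 1;
  }
  return count;
}

const secondSymbol = children[2];
console.log(countPriorSiblings(secondSymbol, false)); // prints 2
console.log(countPriorSiblings(secondSymbol, true));  // prints 1
```

In this model the two walks disagree by one for the same node, which is the same off-by-one that shows up in the field output above.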
I think this problem is the same as #1642. @tpeacock19, can you explain what the Emacs workaround you mentioned is? I suspect this workaround is present in #emacs-tree-sitter/elisp-tree-sitter, but not in emacs-29 and its native treesit package.
> Yeah, it’s a bug that the field shows up at all in that case. The node is hidden, and doesn’t contain any visible nodes, so the field should do nothing, or give an error when generating the parser. Thanks for the report.
> You should make the nodes visible if you want to attach a field to them. The usual approach is to make them anonymous.
If I understand everything correctly, fields attached to a hidden node are inherited by all of its child nodes. I think that is a good default behavior, because there is an API for retrieving the subset of nodes that share the same field name.
In another representation, the grammar in the issue looks like this:

You can see that it doesn't contain any anonymous nodes at all.
If the grammar is redefined as below:
const WHITESPACE_CHAR = /[\n\s]/;
const WHITESPACE = token(repeat1(WHITESPACE_CHAR));
module.exports = grammar({
  name: "elisptest",
  inline: $ => [
    $._paren_open,
    $._paren_close,
    $._bracket_open,
    $._bracket_close,
  ],
  rules: {
    source_file: $ => repeat(choice($._form, WHITESPACE)),
    _form: $ => choice($.list, $.vector, $.symbol),
    symbol: $ => /[a-zA-Z-]+/,
    _paren_open: $ => alias(token.immediate(/\(/), '('),
    _paren_close: $ => alias(token.immediate(/\)/), ')'),
    _bracket_open: $ => alias(token.immediate(/\[/), '['),
    _bracket_close: $ => alias(token.immediate(/\]/), ']'),
    list: $ =>
      seq(
        field("open", $._paren_open),
        field("value", repeat($._form)),
        field("close", $._paren_close)
      ),
    vector: $ =>
      seq(
        field("open", $._bracket_open),
        field("one", $.symbol),
        field("two", $.symbol),
        field("three", $.symbol),
        field("close", $._bracket_close)
      ),
  },
});
then it would look like this:

The anonymous terminal nodes would be visible and even addressable with fields.