tree-sitter-clojure
tree-sitter-clojure copied to clipboard
Parsing drops symbols after metadata
(.get ^ByteBuffer b)
Produces the following parse tree:
{Node source (0, 0) - (1, 0)}
{Node list_lit (0, 0) - (0, 20)}
{Node ( (0, 0) - (0, 1)} "("
{Node sym_lit (0, 1) - (0, 5)} ".get"
{Node sym_lit (0, 6) - (0, 19)}
{Node meta_lit (0, 6) - (0, 17)}
{Node ^ (0, 6) - (0, 7)} "^"
{Node sym_lit (0, 7) - (0, 17)} "ByteBuffer"
{Node ) (0, 19) - (0, 20)} ")"
Note that b is missing.
It's true that the output is missing "b" and I need to take a closer look, but I don't think the symbol is dropped.
It's been a while but IIRC the grammar is designed so that metadata is treated as part of what it's "on", so in this case, it is treated as "part of" sym_lit:
sym_lit: $ =>
seq(repeat($._metadata_lit),
SYMBOL),
_metadata_lit: $ =>
seq(choice(field('meta', $.meta_lit),
field('old_meta', $.old_meta_lit)),
optional(repeat($._gap))),
IIUC the reported output indicates that the sym_lit in question spans from (0,6) to (0,19):
{Node sym_lit (0, 6) - (0, 19)}
{Node meta_lit (0, 6) - (0, 17)}
{Node ^ (0, 6) - (0, 7)} "^"
{Node sym_lit (0, 7) - (0, 17)} "ByteBuffer"
So I believe that in (.get ^ByteBuffer b), (0,7) - (0,17) corresponds to ByteBuffer and something like (0,18) - (0,19) corresponds to b.
If this analysis is correct, perhaps the lack of "b" in the reported output is a characteristic of tree-sitter itself?
What do you think?
I've taken a closer look (programmatically) and I think as you reported there is no node for SYMBOL.
I suppose it would be nicer to get:
{Node sym_lit (0, 6) - (0, 19)}
{Node meta_lit (0, 6) - (0, 17)}
{Node ^ (0, 6) - (0, 7)} "^"
{Node sym_lit (0, 7) - (0, 17)} "ByteBuffer"
{Node b (0, 18) - (0, 19)} "b"
tree-sitter parse --debug-graph ... gives the following for a file containing (.get ^ByteBuffer b):

I find it odd that "^", "(", and ")" end up as nodes and seemingly "b" does not.
Possibly related: https://github.com/tree-sitter/tree-sitter/issues/1156
Trying out the approach mentioned in https://github.com/Wilfred/tree-sitter-rust/commit/760985e2148a8282af86a554bf2e7d27189c2624, the following change to grammar.js:
hack_to_make_node_for_SYMBOL: $ =>
SYMBOL,
sym_lit: $ =>
seq(repeat($._metadata_lit),
$.hack_to_make_node_for_SYMBOL),
//SYMBOL),
produces parse output like:
(source [0, 0] - [1, 0]
(list_lit [0, 0] - [0, 20]
value: (sym_lit [0, 1] - [0, 5]
(hack_to_make_node_for_SYMBOL [0, 1] - [0, 5]))
value: (sym_lit [0, 6] - [0, 19]
meta: (meta_lit [0, 6] - [0, 17]
value: (sym_lit [0, 7] - [0, 17]
(hack_to_make_node_for_SYMBOL [0, 7] - [0, 17])))
(hack_to_make_node_for_SYMBOL [0, 18] - [0, 19]))))
for the test input (.get ^ByteBuffer b).
I tried a customized version of difftastic that uses (the very new) tree-sitter-clojure-simple on the original issue of:
https://github.com/AvisoNovate/pretty commit 304dfb9c8e9e2e97e6f0e825ce0313a23878941b

GIT_EXTERNAL_DIFF=difft.tscs git show 304dfb9c8e9e2e97e6f0e825ce0313a23878941b --ext-diff
tree-sitter parse --debug-graph showed:

For comparison, part of the original reported output was:

I don't know what command was used but the following produces what looks to be similar output (modulo the side-by-side aspect in the original):
GIT_EXTERNAL_DIFF=difft git show 304dfb9c8e9e2e97e6f0e825ce0313a23878941b --ext-diff
Thanks for taking the time to look into this.
hack_to_make_node_for_SYMBOL: $ =>
SYMBOL,
sym_lit: $ =>
seq(repeat($._metadata_lit),
$.hack_to_make_node_for_SYMBOL),
//SYMBOL),
This seems like a reasonable solution to me, it doesn't even seem too hacky! How do you feel about something like this?
sym_name: $ =>
SYMBOL,
sym_lit: $ =>
seq(repeat($._metadata_lit),
$.sym_name),
@Wilfred Thanks for your thoughts and suggestion.
I made a new branch that has the suggested change. Before that I added a corpus to the repository so now one should be able to see examples of how the parse results change in this commit: https://github.com/sogaiu/tree-sitter-clojure/commit/bfa6ab18a326d8a37407aedde24843ba1feac1f1. I wasn't expecting any surprises and I didn't notice anything odd.
I've also run the parser across a large sample of Clojure source files and the results were similar to the latest grammar.
I also locally tested a version of difftastic that uses the suggested changes. It seemed to produce the desired results.
To my knowledge, nvim-treesitter is the most significant user of tree-sitter-clojure, so I'd like to hear whether @theHamsta has any thoughts about the proposed changes.
I think adding additional nodes to the tree is entirely safe, although I'm not a nvim user.
In the meantime, I've updated difftastic to use your branch :)
Thanks for your thoughts.
I don't know enough about nvim-treesitter to know what all it does in detail (in general and in the specific case of tree-sitter-clojure), but FWIW I have code that depends on the internals of a different grammar. If similar changes had been made, I'm pretty sure the code would need to be adjusted.
Hopefully your safety point is correct though :)
@dannyfreeman @NoahTheDuke (and any other interested parties -- @Wilfred and @TheHamsta may be?)
As discussed starting here, I'd like to move toward incorporating some appropriate fix to address this issue (#21).
There is one aspect of the current candiate fix that I'm not very happy with though.
Unlike list_list, map_lit, vec_lit and other metadata-bearing things, the candidate fix implementation brings about an additional exposed node (sym_name):
sym_lit: $ =>
seq(repeat($._metadata_lit),
$.sym_name),
sym_name: $ =>
SYMBOL,
I'm still not really sure why the current implementation doesn't work:
sym_lit: $ =>
seq(repeat($._metadata_lit),
SYMBOL),
I think it may have something to do with https://github.com/tree-sitter/tree-sitter/issues/1156, but I haven't quite wrapped my head around it [1]. The candidate fix introduces what seems to me to be awkwardness that I would rather avoid if possible.
If anyone has some thoughts on this matter, I'd appreciate hearing them.
[1] May be it is related to the "lost nodes" that @ahlinc mentions here. For reference, SYMBOL is currently defined using token, as follows:
const SYMBOL_HEAD =
/[^\f\n\r\t ()\[\]{}"@~^;`\\,:#'0-9\u000B\u001C\u001D\u001E\u001F\u2028\u2029\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2008\u2009\u200a\u205f\u3000]/;
const SYMBOL_BODY =
choice(SYMBOL_HEAD,
/[:#'0-9]/);
// XXX: no attempt is made to enforce certain complex things, e.g.
//
// Symbols beginning or ending with ':' are reserved by Clojure.
// A symbol can contain one or more non-repeating ':'s
const SYMBOL =
token(seq(SYMBOL_HEAD,
repeat(SYMBOL_BODY)));
Forgive my ignorance, but without this construct of a sym_name token, how would I write a tree-sitter query against the following:
(.get ^TypeHint b)
and only select the token b, and NOT the metadata? A value: field could be used, but that isn't much different, just semantics. A field would also require an extra node.
In a way, sym_lit is a "container" token that can hold metadata and a symbol, like how a list can hold metadata and a symbol(s). Isn't this the only way tree sitter will allow the sym_lit to hold name and metadata token, and allow them to be selected independently?
@dannyfreeman Thanks for your thoughts.
I don't have a good answer to the question about expressing a query, but I thought it might be possible to use a wildcard node possibly in combination with anchoring. The following doesn't work [1] AFAIU, but to get a more concrete sense of what I had in mind:
(sym_lit
value: _ @name .)
This is assuming one used a field like you mentioned. (On a side note, I'm not so sure it isn't that different to use a field vs defining a new node, but we can come back to that later if necessary.)
Suppose though it were possible to express an appropriate query (or it became possible to), the situation still seems problematic to me because a node corresponding to b is inaccessible / missing when I make the grammar contain something like this:
sym_lit: $ =>
seq(repeat($._metadata_lit),
field('value', SYMBOL)),
When I call tree-sitter query with a query file with content:
_ @any
and a file (say issue.clj) with content:
(.get ^TypeHint b)
I get:
issue.clj
pattern: 0
capture: fun, start: (0, 0), end: (1, 0)
pattern: 0
capture: fun, start: (0, 0), text: "(.get ^TypeHint b)"
pattern: 0
capture: fun, start: (0, 0), text: "("
pattern: 0
capture: fun, start: (0, 1), text: ".get"
pattern: 0
capture: fun, start: (0, 6), text: "^TypeHint b"
pattern: 0
capture: fun, start: (0, 6), text: "^TypeHint"
pattern: 0
capture: fun, start: (0, 6), text: "^"
pattern: 0
capture: fun, start: (0, 7), text: "TypeHint"
pattern: 0
capture: fun, start: (0, 17), text: ")"
I don't see any output corresponding to b by itself.
I think this is essentially the same situation described above and it seems related to https://github.com/tree-sitter/tree-sitter/issues/1156.
[1] There may be other reasons the query is faulty, but one issue might be:
The restrictions placed on a pattern by an anchor operator ignore anonymous nodes.
I think I am starting to understand more now, thanks for your patience.
I have a source file with a single sym_lit node in it
^ByteBuffer b
on the master branch, this query matches
(sym_lit . meta: (_) @metadata .) @entire-symbol-with-meta
which says match a sym_lit with only a metadata node in it. Clearly that is wrong, there is more than a metadata node within the top level sym_lit. Valid Clojure could not produce a node like this which I think technically says "metadata attached to an empty symbol".
Instead this query should match
(sym_lit . meta: (_) @metadata . _ @symbol-name .) @entire-symbol-with-meta
but nothing is matched, because tree sitter has completely discarded the last anonymous token.
This also helps me answer my question from above
how would I write a tree-sitter query against the following:
(.get ^TypeHint b)and only select the token b, and NOT the metadata?
A fairly generic answer is (not accounting for old_meta:)
(sym_lit meta: (_)? . _ @symbol-name .)
This matches symbol names, whether or not they have metadata on them. And would work with the current grammar on master if tree-sitter was behaving correctly. Unfortunately it is not.
Going back to your above post
The candidate fix introduces what seems to me to be awkwardness that I would rather avoid if possible.
I think I would argue now that it is less awkward than what is on the master branch. It now gives a name to this anonymous node and makes it easier to capture without anchoring or skipping meta fields, so that this:
(sym_lit meta: (_)? . _ @symbol-name .)
// or
(sym_lit _ @symbol-name .)
can be replaced with
(sym_lit (sym_name) @symbol-name)
while BOTH queries remain valid. Although it would break any queries like this
(sym_lit . meta: (_) @metadata .)
But that is technically invalid even if the tree sitter bug you referenced produces nodes that match it. It's not possible to express an empty symbol in clojure with metadata attached to it.
On the topic of fields:
I did some poking around with the grammar and could not figure out why adding a field like you demonstrated also did not work.
sym_lit: $ =>
seq(repeat($._metadata_lit),
field('value', SYMBOL)),
But to address
(On a side note, I'm not so sure it isn't that different to use a field vs defining a new node, but we can come back to that later if necessary.)
All I meant by that was both a field or a named token like sym_name would in theory allow a user to select the symbol name only without the metadata. Seeing that trying to add a value field also fails though means it's not much of an immediate option.
@dannyfreeman Thanks for the detailed response and investigation.
Regarding the candidate fix:
I think I would argue now that it is less awkward than what is on the master branch.
Ok, I'm starting to feel more this way thanks to the discussion.
You also mentioned earlier:
In a way, sym_lit is a "container" token that can hold metadata and a symbol, like how a list can hold metadata and a symbol(s).
I think things like list_lit happen to not have a name-like bit to them and all of their "pieces" can be referred to either by name or literally (e.g. "(") so a similar issue of not being able to get at "pieces" doesn't arise [1].
Sort of related...it turns out a bit of regex_lit's structure is similar to sym_lit's:
https://github.com/sogaiu/tree-sitter-clojure/blob/8c23e0ec078af461ccad43fffbbfc204aa6bc238/grammar.js#L353-L355
https://github.com/sogaiu/tree-sitter-clojure/blob/8c23e0ec078af461ccad43fffbbfc204aa6bc238/grammar.js#L280-L282
Note that one ends in STRING while the other in SYMBOL, and both STRING and SYMBOL have definitions that start with token (cf. point 1 of https://github.com/tree-sitter/tree-sitter/issues/1156#issue-916833027).
I started to wonder if regex_lit might be better off as:
regex_lit: $ =>
seq(field('marker', "#"),
$.regex_val),
regex_val: $ =>
STRING,
but it seems in this case there may not be much be difference between capturing the entire regex_lit vs capturing the contained regex_val. AFAICT, the former just has a leading # while the latter doesn't.
IIUC, regular expression literals in Clojure don't allow anything between the leading # and the first double quote (unlike some other constructs like reader conditionals where #?(:clj 1 :cljs 2) and #? (:clj 1 :cljs 2) don't have a difference in meaning -- note that the latter has a space between the ? and the (). Since regular expression literals seem to always start with # and then are immediately followed by a string literal, they don't have the same sort of variable structure that symbols, lists and the like that can have metadata on them do.
But perhaps it would be nicer to make regex_lit similar to sym_lit. Feels more consistent / uniform?
WDYT?
[1] ahlinc had a suggestion about distinguishing 3 types of nodes here. By that terminology I think list_lit can only have "named" and "named anonymous" nodes (and thus no "unnamed anonymous") while sym_lit always has an "unnamed anonymous" node -- the name (by the definition in the current master branch any way).
He also mentioned in the same comment:
may be it would be a good feature that tree-sitter would allow to assign field names to the unnamed anonymous nodes so they would be addressable in such way even in the tree built even with the regular, not advanced, API.
That sounds like it could have been applied here...
Hi, I recently ran across some potentially related tree-sitter Clojure form extraction issues in the presence of the more general metadata reader form ^{...}. I wanted to mention it here in case it helps come up with a more comprehensive fix for this issue
^{:some-meta true}
{:a 1} ;; <-- attempting to get the node at the cursor seems to fail
@philomates
That seems like a problem with that repository's usage of tree-sitter-clojure, not tree-sitter-clojure itself.
That code returns a parse tree like the following:
(source [0, 0] - [1, 0]
(map_lit [0, 0] - [0, 25]
meta: (meta_lit [0, 0] - [0, 18]
value: (map_lit [0, 1] - [0, 18]
value: (kwd_lit [0, 2] - [0, 12])
value: (bool_lit [0, 13] - [0, 17])))
value: (kwd_lit [0, 20] - [0, 22])
value: (num_lit [0, 23] - [0, 24])))
Which contains the map literal with the metadata and it's k/v pairs. Whatever the code using tree-sitter is doing, it's not selecting the map_lit node form that point.
@sogaiu, I noticed the comment in the issue branch for regex_lit. I spent most of yesterday working on the grammar and added this in.
regex_lit: $ =>
seq(field('marker', "#"),
$.str_lit),
So I think that is another option for how regex lit could be handled. Your proposal and leaving it would also be fine from my perspective. Leaving it as is would make it difficult to capture the # separate from the string literal part is all. It is nice to have but not as important as the issue with the symbols.
On the topic of the symbols, I have some POC code that would resolve this issue and https://github.com/sogaiu/tree-sitter-clojure/issues/28 (for symbols, not just keywords) at the same time. By identifying distinct namespace and name parts for symbols, the parser picks up all of the symbol even when metadata is present.
Here's a parsed example
tree-sitter generate && tree-sitter parse <(echo ^ByteBuffer namespaced/symbol)
(source [0, 0] - [1, 0]
(sym_lit [0, 0] - [0, 29]
meta: (meta_lit [0, 0] - [0, 11]
value: (sym_lit [0, 1] - [0, 11]
name: (sym_name [0, 1] - [0, 11])))
namespace: (sym_ns [0, 12] - [0, 22])
delimiter: (delimiter [0, 22] - [0, 23])
name: (sym_name [0, 23] - [0, 29])))
With this parse tree the metadata an entire symbol can be selected independently
(sym_lit meta: _? @md _* @name)
// or the symbol can be captured in parts
(sym_lit meta: _? @md namespace: _ @ns name: _ @name)
The code that generates it
// Now excludes the / character
const SYMBOL_HEAD = /[^\f\n\r\t \/()\[\]{}"@~^;`\\,:#'0-9\u000B\u001C\u001D\u001E\u001F\u2028\u2029\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2008\u2009\u200a\u205f\u3000]/;
const SYMBOL_NS_DELIMITER = token("/");
const SYMBOL_BODY = choice(SYMBOL_HEAD, /[:#'0-9]/);
const SYMBOL_NAMESPACED_NAME = token(repeat(choice(SYMBOL_HEAD, /[\/:#'0-9]/))) ;
// XXX: no attempt is made to enforce certain complex things, e.g.
//
// Symbols beginning or ending with ':' are reserved by Clojure.
// A symbol can contain one or more non-repeating ':'s
const SYMBOL = token(seq(SYMBOL_HEAD, repeat(SYMBOL_BODY)));
// later in the grammer def //
_sym_qualified: $ =>
prec(1, seq(
field('namespace', alias(SYMBOL, $.sym_ns)),
field('delimiter', alias(SYMBOL_NS_DELIMITER, $.delimiter)),
field('name', alias(SYMBOL_NAMESPACED_NAME, $.sym_name))
)),
_sym_unqualified: $ =>
field('name', alias(
choice(
seq(SYMBOL_NS_DELIMITER, $._ws), // division symbol, must be followed by whitespace
SYMBOL
),
$.sym_name
)),
sym_lit: $ =>
seq(
repeat($._metadata_lit),
choice($._sym_qualified, $._sym_unqualified)
),
~The changeset I have now is a mess and needs to be cleaned up.~ It also needs some extensive testing. But I think this could also be considered as an alternative to the proposal of introudcing sym_name.
EDIT: Created a PR https://github.com/sogaiu/tree-sitter-clojure/pull/31
@philomates I got the same parse tree mentioned by dannyfreeman:
(source [0, 0] - [1, 0]
(map_lit [0, 0] - [0, 25]
meta: (meta_lit [0, 0] - [0, 18]
value: (map_lit [0, 1] - [0, 18]
value: (kwd_lit [0, 2] - [0, 12])
value: (bool_lit [0, 13] - [0, 17])))
value: (kwd_lit [0, 20] - [0, 22])
value: (num_lit [0, 23] - [0, 24])))
for the code:
^{:some-meta true} {:a 1}
It looks ok to me.
FWIW, this testing was done using 8c23e0ec078af461ccad43fffbbfc204aa6bc238.
In this comment, Olical mentioned:
Possibly to do with tree sitter or lack thereof?
It looks like Conjure can use tree-sitter but doesn't necessarily unless set up to do so: https://github.com/Olical/conjure#tree-sitter-support -- do you know if tree-sitter is working for your setup?
@dannyfreeman Thanks for your thoughts.
I'll hold off on deciding about what to do for regex_lit for the moment -- as you said, it seems less important than the sym_lit one.
I tried out #31 a bit and am in the process of examining test results.
I made a comment at the PR about one issue with this line:
seq(SYMBOL_NS_DELIMITER, $._ws), // division symbol, must be followed by whitespace
along with some examples demonstrating the issue.
This issue is resolved as of 7732550fa001c334dd99bea80f0ef6e397197a76 !!!