`show-paren-mode` highlights irrelevant things for some expressions
For treesitter major modes, show-paren-data-function is set to treesit-show-paren-data by default. This function highlights the first and the last children of a list "thing", which is not always matching pairs. For example:
- Function literals:
- Sets:
- Expressions with metadata (BTW, I really don't like how the grammar works with metadata, it's pain to work with and causes more problems than benefits):
I can implement a custom show-paren-data-function for clojure-ts-mode that will always highlight matching pairs. In this case navigation might look inconsistent, because an opening paren is somewhere in the middle of a node. For example from this position:
executing backward-sexp would bring cursor to this position:
because the vector apparently starts from ^.
Would this change be an improvement? Or we better keep the existing behavior?
Expressions with metadata (BTW, I really don't like how the grammar works with metadata, it's pain to work with and causes more problems than benefits)
Please feel free to share further thoughts on this.
FWIW, it wasn't easy to get it to work and it was done very intentionally to not repeat how rewrite-clj does things. It was the way that occurred to me, but perhaps there are other (tree-sitter-compatible practical) approaches that are overall wins.
At this point I'm not sure if it's practical to change but as you can see the grammar is not that complicated so it may be feasible to create an alternate one. If going that route, may be people can share / enumerate any other things they'd like to be different.
Off the top of my head, there is at least one thing we did that if we could start over I would go for and it's this.
Please feel free to share further thoughts on this.
Sure. Here are some issues
Nodes boundaries
If an expression has metadata, the beginning of the expression automatically becomes the first character of the metadata node. So thinks like beginning-of-defun, beginning-of-sexp etc will navigate to the wrong place IMO. For example:
^{:some "meta"}
(defn to-string
^String
[arg]
(.toString arg))
executing C-M-a will put the point to the ^ character of the ^{:some "meta"}.
This also breaks functions like raise-sexp, backward-up-list etc because they use sexp navigation functions internally.
For refactoring commands we often need to find beginning of a sexp. Ideally it should be (treesit-node-start node), but we need to skip the possible metadata child and find an opening paren node: https://github.com/rrudakov/clojure-ts-mode/blob/3522b57877b2b9397696ab2e97062a8f25b12e3e/clojure-ts-mode.el#L2111https://github.com/rrudakov/clojure-ts-mode/blob/3522b57877b2b9397696ab2e97062a8f25b12e3e/clojure-ts-mode.el#L2111
Partially related
Some other node type like anon_fn_lit or set_lit also have issues with boundaries. I recently reported an Emacs bug: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=78458. So far there is no solution.
I'm curious why some nodes have separate marker and open nodes?
(anon_fn_lit marker: # open: (
value: (sym_lit name: (sym_name))
value: (num_lit)
value: (sym_lit name: (sym_name))
close: ))(anon_fn_lit marker: # open: (
value: (sym_lit name: (sym_name))
value: (num_lit)
value: (sym_lit name: (sym_name))
close: ))
Wouldn't it be simpler to have a single open node as #( for function literals and #{ for sets?
Lots of metadata checks in clojure-ts-mode
In the same example as above, getting the first symbol of the function definition list node (defn), ideally i'd just like to use (treesit-node-child <list_lit_node> 0 t), but the first child of the list_node is metadata node (even though it's visually outside), so we have to always check that the first child is not a metadata (https://github.com/rrudakov/clojure-ts-mode/blob/3522b57877b2b9397696ab2e97062a8f25b12e3e/clojure-ts-mode.el#L813).
Some indentation rules require that metadata would not be treated as part of the next expression (see https://github.com/clojure-emacs/clojure-ts-mode/issues/61#issuecomment-2759498623), so we need to do hacks like https://github.com/rrudakov/clojure-ts-mode/blob/3522b57877b2b9397696ab2e97062a8f25b12e3e/clojure-ts-mode.el#L1401
One can just grep for "metadata" string in the clojure-ts-mode.el.
Semantic
In the example code snippet the ^String metadata indicates the type of a value returned from the function, but it's parsed as part of the [arg] vector.
In the expression:
[1 ^long 2 3]
the parsed tree is:
(vec_lit open: [ value: (num_lit)
value:
(sym_lit
meta:
(meta_lit marker: ^
value: (sym_lit name: (sym_name)))
(sym_name))
value: (num_lit) value: (num_lit) close: ])
^long is not part of the 2, although semantically it is its type hint. Same for [1 ^String "hello" 3].
But in the expression:
[1 ^String (str "hello" "world") 3]
the parsed tree is:
(vec_lit open: [ value: (num_lit)
value:
(list_lit
meta:
(meta_lit marker: ^
value: (sym_lit name: (sym_name)))
open: (
value: (sym_lit name: (sym_name))
value: (str_lit) value: (str_lit) close: ))
value: (num_lit) close: ])
so ^String is part of the list (str "hello" "world").
I might have forgotten something, this is still fresh in my memory :)
Off the top of my head, there is at least one thing we did that if we could start over I would go for and it's this.
Not sure about other editors, but in Emacs it's quite simple to parse arbitrary nodes using other grammars, we do that for docstrings and regexp literals in clojure-ts-mode.el, so I'm not sure that special node type is required for that.
At this point I'm not sure if it's practical to change but as you can see the grammar is not that complicated so it may be feasible to create an alternate one. If going that route, may be people can share / enumerate any other things they'd like to be different.
One thing that I mentioned in my first response, but it's probably worth discussing separately, is separation of marker and open nodes for some types. But I'm not 100% sure it will solve the issues mentioned in the Emacs bug report. Maybe it could be solved in Emacs itself.
Thanks for the details and your thoughts.
I'll try to reconstruct some reasoning for why certain things are the way they are, but it has been quite a while and I may not succeed too well (^^;
I believe part of the original motivation for not doing it the way rewrite-clj does it is described here. In rewrite-clj, if metadata exists, it ends up containing the thing it is metadata for ("metadatee") -- in the simple case of a single piece of metadata. This leads to complicated processing because if one encountered metadata, one would need to "dig in" to reach the corresponding metadatee, sometimes repeatedly (I think for when there was more than one piece of metadata).
tree-sitter-clojure went with the idea of providing a "logical" view of things which is that the metadata nodes are part of the corresponding metadatee node. For future readers, a concrete example is list_lit:
list_lit: $ =>
seq(repeat($._metadata_lit),
$._bare_list_lit),
_bare_list_lit: $ =>
seq(field('open', "("),
repeat(choice(field('value', $._form),
$._gap)),
field('close', ")")),
This approach has the upside of being able to simply get at all of the metadata from the corresponding metadatee. It has at least the downside described in this earlier comment.
An alternate approach (I'm guessing hinted at in earlier comments?) is for individual pieces of metadata to have their own completely separate nodes in the parse tree. That is, the metadata nodes just live before the corresponding metadatee node and are not children of the metadatee node.
If we collect enough flaws / deficiencies of the existing grammar, may be it will become clear that a different grammar would be worth making.
why some nodes have separate marker and open nodes?
I think open and close fields are only defined on the _bare_*_lit nodes (for list, map, vec and set). These were the corresponding delimiters for the constructs that the nodes represent.
anon_fn_lit, read_cond_lit, and splicing_read_cond_lit all reuse _bare_list_lit and they all start with a different "sigil" (i.e. #, #?, and #?@, respectively), so they all got their own marker fields.
A similar situation exists for other nodes too (e.g. ns_map_lit).
I think there is a certain consistency to this approach, but if one were to start over, may be there is more value in the suggested alternate idea.
Regarding the semantic points, I think I failed to consider such cases. Thanks for mentioning these, may be it's further evidence in support of another grammar.
If you don't mind, would you provide an example of where something like [1 ^long 2 3] is valid? I'm not doubting it can be, it's just that when I enter that as-is via clj I get:
$ clj
Clojure 1.12.0
user=> [1 ^long 2 3]
Syntax error reading source at (REPL:1:11).
Metadata can only be applied to IMetas
3
Syntax error reading source at (REPL:1:14).
Unmatched delimiter: ]
Off the top of my head, there is at least one thing we did that if we could start over I would go for and it's this.
Not sure about other editors, but in Emacs it's quite simple to parse arbitrary nodes using other grammars, we do that for docstrings and regexp literals in
clojure-ts-mode.el, so I'm not sure that special node type is required for that.
The TLDR is that not all editors support the ability to "offset" and some grammars are more friendly to those editors. I think it's possible it might be more efficient, but I haven't measured.
Since tree-sitter-clojure is used by more than Emacs, it is something we considered. Had we known much earlier, I think we would have added support. These days more editors may have adequate support so it may not be as much of an issue, but AFAIU it is more work for an editor.
If you don't mind, would you provide an example of where something like
[1 ^long 2 3]is valid? I'm not doubting it can be, it's just that when I enter that as-is viacljI get:
Actually, you're right, it's not valid. I noticed it while I was implementing indentation rules for collection items with metadata and clj-kondo doesn't report it as an invalid expression, but I never actually tried to evaluate it. My bad.
I think I can remove one of the unnecessary hacks from the code :) Thanks!
An alternate approach (I'm guessing hinted at in earlier comments?) is for individual pieces of metadata to have their own completely separate nodes in the parse tree. That is, the metadata nodes just live before the corresponding metadatee node and are not children of the metadatee node.
The biggest issue IMO is navigation, and it something that should work very well in a text editor. I guess rewrite-clj has slightly different purpose than being used in a text editor.
I think having metadata as standalone nodes would solve a lot of code navigation issues at least in Emacs.
As a workaround we could probably write custom up-list-function, forward-sexp-function, forward-list-function etc specifically for clojure-ts-mode that would consider potential metadata and navigate point to a proper place. That would require much more work than just using built-in features provided by treesit.el.
Thanks for the clarification about [1 ^long 2 3].
I think having metadata as standalone nodes would solve a lot of code navigation issues at least in Emacs.
Apart from whether there are issues for Emacs, the return type metadata case is not correctly handled semantically as you pointed out. I think it is a pretty serious flaw with the grammar's existing approach to metadata...one that doesn't seem like it could be "patched" with minimal changes. If it's possible to get the standalone nodes approach to work with tree-sitter, it seems like it would be more "correct".
Regarding the approach taken in rewrite-clj, I'm not sure the original author was happy with the decision in retrospect but I don't have a reference for that idea :)
Hmmm, come to think of it, I wonder how rewrite-clj handles the return type case...I suspect that there would be a metadata node that contains the argument vector...
Regarding the semantic points, I think I failed to consider such cases.
(defn to-string
^String
[arg]
(.toString arg))
this point is still valid, ^String here is not related to the [arg] vector.
Yes, I agree. Please see my comment above :)
Hmmm, come to think of it, I wonder how rewrite-clj handles the return type case...I suspect that there would be a metadata node that contains the argument vector...
Looks like it:
$ rlwrap bb
Babashka v1.12.197 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.
user=> (require '[rewrite-clj.zip :as z])
nil
user=> (def zloc (z/of-string "(defn a-fn ^String [x] (.toString x))"))
#'user/zloc
user=> (-> zloc z/down z/right z/right z/node)
<meta: ^String [x]>
FWIW, here's my attempt to make forward/backward-sexp to behave similar to clojure-mode:
forward-backward-sexp.el
(defun clj-open-node-p (node)
(treesit-node-match-p node (rx (or "(" "{" "["))))
(defun clj-close-node-p (node)
(treesit-node-match-p node (rx (or ")" "}" "]"))))
(defun clj-forward-single-sexp (node-at-point)
(let* ((pos (point))
(parent-sexp (treesit-thing-at pos 'sexp)))
(cond
;; Check if node at point is an opening paren. Then we simply
;; navigate to the end of the parent node.
((and parent-sexp (clj-open-node-p node-at-point))
(goto-char (treesit-node-end parent-sexp)))
;; Check if we are at the end of a current node and the next
;; sibling is an opening paren. Then we also navigate to the
;; end of the parent node.
((and parent-sexp
(= (treesit-node-end node-at-point) pos)
(thread-first parent-sexp
(treesit-node-first-child-for-pos pos)
(clj-open-node-p)))
(goto-char (treesit-node-end parent-sexp)))
;; Check if we are at the end of a current node and the next
;; sibling is a node with metadata. Then we navigate to the end
;; of the metadata node.
((and parent-sexp
(= (treesit-node-end node-at-point) pos)
(thread-first parent-sexp
(treesit-node-first-child-for-pos pos)
(treesit-node-child 0 t)
(clojure-ts--metadata-node-p))
(goto-char (thread-first parent-sexp
(treesit-node-first-child-for-pos pos)
(treesit-node-child 0 t)
(treesit-node-end)))))
;; If we are at the beginning of a parent with metadata, we
;; should navigate to the end of metadata.
((treesit-node-match-p node-at-point (rx (or "^" "#^")))
(goto-char (treesit-node-end (clojure-ts--parent-until 'sexp))))
;; At last we fallback to the original behavior.
(t (treesit-end-of-thing 'sexp 1 'restricted)))))
(defun clj-backward-single-sexp (node-at-point)
(let* ((pos (point))
(node-before-point (treesit-node-at (1- pos) (treesit-language-at pos))))
(cond
;; Check if node at point is an opening paren and node before is a metadata
;; node. Then we should navigate to the beginning of metadata.
((and (clj-open-node-p node-at-point)
(clojure-ts--metadata-node-p (treesit-node-prev-sibling node-at-point t)))
(goto-char (treesit-node-start (treesit-node-prev-sibling node-at-point t))))
;; Check if we are at the end of a metadata node. Then we should navigate
;; to the beninning of it.
((and (= (treesit-node-end node-at-point) pos)
(or (clojure-ts--metadata-node-p (treesit-node-parent node-at-point))
(clojure-ts--metadata-node-p (treesit-node-parent (treesit-node-parent node-at-point)))
(clojure-ts--metadata-node-p (treesit-node-parent (treesit-node-parent (treesit-node-parent node-at-point))))))
(goto-char (treesit-node-start (treesit-parent-until node-at-point "meta_lit" t))))
;; Check if node before point is a closing paren. Then we should
;; navigate to a matching opening paren.
((and node-before-point
(clj-close-node-p node-before-point)
(<= (treesit-node-start node-before-point)
(1- pos)
(treesit-node-end node-before-point)))
(goto-char (clojure-ts--node-start-skip-metadata (treesit-thing-at (1- pos) 'sexp t))))
;; Check if node before point is an opening paren. Then we cannot go further.
((and node-before-point
(clj-open-node-p node-before-point)
(<= (treesit-node-start node-before-point)
(1- pos)
(treesit-node-end node-before-point)))
nil)
;; At last we fallback to the original behavior.
(t (treesit-beginning-of-thing 'sexp 1 'restricted)))))
(defun clj-forward-sexp-internal (node-at-point arg)
(let ((res))
(dotimes (_ arg)
(setq res (clj-forward-single-sexp node-at-point)))
res))
(defun clj-backward-sexp-internal (node-at-point arg)
(let ((res))
(dotimes (_ (abs arg))
(setq res (clj-backward-single-sexp node-at-point)))
res))
(defun clj-forward-sexp (&optional arg)
(interactive "^p")
(let* ((arg (or arg 1))
(node-at-point
(treesit-node-at (point) (treesit-language-at (point)))))
(or (when (and node-at-point
;; Make sure point is strictly inside node.
(< (treesit-node-start node-at-point)
(point)
(treesit-node-end node-at-point))
(treesit-node-match-p node-at-point 'text t))
(forward-sexp-default-function arg)
t)
(if (> arg 0)
(clj-forward-sexp-internal node-at-point arg)
(clj-backward-sexp-internal node-at-point arg))
;; If we couldn't move, we should signal an error and report
;; the obstacle, like `forward-sexp' does. If we couldn't
;; find a parent, we simply return nil without moving point,
;; then functions like `up-list' will signal "at top level".
(treesit--scan-error 'sexp arg))))
There are still some cases when it doesn't work as expected. Other functions, like up-list, down-list etc would have similar complexity.
To follow up a bit on this:
why some nodes have separate marker and open nodes?
While nodes such as set_lit and anon_fn_lit must have the appropriate opening delimiter right after their leading #, there are some nodes that are "flexible" in what they allow immediately after their marker / prefix fields.
Specific examples include (but are not limited to) read_cont_lit, splicing_read_cond_lit, ns_map_lit, and var_quoting_lit.
These all allow something (including whitespace) between their marker fields and what follows.
For example, the following are all legitimate AFAIU:
#? (:clj 1 :cljr 2) ;; space after `#?`
[#?@ (:clj () :cljr [])] ;; space after `#?@`
#:prefix {:a 1} ;; space after `#:prefix`
#:: {} ;; space after `#::`
#' #?(:clj prn :cljs print) ;; space after `#'`
,,,
For reference, below is the current definition for ns_map_lit:
ns_map_lit: $ =>
seq(repeat($._metadata_lit),
field('marker', "#"),
field('prefix', choice($.auto_res_mark,
$.kwd_lit)),
repeat($._gap),
$._bare_map_lit),
Note that after the prefix field (which could potentially be merged with the marker field), there is repeat($._gap), and _gap is defined as:
_gap: $ =>
choice($._ws,
$.comment,
$.dis_expr),
From this I hope it is clear "how" there can be a variety of things between a marker / prefix field and an opening delimiter for some nodes.
Thanks for the explanation @sogaiu. I'm going to try what Juri suggested here (adding sexp-default thing). Perhaps it will solve the issue with anon_fn_lit and set_lit for clojure-ts-mode.
Also I really like the modified grammar with standalone metadata nodes, I didn't have to change much code, all the features work and built-in navigation is way better.
I've pushed a feature branch which uses the grammar with standalone metadata nodes: https://github.com/rrudakov/clojure-ts-mode/tree/feature/alternative-grammar
All tests pass, all the existing features work and I also fixed a few bugs I've found (irrelevant to the discussed topic). I'm going to use it until we make a decision about new grammar.
In this branch the problem, mentioned in the issue description, simply doesn't exist for nodes with metadata, because they are not part of the following node anymore. The issue with markers for function literals and sets still exists, but it's minor and can be fixed.
The issue with navigation using C-M-f still exists for sets and function literals, I hope to get some help on how to solve it from the Emacs mailing list. The proposed solution with defining sexp-default thing didn't work.
We can close this one now, when #99 is merged. I'm really happy about it. Thanks to everyone involved!
Thanks for working on this! 🙇♂