Highlight (some) regular expressions using another grammar

Open sogaiu opened this issue 2 years ago • 7 comments

I saw the following bit in the emacs-devel archives:

some files may consist of several parts requiring different tree-sitter grammars. For example, a JavaScript file may have its documentation written with jsdoc: JavaScript and jsdoc have a tree-sitter grammar each.

Is there a way to use a tree-sitter grammar in parts of the file and another one in other parts? There could be a main grammar and secondary grammars would be activated on some kinds of nodes of the main one.

Yes, it should be possible, AFAIU. See the node "Multiple Languages" in the ELisp manual, I believe it explains how to do what you want.

As an idea for "somewhere down the line", perhaps it would be interesting to consider the following...

Since tree-sitter-clojure can recognize regex literals, maybe one could apply an appropriate regular expression grammar to highlight the portions within the double quotes.

I don't know how close this grammar is to Clojure's flavor of regex, but maybe it, some appropriate modification of it, or something that inherits from it could be used for the task.

For reference, the part of the manual being referred to in the quote above can be seen in .texi form here. I didn't manage to find an HTML version. If you've got a recent enough Emacs built from the emacs-29 branch, the info may be viewable from within Emacs. Worked for me anyway...


Ah sorry. Maybe I should have made this in the Discussions area?

sogaiu avatar May 29 '23 10:05 sogaiu

Ah sorry. Maybe I should have made this in the Discussions area?

No, an issue is fine. I don't even get notifications from discussions, lol.

This is a good idea. Clojure uses Java-flavored regular expressions, and I'm not sure how much they differ from that grammar. If the dialects differ enough, it might be worth forking it and calling the fork tree-sitter-java-regex.

dannyfreeman avatar May 29 '23 13:05 dannyfreeman

I don't have the various flavors loaded into my head lately [1].

If I had to guess without looking too closely, I think this is likely to be some JavaScript flavor (or subset of one).

I also don't know / recall whether the various Clojure dialects all support the same regex syntax.

Perhaps this might come in handy eventually.


[1] Mostly working with PEGs in another language ;)

sogaiu avatar May 29 '23 22:05 sogaiu

Came across this content among Lapce's files:

((regex_lit) @injection.content
 (#set! injection.language "regex"))
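
In Emacs terms, a roughly equivalent range rule could be sketched as follows. This is a minimal sketch, not what clojure-ts-mode does: it assumes Emacs 29 or later built with tree-sitter support and grammars installed under the names `clojure` and `regex`, and `my/clojure-regex-ranges-setup` is a hypothetical helper name. It only assigns a regex parser to `regex_lit` nodes; font-lock rules for the regex grammar would still be needed to get any highlighting.

```elisp
;; Sketch only. Assumptions: Emacs 29+ with tree-sitter support, and
;; grammars installed under the language names `clojure' and `regex'.
(require 'treesit)

(defun my/clojure-regex-ranges-setup ()
  "Hypothetical helper: let a `regex' parser cover Clojure regex literals."
  ;; Do nothing unless both grammars can actually be loaded.
  (when (and (treesit-ready-p 'clojure t)
             (treesit-ready-p 'regex t))
    (setq-local treesit-range-settings
                (treesit-range-rules
                 :embed 'regex   ; grammar used inside the captured nodes
                 :host 'clojure  ; grammar whose tree is queried
                 ;; `regex_lit' is the node name from the query above.
                 ;; The captured range spans the whole literal.
                 '((regex_lit) @capture)))))
```

A major mode would normally set this before calling `treesit-major-mode-setup`, alongside its font-lock settings.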

sogaiu avatar Jun 20 '23 22:06 sogaiu

@sogaiu check this out 855cddd124eb4ed9197281fe7f56697902b35cb1

Seems useful for other languages as well. Maybe it even belongs in Emacs core.

dannyfreeman avatar Aug 24 '23 18:08 dannyfreeman

Thanks for the heads up!

Hope to take a look soon.

sogaiu avatar Aug 25 '23 00:08 sogaiu

Ok, I gave it a try.

I see about capturing #" and ":

[Screenshot: clojure-ts-mode-with-regex]

sogaiu avatar Aug 25 '23 01:08 sogaiu

On a side note, maybe it's worth requesting that tree-sitter-regex get added to tree-sitter-module?

sogaiu avatar Aug 25 '23 01:08 sogaiu

@rrudakov Perhaps we can apply your learnings from the markdown-inline work here?

bbatsov avatar Apr 15 '25 17:04 bbatsov

@rrudakov Perhaps we can apply your learnings from the markdown-inline work here?

I think the biggest issue here is finding a proper grammar. The grammar mentioned in the discussion supports PCRE2, POSIX and JavaScript regexps, and I'm not sure that any of those is fully compatible with Java regexps. One difference I can think of is the use of double backslashes in Java.

If we find a grammar, adding a new parser and syntax highlighting is pretty straightforward.

rrudakov avatar Apr 15 '25 20:04 rrudakov

I think PCRE2 will work well for our case since, if I recall correctly, Java's regular expressions were derived from Perl 5. We'll have to check this, though.

bbatsov avatar Apr 15 '25 20:04 bbatsov

[Screenshot]

It works pretty well. We need to decide what we want to highlight and which faces to use for different elements (I'm not a designer and I'm not a regex expert :) ). The possibilities for syntax highlighting are endless (see the syntax tree in the right buffer).

rrudakov avatar Apr 16 '25 19:04 rrudakov

[Screenshot]

With dark color scheme.

rrudakov avatar Apr 16 '25 19:04 rrudakov

There is also an issue in Emacs. When local parsers are used, the offset setting has no effect, so the hash sign and quotes are also included in the range (this also applies to our markdown-inline parser).

It has been reported to the Emacs bug tracker: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=77848

rrudakov avatar Apr 16 '25 19:04 rrudakov

[Screenshot]

With dark color scheme.

This looks good to me. I was going to suggest focusing on match groups, character classes, escapes, anchors, and modifiers, and I guess that's what you did.

bbatsov avatar Apr 16 '25 20:04 bbatsov

It has been reported to the Emacs bug tracker: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=77848

The bug is fixed on Emacs master. On Emacs 30 the offset feature doesn't exist, which means that ranges for embedded parsers (markdown-inline and regex) will include the quotes and, for regex literals, the hash character.

rrudakov avatar Apr 18 '25 19:04 rrudakov