[Regular Expression] Unified scope names for (embedded) Regular Expressions

Open jwortmann opened this issue 6 years ago • 10 comments

Various languages have built-in support for Regular Expressions, and some default syntax packages provide rules to apply suitable scopes within RegExp strings. However, these scope names differ in various places. For example, the standalone RegExp syntax and Clojure use keyword.operator.alternation.regexp for the | symbol, while JavaScript, Python and PHP use keyword.operator.or.regexp. Another example is character classes such as \d or \w, which get the scope keyword.control.character-class.regexp in the standalone RegExp syntax, constant.other.character-class.escape.backslash.regexp in JavaScript and constant.character.character-class.regexp in Python and PHP. Other languages such as Tcl and Ruby recognize RegExp strings, but do not apply specific scopes other than string.regexp, which prevents syntax highlighting of Regular Expressions in these languages.

I want to refine my color scheme for consistent RegExp highlighting, but the currently used scope names make it difficult to find common highlighting rules for all languages. My knowledge of syntax definitions is somewhat limited, but as far as I know there is the possibility to embed a syntax within another language syntax (e.g. CSS in HTML). Maybe this could be applied to parse RegExp strings and allow consistent syntax highlighting of Regular Expressions in more languages.

Regular Expression syntax:

    (?<=(T|t)he\s)(cat)$
(?#  ^^^ constant.other.assertion )
(?#       ^ keyword.operator.alternation )
(?#            ^^ keyword.control.character-class )
(?#                    ^ keyword.control.anchors )

JavaScript syntax:

var regex = /(?<=(T|t)he\s)(cat)$/;
//            ^^^ punctuation.definition.group.assertion
//                 ^ keyword.operator.or
//                      ^^ constant.other.character-class.escape.backslash
//                              ^ keyword.control.anchor

Python syntax:

regex = r'(?<=(T|t)he\s)(cat)$'
#          ^^^ constant.other.assertion
#               ^ keyword.operator.or
#                    ^^ constant.character.character-class
#                            ^ keyword.control.anchor

Ruby syntax:

regex = /(?<=(T|t)he\s)(cat)$/
#                    ^^ constant.character

Clojure syntax:

#"(?<=(T|t)he\s)(cat)$"
;  ^^^ constant.other.assertion
;       ^ keyword.operator.alternation
;            ^^ keyword.control.character-class
;                    ^ keyword.control.anchors
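
To make the mismatch concrete, the scope names from the examples above can be tabulated in a short Python sketch. The token labels ("lookbehind", "alternation", and so on) and the helper names are made up for illustration, and None stands for "no specific scope shown in the example"; the scope strings themselves are copied from the examples, without the trailing .regexp.

```python
# Scope names copied from the examples above. The token labels (dict keys)
# are hypothetical and only serve to line the syntaxes up against each other.
SCOPES = {
    "RegExp": {
        "lookbehind": "constant.other.assertion",
        "alternation": "keyword.operator.alternation",
        "character-class": "keyword.control.character-class",
        "anchor": "keyword.control.anchors",
    },
    "JavaScript": {
        "lookbehind": "punctuation.definition.group.assertion",
        "alternation": "keyword.operator.or",
        "character-class": "constant.other.character-class.escape.backslash",
        "anchor": "keyword.control.anchor",
    },
    "Python": {
        "lookbehind": "constant.other.assertion",
        "alternation": "keyword.operator.or",
        "character-class": "constant.character.character-class",
        "anchor": "keyword.control.anchor",
    },
    "Ruby": {
        # The Ruby example only shows a scope for the \s escape.
        "lookbehind": None,
        "alternation": None,
        "character-class": "constant.character",
        "anchor": None,
    },
    "Clojure": {
        "lookbehind": "constant.other.assertion",
        "alternation": "keyword.operator.alternation",
        "character-class": "keyword.control.character-class",
        "anchor": "keyword.control.anchors",
    },
}


def scopes_by_token(scopes):
    """Group syntaxes by the scope name they assign to each token kind."""
    grouped = {}
    for syntax, tokens in scopes.items():
        for token, scope in tokens.items():
            grouped.setdefault(token, {}).setdefault(scope, []).append(syntax)
    return grouped


def inconsistent_tokens(scopes):
    """Return the token kinds that receive more than one distinct scope."""
    grouped = scopes_by_token(scopes)
    return sorted(token for token, names in grouped.items() if len(names) > 1)


if __name__ == "__main__":
    for token, names in scopes_by_token(SCOPES).items():
        print(token)
        for scope, syntaxes in names.items():
            print(f"  {scope or '(no specific scope)'}: {', '.join(syntaxes)}")
```

With the data above, every one of the four token kinds ends up with more than one scope name, which is why a single color scheme rule per token kind is currently not possible.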

Progress

  • [x] Regular Expression
  • [ ] JavaScript
  • [x] Python
  • [x] Clojure
  • [ ] PHP
  • [ ] Ruby

jwortmann avatar Apr 16 '19 18:04 jwortmann

> Maybe this could be applied to parse RegExp strings and allow consistent syntax highlighting of Regular Expressions in more languages.

A very good point. I also wonder why a dedicated regexp syntax exists while different syntaxes use their own implementations. I can imagine three possible reasons:

  1. historical reasons in how the syntaxes were developed
  2. different feature levels and implementations of the underlying regexp engines of the various languages, which make merging everything together impossible without causing things to be highlighted incorrectly in individual syntaxes.
  3. the dedicated regexp syntax seems quite heavy compared to some others and causes significant slowdowns in parsing when embedded in other languages. After embedding that syntax into a new TCL implementation to overcome the string.regexp.tcl limitations, the parsing time of some official TCL library sources increased by 20 to 30%.

That said, I agree that the regexp syntaxes are a bit inconsistent in terms of scope naming. I'd guess the scopes were applied based on existing color schemes rather than on logical structure. One reason might be that there is no clear set of rules for how to name the different parts of a regexp.

I'd never call ?<= a constant, for instance. As the definition of a lookbehind it would need to be scoped as keyword.operator or punctuation.definition.lookbehind. Same with all the parentheses. They are not operators but punctuation, ... .

\d and \w and friends are constant.character.escape.

deathaxe avatar Apr 19 '19 08:04 deathaxe

I definitely think number 2 is the biggest contributor. At least, that is why I haven't switched any of the syntax definitions I have worked on to use the "generic" one (where number 1 applies). It's not really generic: it's designed with ST's Find functionality in mind. For example, whether \< is an unnecessarily escaped char or a meta character depends on the engine used. I haven't compared performance, but the slowdown doesn't surprise me, as the embedded regex definitions are generally much simpler and less accurate than the main standalone one (not referring to it as "generic" any more ;)). That said, there is clearly room for improvement/unification of scopes. Maybe the embedded ones could include contexts from the standalone one, if we design it in such a way that those contexts are generic enough to apply to multiple regex parser/engine implementations, so that we don't duplicate work/scopes etc.
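
As a small illustration of the \< point, here is how one particular engine, Python's re module, handles it. This sketch only shows the Python side; that other engines (such as the one behind ST's Find) treat \< as a meta character is taken from the comment above, not demonstrated here. The helper names are made up for the example.

```python
import re


def is_literal_angle(pattern=r"\<"):
    """True if the pattern matches exactly the one-character string "<"."""
    return re.fullmatch(pattern, "<") is not None


def find_spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]


# In Python's re, a backslash before ASCII punctuation is just an escaped
# literal, so r"\<" consumes a "<" character instead of asserting a position.
print(is_literal_angle())
print(find_spans(r"\<", "a < b"))

# Python spells the zero-width word-boundary assertion as \b instead.
print(find_spans(r"\bcat", "cat concat"))
```

A syntax definition shared between engines would therefore have to scope the same two characters as an escaped literal in one language and as an assertion in another.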

keith-hall avatar Apr 19 '19 08:04 keith-hall

Now that we have syntax inheritance, it makes more sense than ever IMO to have a "base/common" regex syntax, and inherit from that for "extra" features not generally supported. So we might have separate syntaxes like (maybe we already do, didn't really check):

  • Regex Common
  • Regular Expressions (<- the one used in the Find panel)
  • PHP Regular Expressions
  • Python Regular Expressions

and so on. Then, most scopes would be set by Regex Common and it could solve any scope mismatches. It does mean we'd have more syntax definitions as they could no longer be embedded in the "owning language"'s syntax definition, but as they'd be hidden I think it would have little real-world impact.
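
The layering idea can be pictured with a rough Python analogy. This is not sublime-syntax code and says nothing about how inheritance is actually implemented; it only models "look in the language-specific layer first, then fall back to the common one". All dictionary names and the two "only-feature" entries are hypothetical; the two common scope names are the ones the standalone syntax uses in the issue description.

```python
from collections import ChainMap

# Shared layer: scope names every derived regex syntax would pick up.
REGEX_COMMON = {
    "alternation": "keyword.operator.alternation.regexp",
    "character-class": "keyword.control.character-class.regexp",
}

# Hypothetical engine-specific layers; these keys and scopes are placeholders.
PYTHON_EXTRAS = {"python-only-feature": "meta.hypothetical.python-only.regexp"}
PHP_EXTRAS = {"php-only-feature": "meta.hypothetical.php-only.regexp"}

# Lookups try the language layer first and fall back to the common layer.
python_regex = ChainMap(PYTHON_EXTRAS, REGEX_COMMON)
php_regex = ChainMap(PHP_EXTRAS, REGEX_COMMON)

# Shared tokens resolve to the same scope in both derived layers.
print(python_regex["alternation"] == php_regex["alternation"])

# Engine-specific tokens stay local to their own layer.
print("python-only-feature" in php_regex)
```

In this picture a color scheme would only need rules for the names in the common layer, plus whatever extras a particular language adds.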

keith-hall avatar Aug 25 '21 18:08 keith-hall

Would be a great improvement (and "some" work to do).

Python and PHP already maintain their own dedicated syntax definition files. Perl and some other syntaxes use the Regular Expressions syntax. So the number of syntaxes might not increase too much.

We might probably need some hidden intermediate syntaxes to properly support embedding/interpolation anyway in the future. Just see: #2654, #2789 or #2797.

deathaxe avatar Aug 25 '21 19:08 deathaxe

Is this basically solved now?

michaelblyons avatar Jan 09 '22 00:01 michaelblyons

Just JavaScript left I believe - it still has a Regex syntax which needs to be refactored to extend our base regex syntax

keith-hall avatar Jan 09 '22 05:01 keith-hall

PHP as well, IIRC?

deathaxe avatar Jan 09 '22 16:01 deathaxe

~~Only JavaScript now?~~

michaelblyons avatar May 21 '22 13:05 michaelblyons

Haven't touched PHP's regexp so far, with regards to reusing RegExp package.

deathaxe avatar May 21 '22 13:05 deathaxe

Ruby also uses regex stuff from its own syntax file, which is pretty minimal: | gets no scope, and \-anything is a constant.character.escape.

Ruby does have some heuristic to make sure that /= is divide-and-assign, rather than opening a new regex.

michaelblyons avatar Jul 24 '23 14:07 michaelblyons