permit some no-op escape sequences for compatibility purposes

Open KiChjang opened this issue 6 years ago • 23 comments

Some of the regexes found in https://github.com/ua-parser/uap-core are throwing errors when parsed with the regex crate:

regex parse error:
    (?:\/[A-Za-z0-9\.]+)? *([A-Za-z0-9 \-_\!\[\]:]*(?:[Aa]rchiver|[Ii]ndexer|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]*))/(\d+)(?:\.(\d+)(?:\.(\d+))?)?
       ^^
error: unrecognized escape sequence

This sounds like we're deviating from the regex spec here. Can someone confirm?

KiChjang avatar Jul 28 '18 00:07 KiChjang

Which regex spec are you referring to?

Otherwise, yes, this crate disallows unnecessary escapes. This is to permit the addition of new escapes in a backwards compatible way. It is plausible that we could allow certain no-op escapes (such as for /), but this may or may not solve the larger problem.

BurntSushi avatar Jul 28 '18 00:07 BurntSushi

So I was referring to the ES6 spec for regexes. I believe that there are production-grade projects, such as the one I linked in my first post, which do contain unnecessary backslashes, and I think this crate should definitely provide a way to accept these regexes as-is, without modifying them to conform to the regex syntax introduced in this crate.

For the project I linked, here's a shortlist of what I did to make it compile with this crate:

  • \/ -> /
  • \! -> !
  • \ (backslash-space) -> (unescaped space)
  • |) -> )? (empty alternations)
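
The first three rewrites in this list are mechanical, so they can be sketched as a small preprocessing pass. This is a minimal sketch using only the standard library. The name `strip_noop_escapes` and the `NOOP_ESCAPES` set are hypothetical and simply mirror the list above; they are not part of the regex crate. The empty-alternation rewrite is not covered because it needs real parsing.

```rust
/// Characters whose escapes are dropped. This set mirrors the rewrites listed
/// above; it is an assumption of this sketch, not something the regex crate defines.
const NOOP_ESCAPES: [char; 3] = ['/', '!', ' '];

/// Remove a backslash when it precedes one of `NOOP_ESCAPES`.
/// Every other escape pair, including `\\`, is copied through unchanged.
pub fn strip_noop_escapes(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut chars = pattern.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            // Drop the backslash and keep the literal character.
            Some(next) if NOOP_ESCAPES.contains(&next) => out.push(next),
            // Keep any other escape pair exactly as written.
            Some(next) => {
                out.push('\\');
                out.push(next);
            }
            // A trailing lone backslash is left for the regex parser to reject.
            None => out.push('\\'),
        }
    }
    out
}
```

One caveat: if a pattern enables the x (verbose) flag, unescaped whitespace is ignored, so dropping the backslash from an escaped space would change what the pattern matches. The sketch assumes patterns that do not use that flag.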

KiChjang avatar Jul 31 '18 00:07 KiChjang

So I was referring to the ES6 spec for regexes.

This crate definitely does not, and never will, conform to the ES6 specification for regexes. It is a complete non-goal.

I am much more sympathetic to your practical concerns. I'm generally strongly opposed to allowing escapes to always work even when they are no-ops, but I could get on board with selecting a set of commonly escaped characters in the wild.

It might also be smart to try to fix those projects such that they don't use unnecessary escapes.

Empty alternations are something I'd also like to support, but a bug in the compiler prevents it for now.

BurntSushi avatar Jul 31 '18 00:07 BurntSushi

We actually made our own Rust wrapper for uap-core, and yes, we had to fix unnecessary escapes. Perhaps we could open-source it to avoid duplication of effort?

RReverser avatar Aug 16 '18 12:08 RReverser

@RReverser Were escapes the only reason for that? Or were there other issues that needed to be papered over?

BurntSushi avatar Aug 16 '18 12:08 BurntSushi

the only reason for that

Only reason for what? Implementing Rust version of uap-core?

RReverser avatar Aug 16 '18 15:08 RReverser

@RReverser In the process of doing that, you said you had to fix unnecessary escapes. Was there anything else you needed to do with the regexes specifically to make them work with Rust's regex engine?

BurntSushi avatar Aug 16 '18 15:08 BurntSushi

Ah. Well, another fix I had to do was to replace \d to match only ASCII digits, since it turned out to include Unicode ones as well, which should not be allowed from the point of UA parser, although in all other regards strings should be still matched in Unicode-aware mode (I suppose you remember our discussion about this).

Other than that, no other fixes were necessary, although I did write a bunch of extra analysis and rewrites using regex-syntax to optimise common cases.
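
The \d fix described above could look roughly like the following. This is a standard-library-only sketch, not the wrapper's actual code. The function name `ascii_digits` is hypothetical, and the sketch assumes simple patterns: it does not handle nested character classes, a literal ] placed first in a class, or \D.

```rust
/// Rewrite `\d` so that it matches ASCII digits only.
/// Outside a character class `\d` becomes `[0-9]`; inside one it becomes `0-9`.
pub fn ascii_digits(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len() + 8);
    let mut chars = pattern.chars();
    let mut in_class = false;
    while let Some(c) = chars.next() {
        match c {
            // Escapes are consumed as pairs so that `\\d`, `\[` and `\]`
            // are not mistaken for `\d` or for class delimiters.
            '\\' => match chars.next() {
                Some('d') => out.push_str(if in_class { "0-9" } else { "[0-9]" }),
                Some(other) => {
                    out.push('\\');
                    out.push(other);
                }
                None => out.push('\\'),
            },
            // Track unescaped brackets (no support for nested classes).
            '[' => {
                in_class = true;
                out.push(c);
            }
            ']' => {
                in_class = false;
                out.push(c);
            }
            _ => out.push(c),
        }
    }
    out
}
```

For example, `ascii_digits(r"/(\d+)(?:\.(\d+))?")` yields `/([0-9]+)(?:\.([0-9]+))?`. An alternative that avoids rewriting character classes is to switch Unicode mode off for just that piece of the pattern, as in `(?-u:\d)`, while the rest of the pattern stays Unicode-aware.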

RReverser avatar Aug 16 '18 17:08 RReverser

In an effort to keep conversation on this topic in one place, I'm going to respond to @zackw's issue #522 here:

To recap, my high level current thinking on this topic:

I'm generally strongly opposed to allowing escapes to always work even when they are no-ops, but I could get on board with selecting a set of commonly escaped characters in the wild.

The problem here is that I don't know how much effort we should expend to make the syntax compatible with other regex engines. Surely, we can all agree that 100% compatibility can never happen or be expected. So we have to choose some set of features that gets us part of the way there. I just don't know which things to choose.

The purpose for the current behavior is to permit backwards compatible additions to the syntax of regexes. Some regex engines were developed with very little foresight. Python's regex engine, for example, will permit just about anything to be escaped. Some escapes are significant, but the escapes that aren't significant behave as if they weren't escaped at all. This makes it impossible to add new escape sequences since existing escapes are valid and have specified match semantics.

This regex library chooses to return an error for non-significant escape sequences precisely because we consider turning invalid syntax into valid syntax to be a backwards compatible change, and we can legitimately get away with that by enforcing it.

This framework does not specifically forbid insignificant escape sequences. We can simply choose the ones we want to explicitly allow. , " and ' are all candidates, but they are far from the only ones. Moreover, even if we were to add new escape sequences in the future, it would probably be poor judgment to use an escape sequence that is commonly used in other regex engines as an insignificant escape sequence. e.g., Prescribing special meaning to \" while another regex engine just treats it as a literal " is probably poor form.

I think the most compelling use cases, from my perspective, are huge libraries of regexes. However, in practice, it seems quite difficult to just expect to be able to compile them without any other changes. As @RReverser notes above, the \d, \w and \s escape sequences are all Unicode aware by default, which is not common in other regex engines, which were likely built in a time before Unicode was as widespread as it is today. Therefore, even if we fix the cases of escape sequences, you still wind up with subtle match differences. If pressed, I am sure I could come up with a list of several more cases. What I'm trying to say here is that it may be unreasonable to expect a large existing library of regexes---written specifically for one particular regex engine---to just work out of the box on a different regex engine, even if it might in practice in some number of cases.

BurntSushi avatar Oct 08 '18 13:10 BurntSushi

I understand and appreciate your position that no-op escapes should not be added just because they are no-op escapes in other regex engines.

I would like to offer an independent argument for no-op \", \' and \/ (didn't think of \/ before, but yes, that one too) based on the fact that ", ', and / are commonly used to delimit regex literals in many different languages and contexts, and \", \', \/ are commonly understood to escape the delimitation (i.e. extend the regex literal past a point where it would otherwise have ended). In many cases, backslashes that escape delimitation will be stripped by the "outer" parser before the regex engine sees them, but not all (e.g. the Python raw strings I mentioned in #522). Therefore, no-op \", \' and \/ is not just a matter of compatibility with other regex engines, but other surrounding contexts besides Rust source code.

zackw avatar Oct 08 '18 14:10 zackw

@zackw Ah I see. I don't think I appreciated that point about Python's raw strings when I first read it in #522. Thanks for mentioning that again.

OK, so how about we start off by making ", ', space (0x20), / and ! escapeable but no-op? This will also cement the syntax such that these characters can never be used as a valid escape sequence that does anything other than match the literal being escaped. I think for these characters, that would be reasonable.

We can add more no-op escapes moving forward if they creep up, but I think these are probably the ones I see escaped most often.

BurntSushi avatar Oct 08 '18 14:10 BurntSushi

@burntsushi That sounds reasonable to me. About the empty alternations -- is there a tracking issue in rustc for the bug you mentioned?

KiChjang avatar Oct 08 '18 16:10 KiChjang

@KiChjang Errm, the "compiler" in this context refers to the regex compiler, not rustc. Sorry about the mixup. But no, there is no issue for it because I don't yet understand the bug and it only manifests when empty alternations are allowed. There probably should be an issue for the feature of empty alternations though.

BurntSushi avatar Oct 08 '18 17:10 BurntSushi

+1 from me on no-op treatment for ", ', /, and space. +0 on !. I don't know of any context where ! is used as a delimiter, except sed's alternative-delimiter notation, m!...! where ! can be any single ASCII character; since it can be any character, that notation shouldn't be an argument for anything.

I see that @KiChjang originally asked for \! to be a no-op because of an existing body of JavaScript regexes that use it. JavaScript regex syntax defines \h to be equivalent to h for all characters h where \h has not already been assigned a special meaning (if I'm reading https://tc39.github.io/ecma262/#sec-patterns-static-semantics-character-value correctly), which is exactly the thing @BurntSushi didn't like up above. Could we have some specific examples, with context, of regexes using \! instead of bare ! please? I want to understand why they were written that way.

zackw avatar Oct 08 '18 17:10 zackw

@BurntSushi Regarding empty alternations, it might be a good idea to file an issue just so it's on record as a known problem and something you intend to support in the future.

zackw avatar Oct 08 '18 17:10 zackw

Aye. I opened #524.

BurntSushi avatar Oct 08 '18 17:10 BurntSushi

@zackw To be honest, I don't know why the regexes in the project I linked escape !s, but here are the lines where they do: https://github.com/ua-parser/uap-core/blob/23bfabe34b86f29f4840c9dd1ef6129e685581e3/regexes.yaml#L98-L100

KiChjang avatar Oct 08 '18 18:10 KiChjang

    [A-Za-z0-9 \-_\!\[\]:]*
    [A-Za-z0-9 _\!\[\]:]*

It's not at all clear to me what either of those character classes are supposed to do. And the context makes it sound like they're not supposed to be different, either, for added bafflement. I'm actually left wondering whether someone thought ! was a metacharacter within JS regex character classes (it isn't; it's a metacharacter within shell glob character classes, but that's a totally different ball of wax).

zackw avatar Oct 08 '18 20:10 zackw

Supporting " and / would solve all the cases where I had to manually (and carefully) unescape regexes before passing to Regex::new in several projects, so big 👍 here. I guess ' also makes sense, but never seen ! as being important.

More generally, I think it should be possible to allow no-op escapes for any non-ASCII-alphanumeric characters without breaking forward compatibility, but starting with a limited set is probably better for now.

RReverser avatar Oct 10 '18 17:10 RReverser

I wanted to mention that one of the reasons you might see 'no-op' escapes in the wild is that some languages' regex-escape functions produce them.

For example, Perl's quotemeta() escapes all non-word ASCII characters, and PHP's preg_quote() escapes all 'special' punctuation characters (even ones like ! that are only special when combined with an always-special character like ? that would be escaped anyway). Python's re.escape() used to work like PHP too, but it's been made more selective recently. I don't think JavaScript has a built-in escape function, but it does support look-around, so maybe there's some library that escapes ! for the same reason.

As far as patterns written out by humans, i'm just guessing, but i can think of two reasons they'd do it: (1) the author understands how the escaping works but chooses to rely on the 'no-op' feature so they don't have to remember which characters are special and when (not a bad reason imo), or (2) the author doesn't understand how the escaping works and simply cargo-cults it from the output of those functions, or from the first type of person.

okdana avatar Jul 27 '20 19:07 okdana

Is there an easy way to remove unnecessary escapes from a given string using this crate? If not, is there a list of characters that don't require escapes? I'm attempting to convert an existing PCRE-compatible regex to an expression that this crate can parse. Thanks!

anweiss avatar Oct 28 '20 15:10 anweiss

The crate certainly does not provide any such operation. It wouldn't really make sense to IMO.

As for a list of all meta characters, I think the only stable way to do that is to use is_meta_character from the regex-syntax crate. is_meta_character returns true only for characters that must be escaped in order to use their literal form. Other characters, such as ASCII space, can be escaped but do not need to be escaped.

Now, is_meta_character doesn't give you a list, but you can generate one by just trying all inputs. And since is_meta_character promises that the list will never expand or contract in a semver compatible release, the generated list will be stable. Or you could just use the fact that all meta characters are ASCII, so you only need to check 128 possible inputs instead of the full range of Unicode scalar values.
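
The "try all inputs" approach can be sketched as below. To keep the example free of external crates, the predicate is passed in as a parameter; the intent is that a caller would pass `regex_syntax::is_meta_character`. The helper name `ascii_meta_chars` is hypothetical.

```rust
/// Collect every ASCII character for which `is_meta` returns true.
/// Checking 0..128 is enough under the stated property that all
/// meta characters are ASCII.
pub fn ascii_meta_chars(is_meta: impl Fn(char) -> bool) -> Vec<char> {
    (0u8..128)
        .map(char::from)
        .filter(|&c| is_meta(c))
        .collect()
}
```

With the regex-syntax crate as a dependency, the call would be `ascii_meta_chars(regex_syntax::is_meta_character)`. The result comes back in ascending code point order, which makes it easy to compare against a list generated with a different crate version.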

BurntSushi avatar Oct 28 '20 21:10 BurntSushi

Thanks @BurntSushi for the explanation! Super helpful! Will look at is_meta_character.

anweiss avatar Oct 29 '20 14:10 anweiss

TL;DR: If unrecognized escape sequences are an error, documentation should declare that no new bare metacharacters will be introduced.

I'd like to boost what @okdana said above: Languages like Perl have a simple mnemonic for escaping: "if you put a \ before any punctuation character then that character will not have a special meaning. If you don't put a \ before a letter or digit then that character will not have a special meaning." This allows a regex author who hasn't memorized the pattern specification to still write a "safe" (as in 100% accurate) regex, even if it's uglier than necessary (i.e. too many \ on punctuation characters that aren't metacharacters). A corollary of this principle is that any new regex syntax features in Perl-inspired dialects will either use a punctuation character without a backslash or will use a backslash with a letter.

Every regex engine that I've experimented with, except Rust and Vim[1], follows the "escaping punctuation always produces literal punctuation" rule of thumb. This includes POSIX[2], Perl, PCRE (and thus PHP and Erlang/Elixir), RE2 (including Go), Java, JavaScript/ECMAScript, and Python. This isn't to say that Rust must follow suit, but it does highlight the fact that Rust is definitely "the odd one out" on this front.

I appreciate @BurntSushi's desire to allow the regex semantics to grow in a backwards-compatible way. I think it's also important to provide users with clear guidance about how to write a forward-compatible regular expression. If Rust doesn't want to follow the "any punctuation character can be escaped to turn it into a literal" path, the documentation could state something like

The characters \.+*?()|[]{}^$ are treated as metacharacters unless escaped by \. All other characters will never become metacharacters on their own, but new \-prefixed escape sequences may be added. It is an error to use \ before a character which is not part of a valid escape sequence, to allow for future expansion. (Optional addition, proposed earlier in this issue:) The characters /'" (space, slash, single- and double-quotes) may be included with or without a \ escape and be treated as literals in either case.

In the absence of such a guarantee, a regex author is caught in a bit of a bind if they want to match punctuation literally: they can't escape it (due to the unrecognized escape sequence error) but they may worry that their punctuation character of interest will later become a new metacharacter. It also looks like Rust lacks a "quote literal" syntax (\Q and \E in Perl-derived dialects, \V and \v in Vim) which would provide a third option.

My motivating use case:
I discovered this issue because I'm writing a Vim plugin that generates regular expressions which are passed to a variety of command-line tools using different regex engines, including ripgrep which is how I learned of this divergent behavior in Rust. I'm okay with having different behaviors in regex engines (my plugin takes a style parameter and already handles multiple flavors). But it's not clear to me how to write a forward-compatible "escape literals" function for the Rust engine. I can't currently escape any non-metacharacters, but I don't know what might become a metacharacter in the future. And I don't think there's a way to expose which crate version rg was compiled against, so if a new metacharacter is introduced in the future then writing an "if version" check to work around it would be indirect at best.

The rejection of unknown punctuation escape sequences also means that programs which accept regular expressions as user input have an extra burden if they wish to migrate to Rust, since formerly valid user input would suddenly break. This is true to some degree anyway due to lack of support for backreferences and lookaround[3], but I suspect that over-escaped punctuation is more common than the other lost features, particularly since some other regex dialects (including RE2 for both and POSIX for lookaround) also don't support them.

[1] Vim is a special case, since the "magic" and "very magic" modes change the behavior of \ escapes; Vim also has some nontraditional punctuation metacharacters.

[2] Extended POSIX RE syntax follows this principle; basic syntax requires \ escapes of metacharacters like (){} but seems to accept both \; and ; as matching a literal semicolon.

[3] I am definitely not suggesting that Rust support those features; guaranteed linear time complexity is a much bigger win. "Back references are a dreadful botch," as Henry Spencer's regex library docs say (this phrase appears in both modern BSD and Linux man pages).

flwyd avatar Nov 01 '22 08:11 flwyd

In the absence of such a guarantee, a regex author is caught in a bit of a bind if they want to match punctuation literally: they can't escape it (due to the unrecognized escape sequence error) but they may worry that their punctuation character of interest will later become a new metacharacter.

This sounds like the core of your concern, and I'm not quite sure where it's coming from:

  1. regex::escape is a thing, and it will always do the correct thing.
  2. Much of the point of making some escape sequences unrecognized is so that, some day, we might have the freedom to convert errors into valid regexes. Such a change is not breaking, and that's why it's okay. But changing already valid regexes into a different valid regex is totally a breaking change. It is exactly the thing I am trying to avoid! So certainly, nobody need worry that their "punctuation character of interest will later become a new metacharacter."

Indeed, if you look at regex-syntax::is_meta_character, it says:

Note that the set of characters for which this function returns true or false is fixed and won’t change in a semver compatible release.

In this case, "semver compatible release" seems to refer to regex-syntax, but since regex itself relies on this behavior to provide its API (the set of valid regexes), this actually can't change in a semver compatible release of regex itself.

I can't currently escape any non-metacharacters, but I don't know what might become a metacharacter in the future.

Of course you can. If I took an existing non-metacharacter and turned it into a meta character, then I would have to release regex 2.0.
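
For callers that cannot use regex::escape directly, the reasoning above suggests a simple rule: escape only characters in the fixed metacharacter set and leave everything else bare. A minimal sketch follows. The name `escape_literal` is hypothetical, and the metacharacter set is a parameter because this sketch does not assume any particular list; the authoritative set is whatever is_meta_character reports.

```rust
/// Escape `text` so that it matches literally, assuming `metas` holds every
/// character that must be escaped. Characters outside `metas` are left bare,
/// since escaping them may be rejected as an unrecognized escape sequence.
pub fn escape_literal(text: &str, metas: &[char]) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        if metas.contains(&c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}
```

For instance, with the characters `\.+*?()|[]{}^$` quoted earlier in this thread as `metas` (which may not be the crate's complete set), `escape_literal("a.b;c", &metas)` gives `a\.b;c`: the dot is escaped and the semicolon is left alone.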

The rejection of unknown punctuation escape sequences also means that programs which accept regular expressions as user input have an extra burden if they wish to migrate to Rust, since formerly valid user input would suddenly break. This is true to some degree anyway due to lack of support for backreferences and lookaround[3], but I suspect that over-escaped punctuation is more common than the other lost features, particularly since some other regex dialects (including RE2 for both and POSIX for lookaround) also don't support them.

As you mention, every regex engine has differences. And if the Rust regex crate made its escaping behavior match other regex engines (which one?), then it would do nothing to fix "formerly valid user input would suddenly break." It might fix it for some user inputs, but not all, and it never will. (Here's another example.) The only ways to expose two distinct regex engines and expect them to behave the same given the same input are to either ensure both conform to some spec (and don't try to support more than the spec), or to define your own syntax and transform it into the regex flavor of each engine in a way that produces identical behavior. (I'm not sure the latter is really possible, but I suppose it is in theory.)

You acknowledge as much I think, but for some reason still bemoan this particular difference. I hear you. That's why this issue exists. My plan is still to decree that some escape sequences that produce errors today will never obtain a special meaning (for example, \/) and permit the no-op escape in order to accept more valid regexes.

It also looks like Rust lacks a "quote literal" syntax (\Q and \E in Perl-derived dialects, \V and \v in Vim) which would provide a third option.

I would actually like to support something like \Q and \E, but it's not totally obvious to me how to do it safely. e.g., If you want to insert untrusted strings literally into a regex, the only safe and correct way to do it is with an escape routine, even if something like \Q...\E is available to you. If \Q...\E can't be used safely with untrusted literals, then that just seems like a footgun to me.

This allows a regex author who hasn't memorized the pattern specification to still write a "safe" (as in 100% accurate) regex, even if it's uglier than necessary (i.e. too many \ on punctuation characters that aren't metacharacters).

The regex crate has the same property. If you insert a superfluous backslash, you'll get an error. So there's no risk of messing it up. It either works or it doesn't. In fact, this has been a guiding principle of the syntax design in this crate. As I linked above, consider {. In other regex engines, when is it a literal { and when is it part of a repetition operator? ¯\_(ツ)_/¯

A corollary of this principle is that any new regex syntax features in Perl-inspired dialects will either use a punctuation character without a backslash or will use a backslash with a letter.

... no, that is absolutely not true! Consider % in PCRE2:

$ echo '%' | rg -P '%'
%
$ echo '%' | rg -P '\%'
%

That's a superfluous escape because % is not a meta-character despite being a punctuation character. If PCRE2 decided to turn it into a meta-character, the only option available to them without making a breaking change is to add a new flag in the syntax that changes the meaning of % (or \%). So for example, they could make (?o:\%) do something other than simply match % without it being a breaking change because the user has to opt into the behavior. But they absolutely cannot just up-and-change the meaning of either % or \%.

Now compare this with the behavior of the regex crate:

$ echo '%' | rg '%'
%
$ echo '%' | rg '\%'
regex parse error:
    \%
    ^^
error: unrecognized escape sequence

The latter is an error, which means the regex crate has the freedom to change the meaning of \% in a future semver compatible release without breaking users. Why? Because it's not considered a breaking change to increase the size of the set of valid regexes. It is a breaking change to decrease the size of the set or change the meaning of a regex already in the set.
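
The compatibility rule in the paragraph above can be written down as a toy model. This is only an illustration of the argument, not anything the regex crate implements: a release's syntax is modeled as a map from each valid pattern to a label for what it matches, and a pattern missing from the map stands for a parse error. The names `Syntax` and `is_semver_compatible` are hypothetical.

```rust
use std::collections::HashMap;

/// Toy model of one release: valid pattern -> label for what it matches.
/// A pattern that is absent from the map is a parse error in that release.
pub type Syntax<'a> = HashMap<&'a str, &'a str>;

/// A new release is compatible when every pattern that was valid before is
/// still valid and still has the same meaning. Patterns that were errors
/// before may become valid without breaking anyone.
pub fn is_semver_compatible<'a>(old: &Syntax<'a>, new: &Syntax<'a>) -> bool {
    old.iter()
        .all(|(pattern, meaning)| new.get(pattern) == Some(meaning))
}
```

In this model, giving `\%` a meaning when it used to be an error passes the check, while changing what a bare `%` matches, or removing a pattern that used to be valid, fails it.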

I'm afraid you have everything exactly backwards here.

Every regex engine that I've experimented with, except Rust and Vim[1]

You mention POSIX as an example of something that isn't like Rust or Vim, but POSIX BREs also have "different" escaping rules. It is only POSIX EREs that are similar to things like Perl and PCRE.

BurntSushi avatar Nov 01 '22 12:11 BurntSushi

In the absence of such a guarantee, a regex author is caught in a bit of a bind if they want to match punctuation literally: they can't escape it (due to the unrecognized escape sequence error) but they may worry that their punctuation character of interest will later become a new metacharacter.

This sounds like the core of your concern, and I'm not quite sure where it's coming from:

  1. regex::escape is a thing, and it will always do the correct thing.

The author of a regex doesn't necessarily have the ability to call Rust code, as in the example of user input to rg. In my case, I'm not programming in Rust, I'm programming in Vimscript and escaping user input so that it can be passed to whatever version of ripgrep happens to be installed on the user's system.

  2. Much of the point of making some escape sequences unrecognized is so that, some day, we might have the freedom to convert errors into valid regexes. Such a change is not breaking, and that's why it's okay. But changing already valid regexes into a different valid regex is totally a breaking change. It is exactly the thing I am trying to avoid! So certainly, nobody need worry that their "punctuation character of interest will later become a new metacharacter."

Indeed, if you look at regex-syntax::is_meta_character, it says:

Note that the set of characters for which this function returns true or false is fixed and won’t change in a semver compatible release.

In this case, "semver compatible release" seems to refer to regex-syntax, but since regex itself relies on this behavior to provide its API (the set of valid regexes), this actually can't change in a semver compatible release of regex itself.

That's a great commitment, and the sort of thing I was suggesting. I did not see that commitment in the syntax section of the regex crate or in the regex-syntax crate-level documentation. Elevating this metacharacter documentation to the top-level regex syntax docs would be a big step towards addressing my concern.

I can't currently escape any non-metacharacters, but I don't know what might become a metacharacter in the future.

Of course you can. If I took an existing non-metacharacter and turned it into a meta character, then I would have to release regex 2.0.

The concern I have with "won't change in a semver compatible release" is that the author of a regex is often disconnected from the library versioning process (i.e. they're writing a regex but not writing Rust code). If version 2.0 of the regex crate introduces ; as a new metacharacter, the author of a regex seeking to match ; as a literal needs a way to identify whether the Rust software they're using is compiled with a 1.x crate or a 2.x crate. With something like ripgrep, an intrepid shell scripter who notices that their "match end of Java statements" pattern breaks on the new version of rust can inspect rg --version. But the user of a Rust-based online service (maybe it's a configuration for matching lines from log files in the cloud) might not have a similar facility available. If the regex author was able to apply the simple heuristic "Escape any punctuation character in a literal, since backslash-punctuation won't later become a metacharacter" then they can future-proof their regex with \;. Alternatively, and compatible with your "unknown escapes are a syntax error" position, if they know that all future Rust regex metacharacters will start with a \ (even with semver changes) then they can leave all non-current metacharacters alone and know that it won't break in the future, even if their software vendor updates something deep in the bowels of their system.

As you mention, every regex engine has differences. And if the Rust regex crate made its escaping behavior match other regex engines (which one?), then it would do nothing to fix "formerly valid user input would suddenly break." It might fix it for some user inputs, but not all, and it never will. (Here's another example.)

I'm hypothesizing that allowing over-escaping would satisfy the larger side of an 80%/20% Pareto split on regular expressions in the wild. "I'm not sure what this character does, so I'll escape it" is, I suspect, a more likely regex author instinct than features like "quantifiers can be unescaped if they don't follow an atom".

A corollary of this principle is that any new regex syntax features in Perl-inspired dialects will either use a punctuation character without a backslash or will use a backslash with a letter.

... no, that is absolutely not true! Consider % in PCRE2:

That's a superfluous escape because % is not a meta-character despite being a punctuation character. If PCRE2 decided to turn it into a meta-character, the only option available to them without making a breaking change is to add a new flag in the syntax that changes the meaning of % (or \%).

I haven't been able to find documentation on syntax stability in PCRE, but I think the general understanding is that \% would never be introduced as a metacharacter, but it would be possible for % to be introduced as a breaking change in a future version, which is why regex authors are inclined to apply heuristics like "escape all literal punctuation." (Wikipedia points out this escaping feature but that claim isn't sourced.) In practice, it seems most advances in Perl-inspired regex formats have used backslash-letter forms, suggesting that a commitment to freeze the set of bare metacharacters may be a safe bet. And continuing to reject unknown backslash-alphanumeric sequences is wise.

The latter is an error, which means the regex crate has the freedom to change the meaning of \% in a future semver compatible release without breaking users. Why? Because it's not considered a breaking change to increase the size of the set of valid regexes. It is a breaking change to decrease the size of the set or change the meaning of a regex already in the set.

I'm more worried about the migration path between semver-breaking changes. If \% were allowed in 1.x then a new behavior for % in version 2.0 could be pre-announced and authors could change their bare % characters to \%. But without that, it becomes a two-stage process: authors must wait for a new 1.y release in order to migrate their %-using patterns while holding off upgrades that introduce a 2.0 dependency until that migration is done. And if the software they're using upgrades the crate straight from 1.x to 2.0 (without an intermediate depends-on-1.y version) then they need to change all of their regexes in the same atomic update as the one upgrading their dependency.

Every regex engine that I've experimented with, except Rust and Vim[1]

You mention POSIX as an example of something that isn't like Rust or Vim, but POSIX BREs also have "different" escaping rules. It is only POSIX EREs that are similar to things like Perl and PCRE.

Yeah, I attempted to address POSIX BREs in my second footnote, but could have been a lot clearer. A better way to express this is that BREs and Vim are the only other regex dialects I've encountered where there are metachars of the form backslash-punctuation. (And I think that Vim's behavior here is meant to preserve BRE compatibility with ex/ed.) All other regex dialects I've encountered have taken the position (though often implicitly) that metacharacters are either bare punctuation or backslash-alphanumeric. (Vim also provides a very-magic syntax mode so that the same escaping principle applies.) Additionally, Rust is the only regex dialect I've encountered which does not permit escaping punctuation that would not otherwise have special meaning. That is, other dialects (including BRE and Vim) all treat \; as a literal semicolon, even though BRE and Vim don't treat \( and \| as literal parenthesis and vertical bar. (Vim does provide a very-magic syntax mode so that the same backslash-punctuation-is-safe escaping principle applies and bare ( and | have their standard special meanings.)

To reiterate, while I think the "backslash any punctuation" feature is desirable in a regex engine (it makes patterns more portable and reduces the "know what language my software is written in" burden on users), I think a commitment to not introduce new bare metacharacters even with a semver change is also a reasonable solution.

flwyd avatar Nov 02 '22 06:11 flwyd

The author of a regex doesn't necessarily have the ability to call Rust code, as in the example of user input to rg. In my case, I'm not programming in Rust, I'm programming in Vimscript and escaping user input so that it can be passed to whatever version of ripgrep happens to be installed on the user's system.

You don't need to program in Rust to run a Rust program (or any program) that escapes a regex.
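As a rough illustration of that point, the escaping step can live in a tiny standalone program that a Vimscript or shell wrapper calls before handing the pattern to rg. The sketch below uses only the standard library; `escape_literal` is a hypothetical name, and the metacharacter list is an assumption that should be checked against the crate's documentation (the crate's own `regex::escape` is the authoritative route when calling Rust directly).

```rust
/// Hypothetical helper: backslash-escape only the characters assumed to be
/// metacharacters and leave every other character untouched.
fn escape_literal(text: &str) -> String {
    // Assumed metacharacter set; verify against the regex crate docs.
    const META: &str = r"\.+*?()|[]{}^$#&-~";
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        if META.contains(c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}
```

Because only known metacharacters get a backslash, a semicolon or slash in the input passes through bare, so the result never contains an escape the parser would reject as unrecognized.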

Elevating this metacharacter documentation to the top-level regex syntax docs would be a big step towards addressing my concern.

Yes, docs can always be improved. But it's also kind of weird to call this out, because it follows from semver. It would be like adding documentation that says, "we promise we won't flip the meanings of * and + in a semver compatible release." Like, sure, we could add that. And it would be true. And people would likely feel very relieved that such a crazy thing wouldn't happen. But... there's an unbounded number of crazy breaking changes that won't happen in a semver compatible release.

The concern I have with "won't change in a semver compatible release" is that the author of a regex is often disconnected from the library versioning process (i.e. they're writing a regex but not writing Rust code). If version 2.0 of the regex crate introduces ; as a new metacharacter, the author of a regex seeking to match ; as a literal needs a way to identify whether the Rust software they're using is compiled with a 1.x crate or a 2.x crate. With something like ripgrep, an intrepid shell scripter who notices that their "match end of Java statements" pattern breaks on the new version of rust can inspect rg --version. But the user of a Rust-based online service (maybe it's a configuration for matching lines from log files in the cloud) might not have a similar facility available. If the regex author was able to apply the simple heuristic "Escape any punctuation character in a literal, since backslash-punctuation won't later become a metacharacter" then they can future-proof their regex with \;. Alternatively, and compatible with your "unknown escapes are a syntax error" position, if they know that all future Rust regex metacharacters will start with a \ (even with semver changes) then they can leave all non-current metacharacters alone and know that it won't break in the future, even if their software vendor updates something deep in the bowels of their system.

I don't share this concern for several reasons:

  1. There are no current plans for a regex 2.0.
  2. If regex 2.0 exists, and if a dependent exposes the syntax as part of their API, then the responsibility falls on them to manage that migration, if at all.
  3. The actual likelihood of making a new punctuation character a meta-character is extremely low, bordering on zero. There are no active discussions proposing such a thing, and if a new meta-character was desired, it's very likely that it will be niche and thus more than okay to introduce as an escape sequence.

When you combine all of that together, this just comes across as tilting at windmills to me.

I'm more worried about the migration path between semver-breaking changes. If \% were allowed in 1.x then a new behavior for % in version 2.0 could be pre-announced and authors could change their bare % characters to \%. But without that, it becomes a two-stage process: authors must wait for a new 1.y release in order to migrate their %-using patterns while holding off upgrades that introduce a 2.0 dependency until that migration is done. And if the software they're using upgrades the crate straight from 1.x to 2.0 (without an intermediate depends-on-1.y version) then they need to change all of their regexes in the same atomic update as the one upgrading their dependency.

This doesn't need to be considered in the abstract. If a regex 2.0 was released that turned % into a meta-character, then there would be a 1.x release that made it legal to escape %.

I think a commitment to not introduce new bare metacharacters even with a semver change is also a reasonable solution.

The only rock solid commitments I'm comfortable making are:

  1. I won't introduce breaking changes (according to Rust's API evolution RFC) in semver compatible releases.
  2. Semver breaking releases will be done conservatively, if at all.

With that said, I think it is very unlikely that new bare meta-characters will be introduced. Regardless of escaping behavior, introducing a new meta character in an established syntax that isn't opt-in is a very very disruptive change. And in order to do it, there would have to be a very compelling reason to do it.

The sort of breaking changes I imagine might happen to the syntax, if at all, are things like "^ and $ are now aware of Unicode line terminators by default."

Rust is the only regex dialect I've encountered which does not permit escaping punctuation that would not otherwise have special meaning.

I think we've come full circle. That's the entire point of this issue: to permit some no-op escape sequences. Maybe it does indeed make sense to permit them for all punctuation characters.

I feel like we just spent a lot of words only to agree that this issue is a good idea.....

BurntSushi avatar Nov 02 '22 13:11 BurntSushi

Elevating this metacharacter documentation to the top-level regex syntax docs would be a big step towards addressing my concern.

Yes, docs can always be improved. But it's also kind of weird to call this out, because it follows from semver. It would be like adding documentation that says, "we promise we won't flip the meanings of * and + in a semver compatible release." Like, sure, we could add that. And it would be true. And people would likely feel very relieved that such a crazy thing wouldn't happen. But... there's an unbounded number of crazy breaking changes that won't happen in a semver compatible release.

I think we may have different users and user journeys in mind here. I know little about the Rust language and almost nothing about the crate ecosystem. I did a Google search for "rust regex syntax" and read the regex and regex_syntax pages, looking for an answer to the question "what's the safest way to escape a literal, since the approach I'm used to in other languages doesn't work." As a newcomer it wasn't clear to me that semver semantics are in play or what the Rust community expectations are for user-facing programs when something like the regex library changes. In most languages with regular expression support in the standard library, changes to RE semantics are a byproduct of the runtime or compiler version, hence the desire of regex authors to escape defensively.

Regular expression syntax has a tendency to become exposed to end users in ways that something like a math library doesn't, so I think it's valuable for a language's regex docs to be a little more detailed so that it's accessible to language newcomers. The java.util.regex.Pattern javadoc provides a lot of detail, including what can and can't be escaped. The pcre man page is even longer and also explains the punctuation-escaping behavior.

"We won't swap the meaning of * and +" doesn't need to be stated because doing so would be very surprising in any context (semver or not) to anyone with even a cursory understanding of regular expressions. "There are punctuation characters which cannot be escaped", on the other hand, is surprising to many regex users (as evidenced by commenters on this issue), so it seems valuable to explain the assumptions that users can make about punctuation characters and escape sequences so they can resolve that surprise. (Similarly, a language specification doesn't need to state "we won't change the semantics of if, but might helpfully state that no future reserved word would include an underscore or capital letter, so coders can know how to name an identifier that's guaranteed not to be a syntax error in the future.)

I think we've come full circle. That's the entire point of this issue: to permit some no-op escape sequences. Maybe it does indeed make sense to permit them for all punctuation characters.

I feel like we just spent a lot of words only to agree that this issue is a good idea.....

I chimed in because I saw the issue had been open for four years and seemed to have gotten as far as "maybe we'll support some escapes some day", so I wanted to provide further support and motivation in the case for "all punctuation." Thanks for listening.

flwyd avatar Nov 03 '22 05:11 flwyd

Gotcha. That's all very fair. I've been heads down for the last few years focused on regex internals (and a toddler). I hope to get back to some of the user facing parts of the regex crate soon.

As of right now, it feels like just permitting all punctuation characters to be escaped is probably the best path forward. I'm still not a huge fan of no-op escapes because it does IMO result in regexes that are harder to read. For example, I've seen so many people write things like http:\/\/ that are totally unnecessary (in most contexts). I am also hesitant because a lot of people come at this with "well it works in other regex engines and I want to be able to use the same regexes I use in other regex engines here." But this seems like a bad thing, because this is just a surface level and trivial difference between regex engines. Assuming one regex will work the same across multiple regex engines is much more subtle than that.

... but it does seem like practical concerns probably carry the day here. Forbidding no-op escapes is unlikely to solve (or even make meaningful progress on) the problem of users assuming regex engines behave the same. And there are, after all, many ways to write a regex that is unclear.

One thing I did just remember: \< and \> are "left" and "right" word boundary assertions in POSIX EREs IIRC. And < is considered punctuation. And adding support for those is something that could conceivably happen. And if we allowed it as a no-op escape sequence, it could be confusing to folks that expect it to be an assertion. So I suppose we should exclude < and > from this, but perhaps all other ASCII punctuation should be escapable. (But I'll do a final audit before that happens.)
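The rule floated above (every ASCII punctuation character except < and >) is small enough to write down as a predicate. This is only a sketch of the proposal as stated in this comment, not the crate's implementation; `is_escapable_punctuation` is a hypothetical name, and the final set could differ after the audit mentioned above.

```rust
/// Hypothetical predicate: would `\c` be accepted as an escape under the
/// proposal, either as a no-op or as an escaped metacharacter?
fn is_escapable_punctuation(c: char) -> bool {
    // `<` and `>` are held back so that `\<` and `\>` remain available
    // for possible word-boundary assertions.
    c.is_ascii_punctuation() && c != '<' && c != '>'
}
```

Under this rule \/, \! and \; would all be accepted, while \< and \> would still be rejected.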

BurntSushi avatar Nov 03 '22 12:11 BurntSushi

I think many people would really enjoy an option to allow no-op escapes for compatibility. It doesn't have to be the default, just an option for those who need it. Our project uses a third-party project with regexes that we cannot change upstream (as it would break them for other people). So our only options are to use pcre2, which is slightly slower, or to filter out the escapes, which is kind of hackish and something we would like to avoid.
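For reference, the "filter out the escapes" workaround could look roughly like the sketch below, which drops the backslash from escapes whose target is punctuation with no special meaning. It uses only the standard library; `strip_noop_escapes` is a hypothetical name, and the `KEEP` set is an assumption about which punctuation must stay escaped (< and > are kept because \< and \> may be intended as word-boundary assertions upstream).

```rust
/// Hypothetical preprocessor: rewrite `\/` to `/`, `\!` to `!`, and so on,
/// while leaving meaningful escapes such as `\.` or `\d` untouched.
fn strip_noop_escapes(pattern: &str) -> String {
    // Assumed set of punctuation that must keep its backslash.
    const KEEP: &str = r"\.+*?()|[]{}^$#&-~<>";
    let mut out = String::with_capacity(pattern.len());
    let mut chars = pattern.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            // No-op escape: emit the bare character.
            Some(n) if n.is_ascii_punctuation() && !KEEP.contains(n) => out.push(n),
            // Meaningful escape: keep it exactly as written.
            Some(n) => {
                out.push('\\');
                out.push(n);
            }
            // Trailing backslash: leave it for the regex parser to reject.
            None => out.push('\\'),
        }
    }
    out
}
```

Consuming the character after each backslash as a pair means an escaped backslash (\\) is never mistaken for the start of another escape. The sketch ignores syntax modes in which additional characters become significant, so it should be checked against the actual upstream patterns before use.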

YamatoSecurity avatar Dec 01 '22 00:12 YamatoSecurity