regex icon indicating copy to clipboard operation
regex copied to clipboard

syntax: accept `{,n}` as an equivalent to `{0,n}`

Open plusvic opened this issue 2 years ago • 4 comments

Most regular expression engines don't accept the {,n} syntax, but some other do it (namely Python's re library). This introduces a new parser configuration option that enables the {,n} syntax as equivalent to {0,n}. This option is disabled by default and not exposed in the regex create.

I understand this change may imply a deviation from the goals of the regex-syntax create. If the purpose of the regex-syntax crate is exposing the parser used by the regex create and supporting a very specific regex syntax flavor, then this PR can be closed. However, I'm bringing this PR into consideration because I'm using regex-syntax in project that requires support for the {,n} syntax for backward-compatibility with existing regular expressions. The alternative for me is maintaining a fork of regex-syntax forever.

Also, I think empty_min_range is not a very good name, but I couldn't come up with a better name. Better naming alternatives are welcomed, even if this PR is not finally merged.

plusvic avatar Sep 07 '23 17:09 plusvic

The alternative for me is maintaining a fork of regex-syntax forever.

Out of curiosity, is this the only additional syntax you need to support and everything else is in sync? I would be surprised if so.

I ask this because if this is literally the only thing, then I can see how "maintain a fork of regex-syntax forever" would be a somewhat significant motivation to merge this or something like it. But if you're going to need to maintain a fork forever anyway because of other incompatibilities, then it might make more sense to just close this.

Another possible route here is to just unconditionally enable this and support it in regex proper. Namely, {,n} is invalid syntax today. I think the reason why I made it invalid despite supporting {n,} syntax is that in the latter version, it's not possible to express otherwise (since it's unbounded). But in the former version, you can write an equivalent {0,n} which I think reads better.

BurntSushi avatar Sep 07 '23 17:09 BurntSushi

Actually there are a few other incompatibilities, but either they are less prevalent in real-life regexps, or they are actually benefitial in the long term.

One case is the support for the unscaped { character in a context where it is not a repetition. In other regexps engines this regexp is ok {abc}, { and } behaves here as a literals. And that's the behaviour my legacy program had. However, I'm observing that are a lot of places where people are using { as literals in cases where they didn't intent it, with regexps like (foo){1-3}, where the real intention was (foo){1,3} (but they didn't notice because (foo){1-3} is perfectly valid). So, being more strict and forcing { to be escaped makes sense.

In the case of {,n} however, it's much more widespread, and there's no real benefit in dropping its support. Users will consider it a nuisance, and there's no good argument in favor of the removal. That's why I decided to support it. But my plan is using regex-syntax with as few changes as possible, and so far this is the only change I've identified.

Maybe we can simply hold on this change until I advance a little more. If I eventually need some changes like this one, then I must go through the fork path, if not, I can retake this PR and even provide the full implementation for {,n} in regexp if you think it makes more sense. I was so used to Python's regexp that I got suprised that this is not supported by most of the regexp engines out there.

plusvic avatar Sep 07 '23 18:09 plusvic

Yeah I mean I would argue that the benefit here is that {0,n} is easier to read. But I understand the nuisance argument. That's ultimately why I ended up relenting and allowing superfluous escapes of non-meta punctuation characters (e.g., \/ used to be invalid syntax). As far as things like (foo){1-3} go, yeah things like that were on my mind by requiring that { be escaped.

I like the idea of holding off until you advance a bit more. If this turns out to literally be the only thing you need then I might be able to get on board with that.

BurntSushi avatar Sep 07 '23 19:09 BurntSushi

@BurntSushi I would like to rescue this PR for your consideration again. After using regex-syntax for a while I haven't identified any additional change that I would need. The only feature that I like to include is support for the {,n} syntax, either as an opt-in feature or as the default behaviour.

This PR adds support for {,n} as an opt-in via ParserBuilder::new().empty_min_range(true), but if you prefer I can remove the configuration option and make it the default. I'm still unconvinced about the empty_min_range name (maybe allow_empty_min_range is better?)

A side-effect is that some errors change from RepetitionCountDecimalEmpty to RepetitionCountUnclosed (even when this feature disabled), but I would say that it's also a good thing, as RepetitionCountUnclosed describes the situation better when the regex contains something like foo.{.

plusvic avatar Feb 01 '24 11:02 plusvic

@BurntSushi any comments on this?

plusvic avatar Mar 15 '24 16:03 plusvic

Hi @BurntSushi,

I would like to publish the crate I was working on, which depends on a modified version of regex-syntax that includes this change. So far, I was using a git dependency in my Cargo.toml file, pointing to my clone of this repository. However, I just discovered that I can't publish a crate that has git dependencies, as all published crates must depend on crates that are also published in crates.io. So, either I publish my own clone of regex-syntax (which doesn't make too much as it would clutter crates.io), or I include the whole regex-syntax source code inside my own crate, which I would like to avoid if possible.

If you are ok with merging this change that would save my day, if not, I think next my best option is including regex-syntax as part of my own crate. I just need some feedback from you so I can make an informed decision on how to proceed.

Not being allowed to publish a crate that depends on some public GitHub repository (even if not published in crates.io) was quite unexpected too me :(

plusvic avatar Mar 26 '24 17:03 plusvic

Yeah, sorry about my absence here. I've been focused on other things. I think my main decision point here is whether I just want to enable it for everyone (so don't make it conditional), or to keep it as you have here via an option. I do kind of feel like overall the syntax {,n} is an aberration and it would be better to not have it, which means I think it should be an option like you have here.

BurntSushi avatar Mar 26 '24 17:03 BurntSushi

This is on crates.io in regex-syntax 0.8.3.

BurntSushi avatar Mar 26 '24 17:03 BurntSushi

That was quick! Thank you so much.

plusvic avatar Mar 26 '24 18:03 plusvic