cl-ppcre icon indicating copy to clipboard operation
cl-ppcre copied to clipboard

Named back-references validated too early

Open nbtrap opened this issue 11 years ago • 12 comments

CL-PPCRE validates register names too early in the regex compilation phase. The converter should wait until it has seen all register names before asserting that named back-references refer to existing registers.

(setf *allow-named-registers* t)
==> T

(scan "\\k<regOne>(?<regOne>.)" "a")
; Evaluation aborted on #<PPCRE-SYNTAX-ERROR "Illegal back-reference: ~S." {1006D50F93}>.

(scan "\\1(.)" "a")
==> NIL

Compare this to Perl:

say 'a' =~ /\k<regOne>(?<regOne>.)/ ? 'true' : 'false';
==> false

say 'a' =~ /\k<regTwo>(?<regOne>.)/ ? 'true' : 'false';
==> Reference to nonexistent named group in regex...

I plan on fixing this in my subpattern references branch, which is nearing completion.

nbtrap avatar Feb 21 '14 02:02 nbtrap

Why does perl's behavior make sense?

2014-02-21 3:26 GMT+01:00 nbtrap [email protected]:

CL-PPCRE validates register names too early in the regex compilation phase. The converter should wait until it has seen all register names before asserting that named back-references refer to existing registers.

(setf allow-named-registers t)==> T (scan "\k<regOne>(?<regOne>.)" "a"); Evaluation aborted on #<PPCRE-SYNTAX-ERROR "Illegal back-reference: ~S." {1006D50F93}>. (scan "\1(.)" "a")==> NIL

Compare this to Perl:

say 'a' =~ /\k<regOne>(?<regOne>.)/ ? 'true' : 'false';==> false say 'a' =~ /\k<regTwo>(?<regOne>.)/ ? 'true' : 'false';==> Reference to nonexistent named group in regex...

I plan on fixing this in my subpattern references branch, which is nearing completion.

Reply to this email directly or view it on GitHubhttps://github.com/edicl/cl-ppcre/issues/17 .

hanshuebner avatar Feb 21 '14 05:02 hanshuebner

I can think of two reasons:

  1. Perl allows you to incrementally build regexes. For example, you can do this:

    $rx1 = qr/\k<foobar>/
    $rx2 = qr/(?<foobar>.)$rx1/
    

    If backreference names were validated as early as they are in CL-PPCRE, the first qr would be impossible. However, CL-PPCRE doesn't have anything like this, so it's not relevant apart from maintaining Perl compatibility

  2. Subpattern references make forward and self-referential back references meaningful. For example, in Perl, you can do things like this:

    print 'abcb' =~ /^(.\k<regTwo>?)(?<regTwo>.)(?1)$/ ? "true\n" : "false\n"
    => true
    

    In that example, the \k<regTwo> fails on the first encounter, but the matching phase continues thanks to the ?. Then, inside the (?1) subpattern call, the \k<regTwo> refers to what the (?<regTwo>.) matched up on the call stack, namely, "b", and it succeeds. CL-PPCRE currently doesn't support subpattern references, but I'm nearly finished implementing them and will be submitting a merge request in the near future.

By the way, Perl itself doesn't always handle things like the second example above correctly, but the developers have confirmed that it's a bug.

nbtrap avatar Feb 22 '14 02:02 nbtrap

Crap. I didn't know CL-PPCRE tries to match named backreferences against every register with a given name. This complicates things. Here are the options as I see them:

  1. Extend the current behavior to include matching registers that come after the backreference.
  2. Break backwards compatibility, and behave like Perl.
  3. Leave things the way they are.

My favorite is 2. Do you think anyone relies on this behavior? We should try to conform to Perl as much as possible. (Come to think of it, what's the deal with *ALLOW-NAMED-REGISTERS*? Seems unnecessary. I'd be happy to remove it too.)

3 is a bad idea if we're going to merge subpattern references, so I think 1 is the other viable option.

nbtrap avatar Feb 22 '14 14:02 nbtrap

I have no opinion regarding this. Maybe Edi has one.

-Hans

2014-02-22 15:12 GMT+01:00 nbtrap [email protected]:

Crap. I didn't know CL-PPCRE tries to match named backreferences against every register with a given name. This complicates things. Here are the options as I see them:

Extend the current behavior to include matching registers that come after the backreference. 2.

Break backwards compatibility, and behave like Perl. 3.

Leave things the way they are.

My favorite is 2. Do you think anyone relies on this behavior? We should try to conform to Perl as much as possible. (Come to think of it, what's the deal with ALLOW-NAMED-REGISTERS? Seems unnecessary. I'd be happy to remove it too.)

3 is a bad idea if we're going to merge subpattern references, so I think 1 is the other viable option.

Reply to this email directly or view it on GitHubhttps://github.com/edicl/cl-ppcre/issues/17#issuecomment-35803448 .

hanshuebner avatar Feb 22 '14 14:02 hanshuebner

Are you in a position to ask him?

nbtrap avatar Feb 24 '14 01:02 nbtrap

I Cc'd him in my previous comment. Am 24.02.2014 02:22 schrieb "nbtrap" [email protected]:

Are you in a position to ask him?

Reply to this email directly or view it on GitHubhttps://github.com/edicl/cl-ppcre/issues/17#issuecomment-35851033 .

hanshuebner avatar Feb 24 '14 05:02 hanshuebner

Named registers came in late and were modeled after AllegroCL's named registers, not after Perl's.

I don't think ALLOW-NAMED-REGISTERS is unnecessary. Changing its value obviously changes the semantics of some regular expressions.

While CL-PPCRE is based on Perl's syntax, I wouldn't get out of the way to copy each of its features. There are several regex variants out there which are close to but not similar to what Perl does and people seem to be able to cope with it.

I generally think that being backwards compatible is a good thing. Just because you haven't used a feature, that doesn't mean nobody else has. In my private code repository I count almost 100 commercial Lisp projects I've done in the last decade (and that doesn't include code with its own repository). Much of this code is still in use and sometimes there are requirements to extend it or to fix something. I'd rather have code that, say, "uses CL-PPCRE" and not "uses CL-PPCRE, but can only use it up to version x.y.z".

Having said that, the most important feature for me in case of any changes would be clear documentation.

nhabedi avatar Feb 26 '14 08:02 nhabedi

To be clear, I wasn't saying named registers are unnecessary. What I am saying is that it's unnecessary to have to bind a variable to use them. Deprecating the *ALLOW-NAMED-REGISTERS* variable would not break backwards compatibility. It would merely make using named registers easier.

Rather, the feature that I was talking about changing was this idea that named back-references can refer to multiple registers, whereas numbered registers have only one referent. The whole point of named registers, in my mind, is to help more clearly disambiguate registers when the regex contains so many as might otherwise confuse the person reading the code. That being the case, I don't think named registers should have different semantics from regular registers.

I agree with what you say about not having to be like Perl in every respect. In implementing subpattern references, I discovered that Perl does not have clearly defined semantics for that feature. I'm trying to convince them to adopt the definition I've implemented (they seem rather open to it), but if they reject it, I'll still probably stick with what I already have.

nbtrap avatar Feb 26 '14 11:02 nbtrap

Deprecating ALLOW-NAMED-REGISTERS would break backwards compatibility in two ways:

  1. Every piece of code that uses CL-PPCRE and sets ALLOWED-NAMED-REGISTERS to a true value has to be changed because the variable doesn't exist anymore.
  2. As I already said, the semantics of some regular expressions will be changed. Try to parse "\k" (from the documentation) for example with and without ALLOWED-NAMED-REGISTERS being NIL.

As to your second point: With the current behavior, you can for example use a regex like "(?[1-3])..(?[4-6])\" to match "12341" as well as "12344". I don't think I've ever used that, but my experience from 15 years of open source development tells me that just because you and I can't come up with a use case for this feature doesn't mean nobody else ever came up with one... :)

nhabedi avatar Feb 26 '14 12:02 nhabedi

  1. I don't want to remove *ALLOW-NAMED-REGISTERS*, I'm just suggesting allowing named registers regardless of its value.
  2. Yes, deprecating *ALLOW-NAMED-REGISTERS* might break anomalous code that relies on PARSE-STRING throwing an exception because *ALLOW-NAMED-REGISTERS* is not bound to T. I think that's a stretch though. (Technically, you could argue that such code is relying on a non-api anyway. There's nothing in the documentation that describes how *ALLOW-NAMED-REGISTERS* is handled. It only says that binding it to T will allow you to use named regsiters.)
  3. I don't understand the point you try to make at then end. Check your syntax. I think you meant to write something else.

nbtrap avatar Feb 26 '14 13:02 nbtrap

That said, I don't want to get carried away with the *ALLOW-NAMED-REGISTERS* bit. This issue is about how CL-PPCRE handles named backreferences.

nbtrap avatar Feb 26 '14 13:02 nbtrap

I have updated my previous comment. I forgot a couple of backslashes for GitHub's markdown.

Otherwise, I've said what I wanted to say on this issue. The discussion is getting a bit too ridiculous for my taste.

nhabedi avatar Feb 26 '14 15:02 nhabedi