perl5 fixup many regexp bugs - not to be merged yet

This fixes a number of longstanding regexp bugs and issues. It includes debugging changes and other stuff i will rebase away.

The main point is that we use a two layer model to represent capture buffers, and implement a concept of "commiting" the changes, and we store additional data in the CURLY nodes so we can manage the parens they contain properly. This is likely not the most efficient way to do this. The idea was to be correct first, then optimize.

The intended semantics are that: when we have a quantified group we keep track of the parens that it contains, and we ensure that all capture buffers it does contain come from a single iteration of the quantifier.

Consider

./perl -le'"barfoobar"=~/((foo)|(bar))+/ and print "$1-$2-$3=$&";'
bar--bar=barfoobar

In older perls:

perl -le'"barfoobar"=~/((foo)|(bar))+/ and print "$1-$2-$3=$&";'
bar-foo-bar=barfoobar

You can see that in older perls that $2 is from the second iteration, and $3 is from the third.

This patch currently doesn't entirely fix this (yet):

./perl -le'"abcaba"=~/((a)(b)(c)|(a)(b)|(a))+/ and print "$1|$2-$3-$4|$5-$6|$7=$&";'
a|a--|a-|a=abcaba

where arguably $2 and $5 should be empty, as they werent part of the last succesful match. Currently I have only fixed CURLYX -> CURLYM. And what should happen here is a bit controversial, you could argue this is correct.

However even for this case we have a change from older perls:

perl -le'"abcaba"=~/((a)(b)(c)|(a)(b)|(a))+/ and print "$1|$2-$3-$4|$5-$6|$7=$&";'
a|a-b-c|a-b|a=abcaba

which really highlights the underlying bug, we have capture buffers here from the first, second and third iteration of the +. IMO that doesn't make sense, and it is inconsistent with the behavior of capture buffers in other contexts.

This PR patch also fixes some other related issues, for instance the min and max have been changed and REG_INFINITY is now I32_MAX, as opposed to U16_MAX. This means that CURLYM and CURLYX now behave the same.

This is intended to fix https://github.com/Perl/perl5/issues/20661 and related bugs.

Fixes #20661 Fixes #8267 Fixes #19615 Fixes #18865 Fixes #14848 Fixes #13619 Fixes #10073

Jan 06 '23 16:01 demerphq

https://github.com/Perl/perl5/issues/20661

Jan 06 '23 16:01 demerphq

This now works as I believe is expected, and matches PCRE etc. I am pretty sure it is nothing as elegant and efficient as it could be using the regexp state machine, but I do believe it is now correct. You have to start somewhere. :-)

The important gist of the changes are:

The offset data has been doubled to hold a "start/end" and "start_new/end_new" pairs. OPEN/CLOSE etc update start_new/end_new. All of the buffer read logic has been changed to check both, first the "_new" variants, then the standard ones. At the and of a successful iteration of the quantifier we "commit" the state of the _new data to the standard data. Prior to a new iteration we clear the _new data. All CURLY style operators now store the min/max as I32's and they store the "first paren" and "last paren" of their contents.

All branches now store the number of parens before them, and contained by them. As do tries. After a failed branch we clear the _new data for the parens it contains. (As well as whatever else we did historically).

There are a few other minor changes embedded into these patches which i will tease out, but that is the main point. @iabyn if you have suggestions on how to make it more efficient I would love to hear them.

Jan 07 '23 18:01 demerphq

@jkeenan why did you add a "do not merge" to this /draft/ PR? Doesn't that kinda defeat the point of it being draft?

Jan 07 '23 22:01 demerphq

On 1/7/23 17:04, Yves Orton wrote:

@jkeenan https://github.com/jkeenan why did you add a "do not merge" to this /draft/ PR? Doesn't that kinda defeat the point of it being draft?

I don't think you need "do not merge" in the Subject. But do whatever you want.

Jan 07 '23 22:01 jkeenan

@kid51 the point is that is a note to me, before I change it from draft ill probably split some of those patches apart, and I will certainly remove a bunch of debug crap. I wont change it from draft until I have done so.I thought do-not-merge is for ready PR's that we are sitting on till a later release cycle. Anyway, it doesn't matter, it just struck me as redundant.

Jan 07 '23 23:01 demerphq

I've had a look over it, but I'm not especially familiar with the regexp engine.

Jan 09 '23 00:01 tonycoz

I've had a look over it, but I'm not especially familiar with the regexp engine.

Your feedback is already super useful, thanks. I need to think on what to do about (??{{ }}) and friends. I had hoped this trick with the extra block would Just Work. Now I need to think about this. How important is the local thing? If someone is using (?{{}}) maybe they dont want to use local. I have a feeling this is hopeless optimism and I just need to figure out an alternative. Which means messing with the toker, which I had really wanted to avoid. That code is gnarly.

Jan 09 '23 08:01 demerphq

On Sat, Jan 07, 2023 at 10:01:15AM -0800, Yves Orton wrote:

The offset data has been doubled to hold a "start/end" and "start_new/end_new" pairs. OPEN/CLOSE etc update start_new/end_new. All of the buffer read logic has been changed to check both, first the "_new" variants, then the standard ones. At the and of a successful iteration of the quantifier we "commit" the state of the _new data to the standard data. Prior to a new iteration we clear the _new data. All CURLY style operators now store the min/max as I32's and they store the "first paren" and "last paren" of their contents.

I have zero enthusiasm to review this currently. This is because its a big patch, with practically no explanation of what fundamental problems it's trying to fix, and what its doing to fix them.

-- The optimist believes that he lives in the best of all possible worlds. As does the pessimist.

Jan 09 '23 11:01 iabyn

@iabyn - fair enough. I thought I had explained well enough in the PR documentation. I didnt bother in the commits because I am fairly certain I am not doing things efficiently and I would like your advice on how to do them efficiently so i expect to rip them up after I get some of your advice. What I have written in the text to this PR is more relevant generally than the code. But clearly I have not explained myself sufficiently.

I am trying to solve the issues of clearing capture buffers properly, and setting them properly inside of quantified expressions, especially I am trying to resolve the issues that come from CURLYX and CURLYN and CURLYM not being equivalent. The code says that converting CURLYX to CURLYM or CURLYN is a "powerful optimization", which is correct in a sense, but it is not purely an optimization as it turns out CURLYM and CURLN do things "right" as far as capture buffers go, CURLYX does not. Which is problematic, as we disable these optimizations under various circumstances. Meaning a minor change to the pattern can result in the state of the capture buffers afterwards being different.

Essentially there are two core bugs here. They both are solved using the same underlying machinery, teaching the relevant regops to know how many parens are bfore them, and how many are defined within them, a macro called CAPTURE_CLEAR() and a macro called CAPTURE_COMMIT(), and the addition of "transactional transparent capture buffers" which are committed only when a quantifier iteration is successful.

BUG 1

perl -le'"foobarfoo"=~/((foo)|(bar))+/ and print "$1-$2-$3=$&"' #wrong
foo-foo-bar=foobarfoo

This should not leave $2 AND $3 set at the same time. Either $3 should be undef, and $1 should equal $2. Or, $2 should be undef and $1 should equal $3. It should work like the branch does:

./perl -le'"foobarfoo"=~/((foo)|(bar))+/ and print "$1-$2-$3=$&"' #right
foo-foo-=foobarfoo

SOLUTION 1 I have added the idea of "temporary offset data" which is "committed" when a WHILEM or END executes.. And I have taught CURLY style regops to store the number of parens before them, and the number which are contained by them so that the CURLY/WHILEM can reset them appropaitely each loop.

For 1 this means that struct regexp_paren_pair gets two new members, start_new and end_new. Like this:

typedef struct regexp_paren_pair {
    SSize_t start;
    SSize_t end;

    SSize_t start_new;
    SSize_t end_new;

    /* 'start_tmp' records a new opening position before the matching end
     * has been found, so that the old start and end values are still
     * valid, e.g.
     *    "abc" =~ /(.(?{print "[$1]"}))+/
     *outputs [][a][b]
     * This field is not part of the API.  */
    SSize_t start_tmp;
} regexp_paren_pair;

The logic that reads offset data has been changed to know about the start_new,end_new variants, and to check them first. When they are -1 the code then looks at the start,end pair. Thus the _new buffers are "transparent", when not set you can see the old value. They are transactional because the OPEN/CLOSE logic writes into the _new buffers, and the value of the _new buffer is only "committed" to the standard buffer at certain times (WHILEM, END and GOSUB).

Each loop of a quantifier is treated as a transaction. Either it succeeds in which case the state is committed, or it fails in which case it is cleared. The overall pattern is treated as a loop as with 1 iteration, so END is also a commit.

So for instance in "foobarfoo"=~/((foo)|(bar))+/ we first match foo, so offs[2].start_new = 0, offs[2].end_new=3, we hit the end of the + loop, and commit the data for $2 and $3, so now offs[2].start = 0, and offs[2].end = 3 as well, offs[3] stays -1 in both data. We then clear offs[2].start_new and offs[2].end_new by setting them to -1. We execute the loop again, and this time we match 'bar', so, we end up committing $2=(-1,-1) $3=(3,7);, We execute the loop again, and this time we mach 'foo' again, and so we end up committing $2=(7,10), $3=(-1,-1). We execute it a third time, which fails to match either, and we are done.

With this CURLYX and CURLYM and CURLYN behave the same.

BUG 2 The other problem I am trying to solve is capture buffers from multiple branches of a pattern being set after an alternation. The capture buffers defined in different branches of an alternation should be mutexed against each other (leaving aside branch reset patterns). So this is wrong:

perl -le'"abcaba"=~/((a)(b)(c)|(a)(b)|(a))+/ and print "$1|$2-$3-$4|$5-$6|$7=$&"'; # wrong
a|a-b-c|a-b|a=abcaba

This should result in either $2 $3 and $4 being set, or $5 and $6, or $7. And $1 should either equal "$2$3$4" or "$5$6" or $7. Like this:

./perl -le'"abcaba"=~/((a)(b)(c)|(a)(b)|(a))+/ and print "$1|$2-$3-$4|$5-$6|$7=$&"'; # right
a|--|-|a=abcaba

SOLUTION 2 I have added the idea that a BRANCH node should clear the buffers it contains when the branch fails. So I have taught the BRANCH logic to store the number of parens before them, and then number which are contained by them as well.

For problem 2, we do something similar. Each BRANCH knows how many capture buffers are contained in its branch, and on failure clears those capture buffers (again by using CAPTURE_CLEAR() and clearing the _new state before they execute their payload. so in /(a)(b)(c)|(a)(b)|(a)/ we end up with: BRANCH(0,3) OPEN1 ... OPEN2 ... OPEN3 BRANCH(3,5) OPEN4 ... OPEN5 ... BRANCH(5,6) OPEN6 ... END

So the first branch when it fails clears buffer 1..3, the second when it fails clears 4..5, and when the third branch fails it clears 6.

The end result is that the state of capture buffers in branches is mutexed.

WHAT I NEED FROM YOU Everything else in this PR is a minor fix and I don't need your help with. What I really need is your insight into how to exploit the state machine to do the above stuff efficiently, and not have to loop over all the capture buffers for instance to clear or commit them. Or any other suggestions you might have to resolve the above bugs. I am satisfied my PR has the correct semantics and consistent behavior, I just need advice on how to make it efficient. I think maybe I can get rid of a bunch of this patch, maybe even most of it by using the information I have added to the regops in terms of buffers before and contained, to update maxopenparen properly. But I am not entirely sure if i can, GOSUB and other non-linear pattern constructs seem to suggest that wont work. Consider "abac" /((a)|(b))+(c)/, you cant just set maxopenparen to 2 after the last iteration of the loop, as when (c) matches it will get set back to 4, and if the buffer for $3 has not been cleared the value from the previous iteration of the quantifier loop will be visible again.

Jan 09 '23 12:01 demerphq

I have updated the PR by breaking up the main changes in the PR. The two patches I would really like some feed back from @iabyn are https://github.com/Perl/perl5/pull/20677/commits/ff43150a169745a295340ff18960efbbfa839be4 and https://github.com/Perl/perl5/pull/20677/commits/99429446d08cb71cbcb3eb05d36d88e6a95a0c86

Jan 10 '23 00:01 demerphq

@PhilipHazel could you please check this over? This is an attempt to make Perls capture buffers behave more consistently. It hopefully brings PCRE and Perl into agreement in more cases.

Jan 11 '23 09:01 demerphq

Patches from this PR have been pushed independently as: https://github.com/Perl/perl5/pull/20703 https://github.com/Perl/perl5/pull/20704 https://github.com/Perl/perl5/pull/20705 https://github.com/Perl/perl5/pull/20706 https://github.com/Perl/perl5/pull/20707

As they get merged I will rebase them away from this PR. For now they are all bunched together.

Jan 14 '23 23:01 demerphq

This looks like a really nice enhancement.

The only thing I saw is that some of the values are cast, either redundantly, or inconsistently. So arg= (U32)ARG1u(scan); would be neater without the cast, and utmp = (U32)ARG2i(scan);

The former can obviously be removed. The latter however I am not sure about. So on one hand, why not just use the struct api. On the other hand, we have cases where something is being stored as a signed value, and then read back as a signed value and cast to match an external variable, presumably because they can rule out the negative cases. It is not clear to me that we should change these or not. You could argue that if the creator of the object thought the value was signed, or unsigned, then the fetcher should use the corresponding accessor, and the cast is actually a good thing.

I cant decide really. Maybe in the end it doesn't matter much.

Jan 15 '23 18:01 demerphq

@demerphq

WHAT I NEED FROM YOU Everything else in this PR is a minor fix and I don't need your help with. What I really need is your insight into how to exploit the state machine to do the above stuff efficiently, and not have to loop over all the capture buffers for instance to clear or commit them. Or any other suggestions you might have to resolve the above bugs. I am satisfied my PR has the correct semantics and consistent behavior, I just need advice on how to make it efficient.

There is a way to reset the capturing buffers in O(1) upon backtracking and avoid calling memset(). Instead of writing the start and end positions of a captured substring in a random access manner at their "final" index slot in the capture array, it would simpler to use the capture array using a stack discipline, sort of.

Besides the capture array base and capacity fields, it would require a top field and a current field. Both of them would need to be saved in a newly pushed choice point/backtracking frame.

OPEN could simply push a new capture struct on the catpure stack and store the start position in it as if every alternative was inside a branch reset group (?| () | () ). Additionally, OPEN would also write the real capture number in the newly pushed capture struct. The real capture number would be stored in the OPEN regop struct.

Upon backtracking, it would suffice to reset top and current to their previous values. Then new capturing groups could safely override previous ones.

At the end of a successful match, the array @{^CAPTURE} would be constructed with each capture group stored at their correct array indexes. Similarly with $1, $2, etc..

The description above is incomplete and awkward to fully put into words so here is the complete semantic description:

# working example
1 2 2 3 3   4 4 1  # capture group real indexes
( ( ) ( ) | ( ) )
0 1 1 2 2   1 1 0  # capture array indexes

# order in which capture structs are written (considering no backtracking occurs)
( 0 1 2
) 1 2 0

REGEX TRY INIT {
    top = -1;
    current = -1;
}

// The variable pos refers to the current position in the string being matched.

OPEN(n) {
    top++;
    capture[top].start = pos;
    capture[top].end = -1; // in order for (*ACCEPT) to know that the capturing group has not been closed yet
    capture[top].num = n;
    capture[top].prev_current = current;
    current = top;
}

CLOSE {
    capture[current].end = pos;
    current = capture[current].prev_current;
}

BACKTRACK {
    top = last_choice_point->top;
    current = last_choice_point->current;
}

ACCEPT { // (*ACCEPT)
    for (int i=0; i <= top; i++) {
        if (capture[i].end == -1) {
            capture[i].end = pos;
        }
    }
}

May 08 '24 03:05 florian-pe

perl5 perl5 copied to clipboard

fixup many regexp bugs - not to be merged yet

perl5
perl5 copied to clipboard