perl5
perl5 copied to clipboard
RegEx pointless backtracking between non-overlapping sub-patterns
https://github.com/MasterInQuestion/coordTransform/commit/dabc782b3e200d81210032f4dc6222b344e13d0b
.
https://github.com/Perl/perl5/issues/22636#issuecomment-2387520738
“Backtracking past "d" for "e" alike, would be entirely pointless: for non-overlapping impossible to match.
Alike in linked past "-" (or ".") for "\d" etc.”
Test case: [[
> perl -e "'deEfficiency' =~ /d?(?:eficiency|[eE]fficiency)/;" -M"re"="debug"
Compiling REx "d?(?:eficiency|[eE]fficiency)"
Final program:
1: CURLY{0,1} (7)
5: EXACT <d> (0)
7: BRANCH (buf:0/0) (13)
9: EXACT <eficiency> (22)
13: BRANCH (buf:0/0) (FAIL)
15: ANYOFM[Ee] (17)
17: EXACT <fficiency> (22)
21: TAIL (22)
22: END (0)
minlen 9
Matching REx "d?(?:eficiency|[eE]fficiency)" against "deEfficiency"
0 <> <deEfficien> | 0| 1:CURLY{0,1}(7)
| 0| EXACT <d> can match 1 times out of 1...
1 <d> <eEfficienc> | 1| 7:BRANCH (buf:0/0)(13)
1 <d> <eEfficienc> | 2| 9:EXACT <eficiency>(22)
| 2| failed...
1 <d> <eEfficienc> | 1| 13:BRANCH (buf:0/0)(21)
1 <d> <eEfficienc> | 2| 15:ANYOFM[Ee](17)
2 <de> <Efficiency> | 2| 17:EXACT <fficiency>(22)
| 2| failed...
| 1| BRANCH failed...
0 <> <deEfficien> | 1| 7:BRANCH (buf:0/0)(13)
0 <> <deEfficien> | 2| 9:EXACT <eficiency>(22)
| 2| failed...
0 <> <deEfficien> | 1| 13:BRANCH (buf:0/0)(21)
0 <> <deEfficien> | 2| 15:ANYOFM[Ee](17)
| 2| failed...
| 1| BRANCH failed...
| 0| failed...
1 <d> <eEfficienc> | 0| 1:CURLY{0,1}(7)
| 0| EXACT <d> can match 0 times out of 1...
1 <d> <eEfficienc> | 1| 7:BRANCH (buf:0/0)(13)
1 <d> <eEfficienc> | 2| 9:EXACT <eficiency>(22)
| 2| failed...
1 <d> <eEfficienc> | 1| 13:BRANCH (buf:0/0)(21)
1 <d> <eEfficienc> | 2| 15:ANYOFM[Ee](17)
2 <de> <Efficiency> | 2| 17:EXACT <fficiency>(22)
| 2| failed...
| 1| BRANCH failed...
| 0| failed...
2 <de> <Efficiency> | 0| 1:CURLY{0,1}(7)
| 0| EXACT <d> can match 0 times out of 1...
2 <de> <Efficiency> | 1| 7:BRANCH (buf:0/0)(13)
2 <de> <Efficiency> | 2| 9:EXACT <eficiency>(22)
| 2| failed...
2 <de> <Efficiency> | 1| 13:BRANCH (buf:0/0)(21)
2 <de> <Efficiency> | 2| 15:ANYOFM[Ee](17)
3 <deE> <fficiency> | 2| 17:EXACT <fficiency>(22)
12 <deEfficiency> <> | 2| 22:END(0)
Match successful!
Freeing REx: "d?(?:eficiency|[eE]fficiency)"
> perl -e "'deEfficiency' =~ /d?+(?:eficiency|[eE]fficiency)/;" -M"re"="debug"
Compiling REx "d?+(?:eficiency|[eE]fficiency)"
Final program:
1: SUSPEND (11)
3: CURLY{0,1} (9)
7: EXACT <d> (0)
9: SUCCEED (0)
10: TAIL (11)
11: BRANCH (buf:0/0) (17)
13: EXACT <eficiency> (26)
17: BRANCH (buf:0/0) (FAIL)
19: ANYOFM[Ee] (21)
21: EXACT <fficiency> (26)
25: TAIL (26)
26: END (0)
minlen 9
Matching REx "d?+(?:eficiency|[eE]fficiency)" against "deEfficiency"
0 <> <deEfficien> | 0| 1:SUSPEND(11)
0 <> <deEfficien> | 1| 3:CURLY{0,1}(9)
| 1| EXACT <d> can match 1 times out of 1...
1 <d> <eEfficienc> | 2| 9:SUCCEED(0)
| 2| SUCCEED: subpattern success...
1 <d> <eEfficienc> | 0| 11:BRANCH (buf:0/0)(17)
1 <d> <eEfficienc> | 1| 13:EXACT <eficiency>(26)
| 1| failed...
1 <d> <eEfficienc> | 0| 17:BRANCH (buf:0/0)(25)
1 <d> <eEfficienc> | 1| 19:ANYOFM[Ee](21)
2 <de> <Efficiency> | 1| 21:EXACT <fficiency>(26)
| 1| failed...
| 0| BRANCH failed...
1 <d> <eEfficienc> | 0| 1:SUSPEND(11)
1 <d> <eEfficienc> | 1| 3:CURLY{0,1}(9)
| 1| EXACT <d> can match 0 times out of 1...
1 <d> <eEfficienc> | 2| 9:SUCCEED(0)
| 2| SUCCEED: subpattern success...
1 <d> <eEfficienc> | 0| 11:BRANCH (buf:0/0)(17)
1 <d> <eEfficienc> | 1| 13:EXACT <eficiency>(26)
| 1| failed...
1 <d> <eEfficienc> | 0| 17:BRANCH (buf:0/0)(25)
1 <d> <eEfficienc> | 1| 19:ANYOFM[Ee](21)
2 <de> <Efficiency> | 1| 21:EXACT <fficiency>(26)
| 1| failed...
| 0| BRANCH failed...
2 <de> <Efficiency> | 0| 1:SUSPEND(11)
2 <de> <Efficiency> | 1| 3:CURLY{0,1}(9)
| 1| EXACT <d> can match 0 times out of 1...
2 <de> <Efficiency> | 2| 9:SUCCEED(0)
| 2| SUCCEED: subpattern success...
2 <de> <Efficiency> | 0| 11:BRANCH (buf:0/0)(17)
2 <de> <Efficiency> | 1| 13:EXACT <eficiency>(26)
| 1| failed...
2 <de> <Efficiency> | 0| 17:BRANCH (buf:0/0)(25)
2 <de> <Efficiency> | 1| 19:ANYOFM[Ee](21)
3 <deE> <fficiency> | 1| 21:EXACT <fficiency>(26)
12 <deEfficiency> <> | 1| 26:END(0)
Match successful!
Freeing REx: "d?+(?:eficiency|[eE]fficiency)"
]]
Tested against Perl 5.40.3. Been like this years ago.
Another minor caveat is that, /[eE]/:
Seems unconditionally converted to /[Ee]/.
(the order may contain frequency hints - deployable enhancement)
On Tue, Oct 01, 2024 at 01:06:26AM -0700, MasterInQuestion wrote:
https://github.com/MasterInQuestion/coordTransform/commit/dabc782b3e200d81210032f4dc6222b344e13d0b
Test case: [[
> perl -e "'deEfficiency' =~ /d?(?:eficiency|[eE]fficiency)/;" -M"re"="debug" Compiling REx "d?(?:eficiency|[eE]fficiency)" Final program: 1: CURLY{0,1} (7) 5: EXACT <d> (0) 7: BRANCH (buf:0/0) (13) 9: EXACT <eficiency> (22) 13: BRANCH (buf:0/0) (FAIL) 15: ANYOFM[Ee] (17) 17: EXACT <fficiency> (22) 21: TAIL (22) 22: END (0) minlen 9 Matching REx "d?(?:eficiency|[eE]fficiency)" against "deEfficiency" 0 <> <deEfficien> | 0| 1:CURLY{0,1}(7)
[snip 2 sets of debugging output]
So what exact behaviour do you consider sub-optimal, and what exact behaviour would you like to see in its place?
-- That he said that that that that is is is debatable, is debatable.
This code is a bit contrived. By starting the alternation with a character class you disable some of the optimizations that the current regex is capable of making. Currently perl's regex engine doesn't know how to unroll character-classes in alternations. If you manually unroll the char-class then you end up with this:
perl -Mre=debug -e'"deEfficiency" =~ /d?(?:eficiency|efficiency|Efficiency)/;'
Compiling REx "d?(?:eficiency|efficiency|Efficiency)"
Final program:
1: CURLY{0,1} (5)
3: EXACT <d> (0)
5: TRIEC-EXACT[Ee] (21)
<eficiency>
<efficiency>
<Efficiency>
21: END (0)
minlen 9
Matching REx "d?(?:eficiency|efficiency|Efficiency)" against "deEfficiency"
0 <> <deEfficien> | 0| 1:CURLY{0,1}(5)
| 0| EXACT <d> can match 1 times out of 1...
1 <d> <eEfficienc> | 1| 5:TRIEC-EXACT[Ee](21)
1 <d> <eEfficienc> | 1| TRIE: State: 1 Accepted: N TRIE: Charid: 1 CP: 65 After State: 2
2 <de> <Efficiency> | 1| TRIE: State: 2 Accepted: N TRIE: Charid: 7 CP: 45 After State: 0
| 1| failed...
0 <> <deEfficien> | 1| 5:TRIEC-EXACT[Ee](21)
| 1| TRIE: failed to match trie start class...
| 0| failed...
1 <d> <eEfficienc> | 0| 1:CURLY{0,1}(5)
| 0| EXACT <d> can match 0 times out of 1...
1 <d> <eEfficienc> | 1| 5:TRIEC-EXACT[Ee](21)
1 <d> <eEfficienc> | 1| TRIE: State: 1 Accepted: N TRIE: Charid: 1 CP: 65 After State: 2
2 <de> <Efficiency> | 1| TRIE: State: 2 Accepted: N TRIE: Charid: 7 CP: 45 After State: 0
| 1| failed...
| 0| failed...
2 <de> <Efficiency> | 0| 1:CURLY{0,1}(5)
| 0| EXACT <d> can match 0 times out of 1...
2 <de> <Efficiency> | 1| 5:TRIEC-EXACT[Ee](21)
2 <de> <Efficiency> | 1| TRIE: State: 1 Accepted: N TRIE: Charid: 7 CP: 45 After State: 13
3 <deE> <fficiency> | 1| TRIE: State: 13 Accepted: N TRIE: Charid: 2 CP: 66 After State: 14
4 <deEf> <ficiency> | 1| TRIE: State: 14 Accepted: N TRIE: Charid: 2 CP: 66 After State: 15
5 <deEff> <iciency> | 1| TRIE: State: 15 Accepted: N TRIE: Charid: 3 CP: 69 After State: 16
6 <deEffi> <ciency> | 1| TRIE: State: 16 Accepted: N TRIE: Charid: 4 CP: 63 After State: 17
7 <deEffic> <iency> | 1| TRIE: State: 17 Accepted: N TRIE: Charid: 3 CP: 69 After State: 18
8 <deEffici> <ency> | 1| TRIE: State: 18 Accepted: N TRIE: Charid: 1 CP: 65 After State: 19
9 <deEfficie> <ncy> | 1| TRIE: State: 19 Accepted: N TRIE: Charid: 5 CP: 6e After State: 1a
10 <deEfficien> <cy> | 1| TRIE: State: 1a Accepted: N TRIE: Charid: 4 CP: 63 After State: 1b
11 <deEfficienc> <y> | 1| TRIE: State: 1b Accepted: N TRIE: Charid: 6 CP: 79 After State: 1c
12 <deEfficiency> <> | 1| TRIE: State: 1c Accepted: Y TRIE: Charid: 0 CP: 0 After State: 0
| 1| TRIE: got 1 possible matches
| 1| TRIE matched word #3, continuing
| 1| TRIE: only one match left, short-circuiting: #3 <Efficiency>
12 <deEfficiency> <> | 1| 21:END(0)
Match successful!
Freeing REx: "d?(?:eficiency|efficiency|Efficiency)"
which does not backtrack and uses a DFA like approach. (A TRIE is just a special case of a DFA in this context.)
Backtracking past "d" for "e" alike would be entirely pointless: for non-overlapping impossible to match. Alike in linked past "-" (or ".") for "\d" etc.
Off-Topic: What does the number in "re" debug output mean? [[ Final program: 1: ... (7) 5: ... ]] I noted some the numbers changed between the old version's output and current.