perl5
perl5 copied to clipboard
Wrong behavior of (*THEN) when it follows a pattern containing an alternation
The following code outputs "abc" when it should output "abcd".
say $& if "abcde" =~ / ^ a (?: (?: (??{""}) b | bc (*ACCEPT) ) (*THEN) (*FAIL) | bcd ) /x;
The purpose of (??{""}) is to prevent an optimization that modify the original semantics of the pattern.
Here is the description of (*THEN) in perlre which is not entirely precise.
"(*THEN)" "(*THEN:NAME)"
This is similar to the "cut group" operator "::" from Raku. Like "(*PRUNE)", this verb always
matches, and when backtracked into on failure, it causes the regex engine to try the next
alternation in the innermost enclosing group (capturing or otherwise) that has alternations.
The relevant part is it causes the regex engine to try the next alternation in the innermost enclosing group (capturing or otherwise) that has alternations..
If the real meaning is "the outter innermost enclosing group" then the current behavior is wrong since the outter group containing (*THEN) is (?: (?: (??{""}) b | bc (*ACCEPT) ) (*THEN) (*FAIL) | bcd )
Otherwise the behavior is still wrong because here the group that the regex engine backtracks into ((?: (??{""}) b | bc (*ACCEPT) )) is not the innermost enclosing group and the regex engine backtracks into a pattern directly before/on the left side of (*THEN) which is precisely what this group is supposed to prevent in the first place.
I agree this is not behavior correctly. I am not sure I follow the argument behind the "if the real meaning" part, or the following paragraph, but i do agree that it should match "abcd".
I will try to find time to look into this, but i don't know when.
FWIW, the issue seems to be that we set PREGf_SKIP for this pattern, which then disables retrying the stclass. I am not sure what the story is there. Running make test_reonly with the PREGf_SKIP logic disabled as so:
$ git diff
diff --git a/regcomp.c b/regcomp.c
index 6e35d95d2a..7679c56c7e 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -1917,7 +1917,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
first = REGNODE_AFTER(first);
goto again;
}
- if (sawplus && !sawminmod && !sawlookahead
+ if (0 && sawplus && !sawminmod && !sawlookahead
&& (!sawopen || !RExC_sawback)
&& !(RExC_seen & REG_PESSIMIZE_SEEN)) /* May examine pos and $& */
/* x+ must match at the 1st pos of run of x's */
I see this:
TEST_ARGS='re/*.t ../ext/re/t/*.t' PERL_TEST_HARNESS_ASAP=1 TESTFILE=harness ./runtests choose
re/alpha_assertions.t ......... ok
re/anyof.t .................... ok
re/begin-once.t ............... ok
re/bigfuzzy_not_utf8.t ........ ok
re/charset.t .................. ok
re/fold_grind_8.t ............. ok
re/fold_grind_T.t ............. skipped: Couldn't find a UTF-8 turkic locale
re/fold_grind_a.t ............. ok
re/fold_grind_aa.t ............ ok
re/fold_grind_d.t ............. ok
re/fold_grind_l.t ............. ok
re/fold_grind_u.t ............. ok
re/keep_tabs.t ................ ok
re/no_utf8_pm.t ............... ok
re/opt.t ...................... ok
re/overload.t ................. ok
re/pat.t ...................... ok
re/pat_advanced.t ............. ok
re/pat_advanced_thr.t ......... ok
re/pat_psycho.t ............... ok
re/pat_psycho_thr.t ........... ok
re/pat_re_eval.t .............. ok
re/pat_re_eval_thr.t .......... ok
re/pat_rt_report.t ............ ok
re/pat_rt_report_thr.t ........ ok
re/pat_special_cc.t ........... ok
re/pat_special_cc_thr.t ....... ok
re/pat_thr.t .................. ok
re/pos.t ...................... ok
re/qr-72922.t ................. ok
re/qr.t ....................... ok
re/qr_gc.t .................... ok
re/qrstack.t .................. ok
re/recompile.t ................ ok
re/reg_60508.t ................ ok
re/reg_email.t ................ ok
re/reg_email_thr.t ............ ok
re/reg_eval.t ................. ok
re/reg_eval_scope.t ........... ok
re/reg_fold.t ................. ok
re/reg_mesg.t ................. ok
re/reg_namedcapture.t ......... ok
re/reg_nc_tie.t ............... ok
re/reg_nocapture.t ............ ok
re/reg_pmod.t ................. ok
re/reg_posixcc.t .............. ok
re/regex_sets.t ............... ok
re/regex_sets_compat.t ........ ok
re/regexp.t ................... ok
re/regexp_noamp.t ............. ok
re/regexp_nonull.t ............ ok
re/regexp_normal.t ............ ok
re/regexp_notrie.t ............ ok
re/regexp_qr.t ................ ok
re/regexp_qr_embed.t .......... ok
re/regexp_qr_embed_thr.t ...... ok
re/regexp_trielist.t .......... ok
re/regexp_unicode_prop.t ...... ok
re/regexp_unicode_prop_thr.t .. ok
re/rt122747.t ................. ok
re/rxcode.t ................... ok
re/script_run.t ............... ok
re/speed.t .................... ok
re/speed_thr.t ................ ok
re/stclass_threads.t .......... 3/6 # Failed test 4 - Behavior appears to be sub quadratic (19.3, 154.48) at re/stclass_threads.t line 77
# Failed test 5 - Behavior is linear and not quadratic (154.48, 1504.498) at re/stclass_threads.t line 78
# Failed test 6 - Behavior is linear as expected at re/stclass_threads.t line 79
re/stclass_threads.t .......... Failed 3/6 subtests
re/subst.t .................... ok
re/substT.t ................... ok
re/subst_amp.t ................ ok
re/subst_wamp.t ............... ok
re/uniprops01.t ............... ok
re/uniprops02.t ............... ok
re/uniprops03.t ............... ok
re/uniprops04.t ............... ok
re/uniprops05.t ............... ok
re/uniprops06.t ............... ok
re/uniprops07.t ............... ok
re/uniprops08.t ............... ok
re/uniprops09.t ............... ok
re/uniprops10.t ............... ok
re/user_prop_race_thr.t ....... ok
../ext/re/t/intflags.t ........ ok
../ext/re/t/lexical_debug.t ... ok
../ext/re/t/qr.t .............. ok
../ext/re/t/re.t .............. ok
../ext/re/t/re_funcs.t ........ 1/40
# Failed test '/[ab]+/ has skip'
# at t/re_funcs.t line 85.
# Looks like you failed 1 test of 40.
../ext/re/t/re_funcs.t ........ Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/40 subtests
../ext/re/t/re_funcs_u.t ...... ok
../ext/re/t/reflags.t ......... ok
../ext/re/t/regop.t ........... ok
../ext/re/t/strict.t .......... ok
Test Summary Report
-------------------
re/stclass_threads.t (Wstat: 0 Tests: 6 Failed: 3)
Failed tests: 4-6
../ext/re/t/re_funcs.t (Wstat: 256 (exited 1) Tests: 40 Failed: 1)
Failed test: 35
Non-zero exit status: 1
Files=89, Tests=388144, 228 wallclock secs (34.14 usr 0.33 sys + 199.12 cusr 13.55 csys = 247.14 CPU)
Result: FAIL
Finished test run at Fri Sep 29 11:19:47 2023.
make: *** [makefile:867: test_reonly] Error 4
Some of the related logic looks like it has changed since the last time I looked at it, and I suspect this is a regression from those changes, but i don't know for sure.
Yes, sorry this paragraph it's a bit dodgy, I was trying to explain and show that the perldoc description is a bit imprecise but on second thought, the description seems adequate and the "outtermost" adjective would be redundant.
For reference, this behaviour hasn't changed since 5.14:
$ eachperl -E 'say $& if "abcde" =~ / ^ a (?: (?: (??{""}) b | bc (*ACCEPT) ) (*THEN) (*FAIL) | bcd ) /x;'
--- perl5.14.4 ---
abc
--- perl5.16.3 ---
abc
--- perl5.18.4 ---
abc
...
--- perl5.38.2 ---
abc
--- perl5.39.8 ---
abc