perl5 Fix \d script run with unusual Unicode data layout

This fixes GH #22535

Unicode guarantees that \d code points occur in groups of 10 consecutive ones, with the lowest having a numeric value of 0 and the highest having a value of 9.

A script run in a regular expression pattern matches only characters in a single script. Further, if more than a single digit is matched, all must come from the same group of 10 consecutive code points.

The 'Common' script has many such groups, not just 0-9. Perl's implementation assumed that all groups were isolated from each other in the Unicode ordering of code points. This is true in all but one case where there are 5 groups which adjoin each other. This commit changes the implementation to be cognizant of this possibility.

Aug 23 '24 21:08 khwilliamson

In blead, t/re/run_script.t executes 185 unit tests:

$ ./perl -Ilib t/re/script_run.t
...
ok 178 - Non-ASCII Common digits work with Latin
ok 179 - Non-ASCII Common digits work with Latin
ok 180 - Non-ASCII Common digits work with Latin
ok 181 - Non-ASCII Common digits work with Latin
ok 182 - Non-ASCII Common digits work with Latin
ok 183 - Non-ASCII Common digits work with Greek
ok 184 - Non-ASCII Common digits work with Greek
ok 185 - [perl \#133997]
1..185

$ cd t; ./perl harness re/script_run.t; cd -
re/script_run.t .. ok     
All tests successful.
Files=1, Tests=185,  0 wallclock secs ( 0.01 usr  0.00 sys +  0.02 cusr  0.00 csys =  0.03 CPU)
Result: PASS

When I began to review this pull request, I decided to take a test-driven approach, i.e., what would happen if locally I first added only the new unit tests in run_script.t, without modifying regexec.c.

$ gitshowf |cat
commit f01ab297b015c717d63cd082ba6a9683c7767933
Author:     James E Keenan <[email protected]>
AuthorDate: Sat Aug 24 08:44:10 2024 -0400
Commit:     James E Keenan <[email protected]>
CommitDate: Sat Aug 24 08:44:10 2024 -0400

    Unit tests suggested in GH #22536
    
    ... prior to adding the code changes which should make these tests PASS.

diff --git a/t/re/script_run.t b/t/re/script_run.t
index 5c6b40c390..2be21d5e41 100644
--- a/t/re/script_run.t
+++ b/t/re/script_run.t
@@ -117,6 +117,13 @@ foreach my $type ('script_run', 'sr', 'atomic_script_run', 'asr') {
     like("\x{1d7ce}αβγ", qr/^(*sr:.{4})/,
          "Non-ASCII Common digits work with Greek"); # perl #133547
 
+    # GH #22535
+    unlike("A\x{1d7f5}\x{1d7ff}B", qr/^(*sr:.{4})/,
+           "Verify works when Unicode has multiple adjacent \\d runs,"
+         . " in different runs");
+    like("A\x{1d7f6}\x{1d7ff}B", qr/^(*sr:.{4})/,
+         "Verify works when Unicode has multiple adjacent \\d runs, same run");
+
     fresh_perl_is('print scalar "0" =~ m!(((*sr:()|)0)(*sr:)0|)!;',
                   1, {}, '[perl #133997]');

Presumably those new unit tests would FAIL until such time as I modified regexec.c. But the result was perverse.

$ sh ./Configure -des -Dusedevel && make test_prep
...
$ ./perl -Ilib t/re/script_run.t
...
ok 178 - Non-ASCII Common digits work with Latin
ok 179 - Non-ASCII Common digits work with Latin
ok 180 - Non-ASCII Common digits work with Latin
ok 181 - Non-ASCII Common digits work with Latin
ok 182 - Non-ASCII Common digits work with Latin
ok 183 - Non-ASCII Common digits work with Greek
ok 184 - Non-ASCII Common digits work with Greek
Modification of a read-only value attempted at ./test.pl line 337.

The program encounters a fatal error at the first of the #22535 tests. Here's the relevant code from t/test.pl:

 312 # A way to display scalars containing control characters and Unicode.
 313 # Trying to avoid setting $_, or relying on local $_ to work.
 314 sub display {
 315     my @result;
 316     foreach my $x (@_) {
 317         if (defined $x and not ref $x) {
 318             my $y = '';
 319             foreach my $c (unpack($chars_template, $x)) {
 320                 if ($c > 255) {
 321                     $y = $y . sprintf "\\x{%x}", $c;
 322                 } elsif ($backslash_escape{$c}) {
 323                     $y = $y . $backslash_escape{$c};
 324                 } elsif ($c < ord " ") {
 325                     # Use octal for characters with small ordinals that are
 326                     # traditionally expressed as octal: the controls below
 327                     # space, which on EBCDIC are almost all the controls, but
 328                     # on ASCII don't include DEL nor the C1 controls.
 329                     $y = $y . sprintf "\\%03o", $c;
 330                 } elsif (chr $c =~ /[[:print:]]/a) {
 331                     $y = $y . chr $c;
 332                 }
 333                 else {
 334                     $y = $y . sprintf "\\x%02X", $c;
 335                 }
 336             }
 337             $x = $y;  ### <--- POINT OF FAILURE
 338         }
 339         return $x unless wantarray;
 340         push @result, $x;
 341     }
 342     return @result;
 343 }

Can anyone say what's going on?

Aug 24 '24 12:08 jkeenan

Yes. It's a bug in test.pl. It is trying to modify a literal string for better display. I ran into this myself developing the patch; it took me a while to figure out, but I am preparing a patch for test.pl that avoids the problem.

Aug 24 '24 14:08 khwilliamson

That patch is #22537

Aug 24 '24 15:08 khwilliamson

That patch is #22537

Yes, once we nail that patch down, this should be okay.

Aug 24 '24 19:08 jkeenan

perl5 perl5 copied to clipboard

Fix \d script run with unusual Unicode data layout

perl5
perl5 copied to clipboard