perl5
perl5 copied to clipboard
Fix \d script run with unusual Unicode data layout
This fixes GH #22535
Unicode guarantees that \d code points occur in groups of 10 consecutive ones, with the lowest having a numeric value of 0 and the highest having a value of 9.
A script run in a regular expression pattern matches only characters in a single script. Further, if more than a single digit is matched, all must come from the same group of 10 consecutive code points.
The 'Common' script has many such groups, not just 0-9. Perl's implementation assumed that all groups were isolated from each other in the Unicode ordering of code points. This is true in all but one case where there are 5 groups which adjoin each other. This commit changes the implementation to be cognizant of this possibility.
In blead, t/re/run_script.t executes 185 unit tests:
$ ./perl -Ilib t/re/script_run.t
...
ok 178 - Non-ASCII Common digits work with Latin
ok 179 - Non-ASCII Common digits work with Latin
ok 180 - Non-ASCII Common digits work with Latin
ok 181 - Non-ASCII Common digits work with Latin
ok 182 - Non-ASCII Common digits work with Latin
ok 183 - Non-ASCII Common digits work with Greek
ok 184 - Non-ASCII Common digits work with Greek
ok 185 - [perl \#133997]
1..185
$ cd t; ./perl harness re/script_run.t; cd -
re/script_run.t .. ok
All tests successful.
Files=1, Tests=185, 0 wallclock secs ( 0.01 usr 0.00 sys + 0.02 cusr 0.00 csys = 0.03 CPU)
Result: PASS
When I began to review this pull request, I decided to take a test-driven approach, i.e., what would happen if locally I first added only the new unit tests in run_script.t, without modifying regexec.c.
$ gitshowf |cat
commit f01ab297b015c717d63cd082ba6a9683c7767933
Author: James E Keenan <[email protected]>
AuthorDate: Sat Aug 24 08:44:10 2024 -0400
Commit: James E Keenan <[email protected]>
CommitDate: Sat Aug 24 08:44:10 2024 -0400
Unit tests suggested in GH #22536
... prior to adding the code changes which should make these tests PASS.
diff --git a/t/re/script_run.t b/t/re/script_run.t
index 5c6b40c390..2be21d5e41 100644
--- a/t/re/script_run.t
+++ b/t/re/script_run.t
@@ -117,6 +117,13 @@ foreach my $type ('script_run', 'sr', 'atomic_script_run', 'asr') {
like("\x{1d7ce}αβγ", qr/^(*sr:.{4})/,
"Non-ASCII Common digits work with Greek"); # perl #133547
+ # GH #22535
+ unlike("A\x{1d7f5}\x{1d7ff}B", qr/^(*sr:.{4})/,
+ "Verify works when Unicode has multiple adjacent \\d runs,"
+ . " in different runs");
+ like("A\x{1d7f6}\x{1d7ff}B", qr/^(*sr:.{4})/,
+ "Verify works when Unicode has multiple adjacent \\d runs, same run");
+
fresh_perl_is('print scalar "0" =~ m!(((*sr:()|)0)(*sr:)0|)!;',
1, {}, '[perl #133997]');
Presumably those new unit tests would FAIL until such time as I modified regexec.c. But the result was perverse.
$ sh ./Configure -des -Dusedevel && make test_prep
...
$ ./perl -Ilib t/re/script_run.t
...
ok 178 - Non-ASCII Common digits work with Latin
ok 179 - Non-ASCII Common digits work with Latin
ok 180 - Non-ASCII Common digits work with Latin
ok 181 - Non-ASCII Common digits work with Latin
ok 182 - Non-ASCII Common digits work with Latin
ok 183 - Non-ASCII Common digits work with Greek
ok 184 - Non-ASCII Common digits work with Greek
Modification of a read-only value attempted at ./test.pl line 337.
The program encounters a fatal error at the first of the #22535 tests. Here's the relevant code from t/test.pl:
312 # A way to display scalars containing control characters and Unicode.
313 # Trying to avoid setting $_, or relying on local $_ to work.
314 sub display {
315 my @result;
316 foreach my $x (@_) {
317 if (defined $x and not ref $x) {
318 my $y = '';
319 foreach my $c (unpack($chars_template, $x)) {
320 if ($c > 255) {
321 $y = $y . sprintf "\\x{%x}", $c;
322 } elsif ($backslash_escape{$c}) {
323 $y = $y . $backslash_escape{$c};
324 } elsif ($c < ord " ") {
325 # Use octal for characters with small ordinals that are
326 # traditionally expressed as octal: the controls below
327 # space, which on EBCDIC are almost all the controls, but
328 # on ASCII don't include DEL nor the C1 controls.
329 $y = $y . sprintf "\\%03o", $c;
330 } elsif (chr $c =~ /[[:print:]]/a) {
331 $y = $y . chr $c;
332 }
333 else {
334 $y = $y . sprintf "\\x%02X", $c;
335 }
336 }
337 $x = $y; ### <--- POINT OF FAILURE
338 }
339 return $x unless wantarray;
340 push @result, $x;
341 }
342 return @result;
343 }
Can anyone say what's going on?
Yes. It's a bug in test.pl. It is trying to modify a literal string for better display. I ran into this myself developing the patch; it took me a while to figure out, but I am preparing a patch for test.pl that avoids the problem.
That patch is #22537
That patch is #22537
Yes, once we nail that patch down, this should be okay.