perl5 icon indicating copy to clipboard operation
perl5 copied to clipboard

use re '/aansx'; breaks split " ", ...

Open druud opened this issue 5 years ago • 11 comments

$ perldoc -f split [...] As another special case, "split" emulates the default behavior of the command line tool awk when the PATTERN is either omitted or a string composed of a single space character [...] In this case, any leading whitespace in EXPR is removed before splitting occurs, and the PATTERN is instead treated as if it were "/\s+/"; [...] in Perl 5.18.0 and later this special case is triggered by any expression which evaluates to the simple string " ". [...] If omitted, PATTERN defaults to a single space, " ", triggering the previously described awk emulation. [...]

$ perl -Mstrict -wE'

  my $s= "  foo  bar  ";

  say for split " ", $s;

  { use re qw( /aansx );
    say for split " ", $s;
  }
'
foo
bar
 
 
f
o
o
 
 
b
a
r 



---
Site configuration information for perl 5.24.3:

Configured by root at Tue Feb 13 14:35:15 CET 2018.
[...]
Locally applied patches:
    fix for #132063
    fix for #132227
[...]

druud avatar Aug 05 '20 11:08 druud

I guess that split needs (?^: ... ).

druud avatar Aug 05 '20 11:08 druud

Simplifies to:

perl -Mstrict -wE'

  my $s= "  foo  bar  ";
  say for split / /, $s;
  say for split / /aansx, $s;
'

I'm confused as to what the bug/issue is here. Surely the use of x in the use re statement tell the interpreter to treat the subsequent line as if it had been written say for split "", $s;? In which case the output is what should be expected.

Is this perhaps somewhere where we could usefully improve the documentation, most likely that of the re module?

richardleach avatar Feb 13 '24 23:02 richardleach

split " " is not the same thing as split / /. (Also, this is still broken in blead.)

mauke avatar Feb 13 '24 23:02 mauke

split " " is not the same thing as split / /. (Also, this is still broken in blead.)

Doubtless I will look at this again in the morning and feel really stupid, but how do they differ?

They both generate the same op tree with the same split op:

9              </> split(/" "/)[t2] lKM/IMPLIM ->a

and the same final regex program:

Final program:
   1: EXACT < > (3)
   3: END (0)

Or is the bug that they do generate the same thing but in fact should not?

richardleach avatar Feb 14 '24 00:02 richardleach

$ perl -wE 'say "[$_]" for split / /, "  foo\tbar"'
[]
[]
[foo    bar]
$ perl -wE 'say "[$_]" for split " ", "  foo\tbar"'
[foo]
[bar]
$

As for how this is implemented, I think it involves regex flags (RXf_WHITE and RXf_SKIPWHITE, which perlreapi claims is unused, but pp.c still checks for it (?)).

Anyway, as the documentation quoted by OP explains, passing a string consisting of a single space to split is supposed to act like a flag that makes split do the equivalent of $string =~ /\S+/g as a special case. It is not treated as a regex, so any use re flags should be ignored.

mauke avatar Feb 14 '24 00:02 mauke

I think it involves regex flags (RXf_WHITE and RXf_SKIPWHITE,

Yeah, it does. I forgot they were regex flags rather than split's op_private flags. D'oh. Not sure offhand how to dump those.

I think was just reading the documentation from a different angle. The docs say:

"..."split" emulates the default behavior of the command line tool awk when the PATTERN is either
omitted or a string composed of a single space character". 

Those docs don't strictly say that a single space is treated like a flag, just that that is the behaviour when the pattern is a single space. But since use re changes all patterns, the pattern being asked for changes from a single space to an empty pattern. That makes perfect sense to me, but I understand now that the argument is that split " " should irregardless get sufficient special handling that use re flags don't modify it.

On that basis, I'm happy to look into it, unless someone's already started doing so and hasn't assigned themselves the ticket. https://perldoc.perl.org/perlreapi makes it sound like a parsing bug, but I dunno if it's a bug in e.g. toke.c or in regex compilation (more likely the latter?).

richardleach avatar Feb 14 '24 00:02 richardleach

I think it involves regex flags (RXf_WHITE and RXf_SKIPWHITE,

Yeah, it does. I forgot they were regex flags rather than split's op_private flags. D'oh. Not sure offhand how to dump those.

@richardleach see 7042511892 for one way, and the discussion in #18515 for more options.

hvds avatar Feb 14 '24 13:02 hvds

This is a possible fix, in the existing part of regcomp.c that handles special patterns. Does anyone spot any obvious goofs before I prepare a PR?

diff --git a/regcomp.c b/regcomp.c
index f3e2f5e5e7..90105e9433 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -2222,9 +2222,17 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
         /* It's safe to read through *next only if OP(first) is a regop of
          * the right type (not EXACT, for example).
          */
-        if (REGNODE_TYPE(fop) == NOTHING && nop == END)
-            RExC_rx->extflags |= RXf_NULL;
-        else if ((fop == MBOL || (fop == SBOL && !FLAGS(first))) && nop == END)
+        if (REGNODE_TYPE(fop) == NOTHING && nop == END) {
+            if ((RExC_rx->extflags & RXf_SPLIT)
+                && pat_count == 1
+                && OP_TYPE_IS(expr, OP_CONST)
+                && SvPOK(cSVOPx_sv(expr)) && SvCUR(cSVOPx_sv(expr)) == 1
+                && SvPV_nolen(cSVOPx_sv(expr))[0] == ' '
+            ){
+                RExC_rx->extflags |= (RXf_SKIPWHITE|RXf_WHITE);
+            } else
+                RExC_rx->extflags |= RXf_NULL;
+        } else if ((fop == MBOL || (fop == SBOL && !FLAGS(first))) && nop == END)
             /* when fop is SBOL first->flags will be true only when it was
              * produced by parsing /\A/, and not when parsing /^/. This is
              * very important for the split code as there we want to
@@ -2241,7 +2249,6 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
                   && *(STRING(first)) == ' '
                   && OP(regnext(first)) == END )
             RExC_rx->extflags |= (RXf_SKIPWHITE|RXf_WHITE);
-
     }
 
     if (RExC_contains_locale) {

richardleach avatar Feb 26 '24 17:02 richardleach

^ The above patch is brittle when it comes to telling split " " apart from split / /. Needs more work.

richardleach avatar Feb 26 '24 21:02 richardleach

^ The above patch is brittle when it comes to telling split " " apart from split / /. Needs more work.

LOL, no it's fine in that regard. Will raise a PR.

richardleach avatar Feb 26 '24 21:02 richardleach

I agree that STR would be better changed to STRING. The only places I saw in core for plainSTR are for index and rindes

khwilliamson avatar Mar 14 '24 19:03 khwilliamson