perl5 Interaction of case-modifiers (\U, \L, \u, \l, \F, \Q, \E) in double quoted strings

Preamble:

The goal of this ticket is to document the current behavior of the various case modifiers and how they interact with each other.

Various people have already looked at this and a lot has been written about it. This ticket tries to summarize that and put all the information in one place.

The contents of this ticket are based on:

https://github.com/Perl/perl5/issues/5467
https://github.com/Perl/perl5/issues/8846
https://github.com/Perl/perl5/issues/11145
https://github.com/Perl/perl5/issues/13257
https://github.com/Perl/perl5/issues/18981
https://github.com/Perl/perl5/issues/19670
https://www.nntp.perl.org/group/perl.perl5.porters/2012/01/msg181429.html : "\Questions about the \Future of \Escapes"
https://www.nntp.perl.org/group/perl.perl5.porters/2011/11/msg179078.html : "What's the difference between qr/\U\x{39}/ and qr/\U\x{3a}/ ?"
https://www.nntp.perl.org/group/perl.perl5.porters/2013/08/msg206466.html : "changes to \Q \F \Etc and "casemod escapes""
https://www.nntp.perl.org/group/perl.perl5.porters/2022/07/msg264466.html : "Escape sequences \L, \U fall when we use them together"

Description

Reading the various tickets shows there are (at least) two cases:

using the case-modifiers inside a double-quoted string (qq)
using the case-modifiers inside a regex and/or inside qr

In this ticket I'm focusing only on double-quoted strings and ignoring the case modifiers inside a regex.

Case modifiers can be divided into three groups (names borrowed from an older message by @demerphq):

"non-case" modifiers: \Q (quotemeta())
"inner" case modifiers: \U (uc()), \L (lc()), \F (fc())
"outer" case modifiers: \u (ucfirst()), \l (lcfirst())
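As a minimal sketch of the grouping above, the following compares each modifier with the function it is named after. The `%groups` hash and the sample inputs are illustrative choices, not something taken from the tickets.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'fc';

# Each entry is [ interpolated string, equivalent function call ].
# The inputs ("bB", "Bb", "1+2") are arbitrary and only for illustration.
my %groups = (
    'non-case \Q' => [ "\Q1+2", quotemeta("1+2") ],
    'inner \U'    => [ "\UbB",  uc("bB") ],
    'inner \L'    => [ "\LbB",  lc("bB") ],
    'inner \F'    => [ "\FbB",  fc("bB") ],
    'outer \u'    => [ "\ubB",  ucfirst("bB") ],
    'outer \l'    => [ "\lBb",  lcfirst("Bb") ],
);

for my $name (sort keys %groups) {
    my ($got, $expected) = @{ $groups{$name} };
    printf "%-12s %s\n", $name, $got eq $expected ? "same" : "different";
}
```

The "inner" modifiers act on everything up to the next \E (or the end of the string), while the "outer" ones only touch the next character, which is why they map to `ucfirst()` and `lcfirst()`.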

A test script to show/test/document the behavior is included at the end. (Note: this also includes 'crazy cases' and cases that don't make a lot of sense.)

An attempt at a text based description of the various "rules":

escape sequences (\n, \t, \x.., \N{...}, ....) are applied before case modifiers
an "inner" case modifier overrules an "outer" case modifier, although the "outer" modifier is applied first; there are also two special cases:
- \U\l is treated as \l\U (i.e. lcfirst(uc(...)));
- \L\u is treated as \u\L (i.e. ucfirst(lc(...)));
- \F\l and \F\u are not special cased and are treated as fc(lcfirst(...)) and fc(ucfirst(...));
- \Lfoo\ubar really is treated as: lc("foo" . ucfirst("bar"));[^1]
\E following an inner or outer case modifier cancels it, but with a special case due to the special casing of \U\l and \L\u in rule 1 (the inner/outer rule above):
- \Ufoo\L\Ebar is treated as uc("foo" . "bar")
- \Ufoo\l\Ebar is treated as uc("foo" . "bar")
- \U\l\E\Ubar is treated as lcfirst(uc("bar"))[^2] (in \U\l\E the first \U is cancelled and not the \l)
- \L\u\E\LBAR is treated as ucfirst(lc("BAR"))[^2]
an "inner" case modifier implicitly ends another "inner" case modifier (i.e. no stacking) when it's not a 'cancelled modifier' (see rule 2, the \E rule above)
- \Ufoo\Lbar is treated as uc("foo") . lc("bar")
- \Ufoo\L\Ebar is treated as uc("foo" . "bar")
- \Ufoo\L\u\Ebar due to the special casing is equivalent to: \Ufoo\u\L\Ebar which makes it equivalent to: \Ufoo\ubar
the quotemeta modifier can stack:
- \Q\Q.\E\E is treated as quotemeta(quotemeta("."))
an "inner" case modifier ends the quotemeta modifier when another "inner" case modifier was applied:
- \Ua\Qb\Lc is equivalent to: uc("a" . quotemeta("b")) . lc("c")
an "inner" case modifier does not end the quotemeta modifier when another "inner" case modifier wasn't applied:
- a\Qb\Lc is equivalent to: "a" . quotemeta("b" . lc("c"))
'Immediately' repeating an inner case modifier is an error, unless it was a 'cancelled modifier' (see rule 2, the \E rule above):
- \U\L is an error;
- \U\U is an error;
- \U\Q\U is an error;
- \U\u\U is an error;
- \U\l\U is an error with a confusing message;[^2]
- \U\L\E is not an error;
Repeating an outer case modifier is not an error:
- \u\l\u\l is not an error
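A few of the rules above can be sketched as a short script. This is only an illustration under the behavior described in this ticket; the `@cases` list and the `compiles()` helper are hypothetical names introduced here, and the inputs are borrowed from the rules and the test script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# [ description, interpolated string, equivalent expression ]
my @cases = (
    [ '\U\l is treated as \l\U',          "\U\lbB",           lcfirst(uc("bB")) ],
    [ '\L\u is treated as \u\L',          "\L\ubB",           ucfirst(lc("bB")) ],
    [ 'inner modifiers do not stack',     "\UbB\LcC",         uc("bB") . lc("cC") ],
    [ '\Q does stack',                    "\Q\Q.\E\E",        quotemeta(quotemeta(".")) ],
    [ '\L after \U..\Q also ends the \Q', "\Ub+B\Qc+C\Ld+D",  uc("b+B" . quotemeta("c+C")) . lc("d+D") ],
    [ '\L after a lone \Q keeps the \Q',  "\Qb+b\Lc+C",       quotemeta("b+b" . lc("c+C")) ],
);

for my $case (@cases) {
    my ($description, $got, $expected) = @$case;
    printf "%-34s %s\n", $description, $got eq $expected ? "same" : "different";
}

# Hypothetical helper: compile a string literal at run time and report
# whether it compiled, so that a syntax error does not abort the script.
sub compiles {
    my ($source) = @_;
    local $@;
    my $value = eval $source;
    return $@ eq "" ? 1 : 0;
}

for my $source ('"\U\L"', '"\U\U"', '"\U\L\E"') {
    printf "%-10s %s\n", $source, compiles($source) ? "compiles" : "syntax error";
}
```

Going by the rules, every comparison should report "same", the first two literals should fail to compile and the cancelled `"\U\L\E"` should compile.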

Test script

#!/usr/bin/perl -l

use strict;
use warnings;
use feature "fc";
use Test::More;

# Definitions:
# - "non-case" modifiers: \Q (quotemeta())
# - "inner" case modifiers: \U (uc()), \L (lc()), \F (fc())
# - "outer" case modifiers: \u (ucfirst()), \l (lcfirst())

# For the `fc()` tests use a character where:
# - `lc($x) ne fc($x)` and
# - `lc($x) ne $x`

my $fc_char = "\N{GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI}";
isnt($fc_char, lc($fc_char), "Test \$fc_char ne lc(\$fc_char)");
isnt(lc($fc_char), fc($fc_char), "Test lc(\$fc_char) ne fc(\$fc_char)");


# Basic tests: no stacking, no mixing;
is("aa\UbB", "aa" . uc("bB"), "Basic test (..\\U..)");
is("aa\LbB", "aa" . lc("bB"), "Basic test (..\\L..)");
is("aa\FbB$fc_char", "aa" . fc("bB$fc_char"), "Basic test (..\\F..)");
is("aa\ubB", "aa" . ucfirst("bB"), "Basic test (..\\u..)");
is("aa\lCc", "aa" . lcfirst("Cc"), "Basic test (..\\l..)");
is("aa\Q1+2", "aa" . quotemeta("1+2"), "Basic test (..\\Q..)");

# Basic tests: with \E
is("aa\UbB\EcC", "aa" . uc("bB") . "cC", "Basic test with \\E (..\\U..\\E..)");
is("aa\LbB\EcC", "aa" . lc("bB") . "cC", "Basic test with \\E (..\\L..\\E..)");
is("aa\FbB$fc_char\EcC", "aa" . fc("bB$fc_char") . "cC", "Basic test with \\E (..\\F..\\E..)");
is("aa\Q1+2\E3+4", "aa" . quotemeta("1+2") . "3+4", "Basic test with \\E (..\\Q..\\E..)");

# \E cancels a \u, \l, \U, \L, \F
is("aa\UbB\U\EcC\EdD", "aa" . uc("bB" . "cC") . "dD", "\\E cancel a \\U (..\\U..\\U\\E..\\E");
is("aa\UbB\L\EcC\EdD", "aa" . uc("bB" . "cC") . "dD", "\\E cancel a \\L (..\\U..\\L\\E..\\E");
is("aa\UbB\F\EcC\EdD", "aa" . uc("bB" . "cC") . "dD", "\\E cancel a \\F (..\\U..\\F\\E..\\E");
is("aa\UbB\u\EcC\EdD", "aa" . uc("bB" . "cC") . "dD", "\\E cancel a \\u (..\\U..\\u\\E..\\E");
is("aa\UbB\l\EcC\EdD", "aa" . uc("bB" . "cC") . "dD", "\\E cancel a \\l (..\\U..\\l\\E..\\E");
is("aa\LbB\U\EcC\EdD", "aa" . lc("bB" . "cC") . "dD", "\\E cancel a \\U (..\\L..\\U\\E..\\E");
is("aa\LbB\L\EcC\EdD", "aa" . lc("bB" . "cC") . "dD", "\\E cancel a \\L (..\\L..\\L\\E..\\E");
is("aa\LbB\F\EcC\EdD", "aa" . lc("bB" . "cC") . "dD", "\\E cancel a \\F (..\\L..\\F\\E..\\E");
is("aa\LbB\u\EcC\EdD", "aa" . lc("bB" . "cC") . "dD", "\\E cancel a \\u (..\\L..\\u\\E..\\E");
is("aa\LbB\l\EcC\EdD", "aa" . lc("bB" . "cC") . "dD", "\\E cancel a \\l (..\\L..\\l\\E..\\E");
is("aa\FbB\U\EcC\EdD", "aa" . fc("bB" . "cC") . "dD", "\\E cancel a \\U (..\\F..\\U\\E..\\E");
is("aa\FbB\L\EcC\EdD", "aa" . fc("bB" . "cC") . "dD", "\\E cancel a \\L (..\\F..\\L\\E..\\E");
is("aa\FbB\F\EcC\EdD", "aa" . fc("bB" . "cC") . "dD", "\\E cancel a \\F (..\\F..\\F\\E..\\E");
is("aa\FbB\u\EcC\EdD", "aa" . fc("bB" . "cC") . "dD", "\\E cancel a \\u (..\\F..\\u\\E..\\E");
is("aa\FbB\l\EcC\EdD", "aa" . fc("bB" . "cC") . "dD", "\\E cancel a \\l (..\\F..\\l\\E..\\E");

# Immediately repeating case modifiers is an error
eval q#"\U\U"#; isnt($@, "", "immediately repeating inner modifiers is an error (\\U\\U)");
eval q#"\U\L"#; isnt($@, "", "immediately repeating inner modifiers is an error (\\U\\L)");
eval q#"\U\F"#; isnt($@, "", "immediately repeating inner modifiers is an error (\\U\\F)");
eval q#"\L\U"#; isnt($@, "", "immediately repeating inner modifiers is an error (\\L\\U)");
eval q#"\L\L"#; isnt($@, "", "immediately repeating inner modifiers is an error (\\L\\L)");
eval q#"\L\F"#; isnt($@, "", "immediately repeating inner modifiers is an error (\\L\\F)");
eval q#"\F\U"#; isnt($@, "", "immediately repeating inner modifiers is an error (\\F\\U)");
eval q#"\F\L"#; isnt($@, "", "immediately repeating inner modifiers is an error (\\F\\L)");
eval q#"\F\F"#; isnt($@, "", "immediately repeating inner modifiers is an error (\\F\\F)");

# repeating an inner modifier after an outer modifier (or after \Q) is an error
eval q#"\U\u\U"#; isnt($@, "", "repeating inner modifiers is an error (\\U\\u\\U)");
eval q#"\U\l\U"#; isnt($@, "", "repeating inner modifiers is an error (\\U\\l\\U)");  # special
eval q#"\U\Q\U"#; isnt($@, "", "repeating inner modifiers is an error (\\U\\Q\\U)");
eval q#"\U\u\L"#; isnt($@, "", "repeating inner modifiers is an error (\\U\\u\\L)");
eval q#"\U\l\L"#; isnt($@, "", "repeating inner modifiers is an error (\\U\\l\\L)");  # special
eval q#"\U\Q\L"#; isnt($@, "", "repeating inner modifiers is an error (\\U\\Q\\L)");
eval q#"\U\u\F"#; isnt($@, "", "repeating inner modifiers is an error (\\U\\u\\F)");
eval q#"\U\l\F"#; isnt($@, "", "repeating inner modifiers is an error (\\U\\l\\F)");  # special
eval q#"\U\Q\F"#; isnt($@, "", "repeating inner modifiers is an error (\\U\\Q\\F)");
eval q#"\L\u\U"#; isnt($@, "", "repeating inner modifiers is an error (\\L\\u\\U)");  # special
eval q#"\L\l\U"#; isnt($@, "", "repeating inner modifiers is an error (\\L\\l\\U)");
eval q#"\L\Q\U"#; isnt($@, "", "repeating inner modifiers is an error (\\L\\Q\\U)");
eval q#"\L\u\L"#; isnt($@, "", "repeating inner modifiers is an error (\\L\\u\\L)");  # special
eval q#"\L\l\L"#; isnt($@, "", "repeating inner modifiers is an error (\\L\\l\\L)");
eval q#"\L\Q\L"#; isnt($@, "", "repeating inner modifiers is an error (\\L\\Q\\L)");
eval q#"\L\u\F"#; isnt($@, "", "repeating inner modifiers is an error (\\L\\u\\F)");  # special
eval q#"\L\l\F"#; isnt($@, "", "repeating inner modifiers is an error (\\L\\l\\F)");
eval q#"\L\Q\F"#; isnt($@, "", "repeating inner modifiers is an error (\\L\\Q\\F)");
eval q#"\F\u\U"#; isnt($@, "", "repeating inner modifiers is an error (\\F\\u\\U)");
eval q#"\F\l\U"#; isnt($@, "", "repeating inner modifiers is an error (\\F\\l\\U)");
eval q#"\F\Q\U"#; isnt($@, "", "repeating inner modifiers is an error (\\F\\Q\\U)");
eval q#"\F\u\L"#; isnt($@, "", "repeating inner modifiers is an error (\\F\\u\\L)");
eval q#"\F\l\L"#; isnt($@, "", "repeating inner modifiers is an error (\\F\\l\\L)");
eval q#"\F\Q\L"#; isnt($@, "", "repeating inner modifiers is an error (\\F\\Q\\L)");
eval q#"\F\u\F"#; isnt($@, "", "repeating inner modifiers is an error (\\F\\u\\F)");
eval q#"\F\l\F"#; isnt($@, "", "repeating inner modifiers is an error (\\F\\l\\F)");
eval q#"\F\Q\F"#; isnt($@, "", "repeating inner modifiers is an error (\\F\\Q\\F)");
# For the special cases the error message is a bit misleading:
# - the parser changes "\U\l" into "\l\U" so the pattern becomes "\l\U\U" and it shows
#   that in the error instead of the original pattern ("\U\l\U")
# - the parser changes "\L\u" into "\u\L" so the pattern becomes "\u\L\L" and it shows
#   that in the error instead of the original pattern ("\L\u\L")
eval q#"\U\l\U"#; like($@, qr/\\l\\U\\U/, "check (misleading) error message (\\U\\l\\U)");
eval q#"\U\l\L"#; like($@, qr/\\l\\U\\L/, "check (misleading) error message (\\U\\l\\L)");
eval q#"\U\l\F"#; like($@, qr/\\l\\U\\F/, "check (misleading) error message (\\U\\l\\F)");
eval q#"\L\u\U"#; like($@, qr/\\u\\L\\U/, "check (misleading) error message (\\L\\u\\U)");
eval q#"\L\u\L"#; like($@, qr/\\u\\L\\L/, "check (misleading) error message (\\L\\u\\L)");
eval q#"\L\u\F"#; like($@, qr/\\u\\L\\F/, "check (misleading) error message (\\L\\u\\F)");


# A cancelled repeating case modifier is not an error
eval q#"\U\U\E"#; is($@, "", "cancelled repeated inner modifiers not an error (\\U\\U\\E)");
eval q#"\U\L\E"#; is($@, "", "cancelled repeated inner modifiers not an error (\\U\\L\\E)");
eval q#"\U\F\E"#; is($@, "", "cancelled repeated inner modifiers not an error (\\U\\F\\E)");
eval q#"\L\U\E"#; is($@, "", "cancelled repeated inner modifiers not an error (\\L\\U\\E)");
eval q#"\L\L\E"#; is($@, "", "cancelled repeated inner modifiers not an error (\\L\\L\\E)");
eval q#"\L\F\E"#; is($@, "", "cancelled repeated inner modifiers not an error (\\L\\F\\E)");
eval q#"\F\U\E"#; is($@, "", "cancelled repeated inner modifiers not an error (\\F\\U\\E)");
eval q#"\F\L\E"#; is($@, "", "cancelled repeated inner modifiers not an error (\\F\\L\\E)");
eval q#"\F\F\E"#; is($@, "", "cancelled repeated inner modifiers not an error (\\F\\F\\E)");

# Cancelling an outer modifier resulting in a repeated inner modifier is an error but with exceptions
eval q#"\U\u\E\U"#; isnt($@, "", "cancelled outer modifier resulting in a repeated inner modifier is an error (\\U\\u\\E\\U)");
eval q#"\U\u\E\L"#; isnt($@, "", "cancelled outer modifier resulting in a repeated inner modifier is an error (\\U\\u\\E\\L)");
eval q#"\U\u\E\F"#; isnt($@, "", "cancelled outer modifier resulting in a repeated inner modifier is an error (\\U\\u\\E\\F)");
eval q#"\L\l\E\U"#; isnt($@, "", "cancelled outer modifier resulting in a repeated inner modifier is an error (\\L\\l\\E\\U)");
eval q#"\L\l\E\L"#; isnt($@, "", "cancelled outer modifier resulting in a repeated inner modifier is an error (\\L\\l\\E\\L)");
eval q#"\L\l\E\F"#; isnt($@, "", "cancelled outer modifier resulting in a repeated inner modifier is an error (\\L\\l\\E\\F)");
eval q#"\F\u\E\U"#; isnt($@, "", "cancelled outer modifier resulting in a repeated inner modifier is an error (\\F\\u\\E\\U)");
eval q#"\F\u\E\L"#; isnt($@, "", "cancelled outer modifier resulting in a repeated inner modifier is an error (\\F\\u\\E\\L)");
eval q#"\F\u\E\F"#; isnt($@, "", "cancelled outer modifier resulting in a repeated inner modifier is an error (\\F\\u\\E\\F)");
eval q#"\F\l\E\U"#; isnt($@, "", "cancelled outer modifier resulting in a repeated inner modifier is an error (\\F\\l\\E\\U)");
eval q#"\F\l\E\L"#; isnt($@, "", "cancelled outer modifier resulting in a repeated inner modifier is an error (\\F\\l\\E\\L)");
eval q#"\F\l\E\F"#; isnt($@, "", "cancelled outer modifier resulting in a repeated inner modifier is an error (\\F\\l\\E\\F)");
# Exceptions:
# - parser turns '\U\l' into '\l\U' so the pattern '\U\l\E\U' becomes '\l\U\E\U', which makes the final pattern '\l\U', which is
#   not an error.
# - parser turns '\L\u' into '\u\L' so the pattern '\L\u\E\L' becomes '\u\L\E\L', which makes the final pattern '\u\L', which is
#   not an error.
eval q#"\U\l\E\U"#; is($@, "", "special case: cancelled outer modifier not an error (\\U\\l\\E\\U)");
eval q#"\U\l\E\L"#; is($@, "", "special case: cancelled outer modifier not an error (\\U\\l\\E\\L)");
eval q#"\U\l\E\F"#; is($@, "", "special case: cancelled outer modifier not an error (\\U\\l\\E\\F)");
eval q#"\L\u\E\U"#; is($@, "", "special case: cancelled outer modifier not an error (\\L\\u\\E\\U)");
eval q#"\L\u\E\L"#; is($@, "", "special case: cancelled outer modifier not an error (\\L\\u\\E\\L)");
eval q#"\L\u\E\F"#; is($@, "", "special case: cancelled outer modifier not an error (\\L\\u\\E\\F)");
is("aa\U\l\E\UbBcC", "aa" . lcfirst(uc("bBcC")), "special case: cancelled outer modifier (\\U\\l\\E\\U");
is("aa\U\l\E\LbBcC", "aa" . lcfirst(lc("bBcC")), "special case: cancelled outer modifier (\\U\\l\\E\\L");
is("aa\U\l\E\FbBcC", "aa" . lcfirst(fc("bBcC")), "special case: cancelled outer modifier (\\U\\l\\E\\F");
is("aa\L\u\E\UbBcC", "aa" . ucfirst(uc("bBcC")), "special case: cancelled outer modifier (\\L\\u\\E\\U");
is("aa\L\u\E\LbBcC", "aa" . ucfirst(lc("bBcC")), "special case: cancelled outer modifier (\\L\\u\\E\\L");
is("aa\L\u\E\FbBcC", "aa" . ucfirst(fc("bBcC")), "special case: cancelled outer modifier (\\L\\u\\E\\F");


# escape sequences take precedence over \U, \L, \F (and \u, \l but can't think of a way to test those)
# \t
is("aa\UbB\tcC", "aa" . uc("bB\tcC"), "\\t applied before case modifiers (..\\U..\\t..");
is("aa\LbB\tcC", "aa" . lc("bB\tcC"), "\\t applied before case modifiers (..\\L..\\t..");
is("aa\FbB\tcC", "aa" . fc("bB\tcC"), "\\t applied before case modifiers (..\\F..\\t..");
# \n
is("aa\UbB\ncC", "aa" . uc("bB\ncC"), "\\n applied before case modifiers (..\\U..\\n..");
is("aa\LbB\ncC", "aa" . lc("bB\ncC"), "\\n applied before case modifiers (..\\L..\\n..");
is("aa\FbB\ncC", "aa" . fc("bB\ncC"), "\\n applied before case modifiers (..\\F..\\n..");
# \r
is("aa\UbB\rcC", "aa" . uc("bB\rcC"), "\\r applied before case modifiers (..\\U..\\r..");
is("aa\LbB\rcC", "aa" . lc("bB\rcC"), "\\r applied before case modifiers (..\\L..\\r..");
is("aa\FbB\rcC", "aa" . fc("bB\rcC"), "\\r applied before case modifiers (..\\F..\\r..");
# \f
is("aa\UbB\fcC", "aa" . uc("bB\fcC"), "\\f applied before case modifiers (..\\U..\\f..");
is("aa\LbB\fcC", "aa" . lc("bB\fcC"), "\\f applied before case modifiers (..\\L..\\f..");
is("aa\FbB\fcC", "aa" . fc("bB\fcC"), "\\f applied before case modifiers (..\\F..\\f..");
# \b
is("aa\UbB\bcC", "aa" . uc("bB\bcC"), "\\b applied before case modifiers (..\\U..\\b..");
is("aa\LbB\bcC", "aa" . lc("bB\bcC"), "\\b applied before case modifiers (..\\L..\\b..");
is("aa\FbB\bcC", "aa" . fc("bB\bcC"), "\\b applied before case modifiers (..\\F..\\b..");
# \a
is("aa\UbB\acC", "aa" . uc("bB\acC"), "\\a applied before case modifiers (..\\U..\\a..");
is("aa\LbB\acC", "aa" . lc("bB\acC"), "\\a applied before case modifiers (..\\L..\\a..");
is("aa\FbB\acC", "aa" . fc("bB\acC"), "\\a applied before case modifiers (..\\F..\\a..");
# \e
is("aa\UbB\ecC", "aa" . uc("bB\ecC"), "\\e applied before case modifiers (..\\U..\\e..");
is("aa\LbB\ecC", "aa" . lc("bB\ecC"), "\\e applied before case modifiers (..\\L..\\e..");
is("aa\FbB\ecC", "aa" . fc("bB\ecC"), "\\e applied before case modifiers (..\\F..\\e..");
# \x{..}
is("aa\UbB\x{61}cC", "aa" . uc("bB\x{61}cC"), "\\x{..} applied before case modifiers (..\\U..\\x{..}..)");
is("aa\LbB\x{41}cC", "aa" . lc("bB\x{41}cC"), "\\x{..} applied before case modifiers (..\\L..\\x{..}..)");
is("aa\FbB\x{41}cC", "aa" . fc("bB\x{41}cC"), "\\x{..} applied before case modifiers (..\\F..\\x{..}..)");
# \x..
is("aa\UbB\x61cC", "aa" . uc("bB\x61cC"), "\\x.. applied before case modifiers (..\\U..\\x61..)");
is("aa\LbB\x41cC", "aa" . lc("bB\x41cC"), "\\x.. applied before case modifiers (..\\L..\\x41..)");
is("aa\FbB\x41cC", "aa" . fc("bB\x41cC"), "\\x.. applied before case modifiers (..\\F..\\x41..)");
# \N{..}
is("aa\UbB\N{LATIN SMALL LETTER A}cC", "aa" . uc("bB\N{LATIN SMALL LETTER A}cC"), "\\N{..} applied before case modifiers (..\\U..\\N{..}..)");
is("aa\LbB\N{LATIN CAPITAL LETTER A}cC", "aa" . lc("bB\N{LATIN CAPITAL LETTER A}cC"), "\\N{..} applied before case modifiers (..\\L..\\N{..}..)");
is("aa\FbB\N{LATIN CAPITAL LETTER A}cC", "aa" . fc("bB\N{LATIN CAPITAL LETTER A}cC"), "\\N{..} applied before case modifiers (..\\F..\\N{..}..)");
# \N{U+....}
is("aa\UbB\N{U+0061}cC", "aa" . uc("bB\N{U+0061}cC"), "\\N{U+....} applied before case modifiers (..\\U..\\N{U+....}..)");
is("aa\LbB\N{U+0041}cC", "aa" . lc("bB\N{U+0041}cC"), "\\N{U+....} applied before case modifiers (..\\L..\\N{U+....}..)");
is("aa\FbB\N{U+0041}cC", "aa" . fc("bB\N{U+0041}cC"), "\\N{U+....} applied before case modifiers (..\\F..\\N{U+....}..)");
# \c
is("aa\UbB\cbcC", "aa" . uc("bB\cbcC"), "\\c. applied before case modifiers (..\\U..\\c...)");
is("aa\LbB\cbcC", "aa" . lc("bB\cbcC"), "\\c. applied before case modifiers (..\\L..\\c...)");
is("aa\FbB\cbcC", "aa" . fc("bB\cbcC"), "\\c. applied before case modifiers (..\\F..\\c...)");
is("aa\UbB\cBcC", "aa" . uc("bB\cBcC"), "\\c. applied before case modifiers (..\\U..\\c...)");
is("aa\LbB\cBcC", "aa" . lc("bB\cBcC"), "\\c. applied before case modifiers (..\\L..\\c...)");
is("aa\FbB\cBcC", "aa" . fc("bB\cBcC"), "\\c. applied before case modifiers (..\\F..\\c...)");
# \o{.....}
is("aa\UbB\o{23072}cC", "aa" . uc("bB\o{23072}cC"), "\\o{...} applied before case modifiers (..\\U..\\o{...}..)");
is("aa\LbB\o{23072}cC", "aa" . lc("bB\o{23072}cC"), "\\o{...} applied before case modifiers (..\\L..\\o{...}..)");
is("aa\FbB\o{23072}cC", "aa" . fc("bB\o{23072}cC"), "\\o{...} applied before case modifiers (..\\F..\\o{...}..)");
# \... (octal)
is("aa\UbB\141cC", "aa" . uc("bB\141cC"), "\\... (octal) applied before case modifiers (..\\U..\\.....)");
is("aa\LbB\101cC", "aa" . lc("bB\101cC"), "\\... (octal) applied before case modifiers (..\\L..\\.....)");
is("aa\FbB\101cC", "aa" . fc("bB\101cC"), "\\... (octal) applied before case modifiers (..\\F..\\.....)");

# "inner" case modifiers take precedence over "outer" case modifiers but
# with some caveats:
#     - special case exist for "\U\l"
#     - special case exist for "\L\u"
#     - "outer" case modifier is applied before the "inner" case modifier
is("aa\UbB\lCc", "aa" . uc("bB" . lcfirst("Cc")), "inner modifier overrides outer modifier (..\\U..\\l..)");
is("aa\LbB\udD", "aa" . lc("bB" . ucfirst("dD")), "inner modifier overrides outer modifier (..\\L..\\u..)");
is("aa\FbB\l${fc_char}Cc", "aa" . fc("bB" . lcfirst("${fc_char}Cc")), "inner modifier overrides outer modifier (..\\F..\\l..)");
is("aa\FbB\u${fc_char}dD", "aa" . fc("bB" . ucfirst("${fc_char}dD")), "inner modifier overrides outer modifier (..\\F..\\u..)");

# special cases: \U\l and \L\u
is("aa\U\lbB", "aa" . lcfirst(uc("bB")), "inner modifier does not override outer modifier (..\\U\\l..)");
is("aa\L\ubB", "aa" . ucfirst(lc("bB")), "inner modifier does not override outer modifier (..\\L\\u..)");
# not-special: \F\l and \F\u
is("aa\F\l${fc_char}Cc", "aa" . fc(lcfirst("${fc_char}Cc")), "inner modifier overrides outer modifier (..\\F\\l..)");
is("aa\F\u${fc_char}bB", "aa" . fc(ucfirst("${fc_char}bB")), "inner modifier overrides outer modifier (..\\F\\u..)");

# To test that the "outer" case modifier was applied the 'LATIN SMALL LETTER DOTLESS I'
# can be used:
#       "\N{LATIN SMALL LETTER DOTLESS I}"         = "\x{0131}"
#       lc("\N{LATIN SMALL LETTER DOTLESS I}")     = "\x{0131}"
#       fc("\N{LATIN SMALL LETTER DOTLESS I}")     = "\x{0131}"
#       uc("\N{LATIN SMALL LETTER DOTLESS I}")     = "I" (== "\x{49}")
#       lc(uc("\N{LATIN SMALL LETTER DOTLESS I}")) = "i" (== "\x{69}")
#       fc(uc("\N{LATIN SMALL LETTER DOTLESS I}")) = "i" (== "\x{69}")
# In other words:
#       lc("\N{LATIN SMALL LETTER DOTLESS I}") ne lc(uc("\N{LATIN SMALL LETTER DOTLESS I}"))
isnt(lc("\N{LATIN SMALL LETTER DOTLESS I}"), lc(uc("\N{LATIN SMALL LETTER DOTLESS I}")), "lc() not equal to lc(uc()) for 'LATIN SMALL LETTER DOTLESS I'");
isnt(fc("\N{LATIN SMALL LETTER DOTLESS I}"), fc(uc("\N{LATIN SMALL LETTER DOTLESS I}")), "fc() not equal to fc(uc()) for 'LATIN SMALL LETTER DOTLESS I'");
is("aa\LbB\u\N{LATIN SMALL LETTER DOTLESS I}", "aa" . lc("bB" . ucfirst("\N{LATIN SMALL LETTER DOTLESS I}")), "..\\L..\\u.. first converts to uppercase");
is("aa\FbB\u\N{LATIN SMALL LETTER DOTLESS I}", "aa" . fc("bB" . ucfirst("\N{LATIN SMALL LETTER DOTLESS I}")), "..\\F..\\u.. first converts to uppercase");

# There does not appear to be a character where `uc(lc($x)) ne uc($x)` :-(
# -> can't test that "\U..\l.." does a `lcfirst()`


# "inner" case modifiers do not stack
is("aa\UbB\LcC", "aa" . uc("bB") . lc("cC"), "no stacking for inner case modifiers (..\\U..\\L..)");
is("aa\UbB\FcC", "aa" . uc("bB") . fc("cC"), "no stacking for inner case modifiers (..\\U..\\F..)");
is("aa\LbB\UcC", "aa" . lc("bB") . uc("cC"), "no stacking for inner case modifiers (..\\L..\\U..)");
is("aa\LbB\FcC", "aa" . lc("bB") . fc("cC"), "no stacking for inner case modifiers (..\\L..\\F..)");
is("aa\FbB\LcC", "aa" . fc("bB") . lc("cC"), "no stacking for inner case modifiers (..\\F..\\L..)");
is("aa\FbB\UcC", "aa" . fc("bB") . uc("cC"), "no stacking for inner case modifiers (..\\F..\\U..)");

is("aa\UbB\UcC\EdD", "aa" . uc("bB") . uc("cC") . "dD", "no stacking for inner case modifiers (..\\U..\\U..\\E..)");
is("aa\UbB\LcC\EdD", "aa" . uc("bB") . lc("cC") . "dD", "no stacking for inner case modifiers (..\\U..\\L..\\E..)");
is("aa\UbB\FcC\EdD", "aa" . uc("bB") . fc("cC") . "dD", "no stacking for inner case modifiers (..\\U..\\F..\\E..)");
is("aa\LbB\LcC\EdD", "aa" . lc("bB") . lc("cC") . "dD", "no stacking for inner case modifiers (..\\L..\\L..\\E..)");
is("aa\LbB\UcC\EdD", "aa" . lc("bB") . uc("cC") . "dD", "no stacking for inner case modifiers (..\\L..\\U..\\E..)");
is("aa\LbB\FcC\EdD", "aa" . lc("bB") . fc("cC") . "dD", "no stacking for inner case modifiers (..\\L..\\F..\\E..)");
is("aa\FbB\FcC\EdD", "aa" . fc("bB") . fc("cC") . "dD", "no stacking for inner case modifiers (..\\F..\\F..\\E..)");
is("aa\FbB\LcC\EdD", "aa" . fc("bB") . lc("cC") . "dD", "no stacking for inner case modifiers (..\\F..\\L..\\E..)");
is("aa\FbB\UcC\EdD", "aa" . fc("bB") . uc("cC") . "dD", "no stacking for inner case modifiers (..\\F..\\U..\\E..)");


# Quotemeta does stack
is("aa\Q1+2\Q3+4\E5+6\E7+8", "aa" . quotemeta("1+2" . quotemeta("3+4") . "5+6") . "7+8", "stacking quotemeta (..\\Q..\\Q..\\E..\\E..)");

# inner modifier does not end quotemeta
is("aa\Qb+b\Uc+C", "aa" . quotemeta("b+b" . uc("c+C")), "inner modifier not ending quotemeta (..\\Q..\\U..)");
is("aa\Qb+b\Lc+C", "aa" . quotemeta("b+b" . lc("c+C")), "inner modifier not ending quotemeta (..\\Q..\\L..)");
is("aa\Qb+b\Fc+C", "aa" . quotemeta("b+b" . fc("c+C")), "inner modifier not ending quotemeta (..\\Q..\\F..)");

is("aa\Qb+B\Uc+C\Ed+D\Ef+F", "aa" . quotemeta("b+B" . uc("c+C") . "d+D") . "f+F", "\\E ends inner modifier (..\\Q..\\U..\\E..\\E..)");
is("aa\Qb+B\Lc+C\Ed+D\Ef+F", "aa" . quotemeta("b+B" . lc("c+C") . "d+D") . "f+F", "\\E ends inner modifier (..\\Q..\\L..\\E..\\E..)");
is("aa\Qb+B\Fc+C\Ed+D\Ef+F", "aa" . quotemeta("b+B" . fc("c+C") . "d+D") . "f+F", "\\E ends inner modifier (..\\Q..\\F..\\E..\\E..)");

# quotemeta doesn't terminate inner case modifier
is("aa\Ub+B\Qc+C", "aa" . uc("b+B" . quotemeta("c+C")), "quotemeta doesn't terminate inner modifier (..\\U..\\Q..)");
is("aa\Lb+B\Qc+C", "aa" . lc("b+B" . quotemeta("c+C")), "quotemeta doesn't terminate inner modifier (..\\L..\\Q..)");
is("aa\Fb+B\Qc+C", "aa" . fc("b+B" . quotemeta("c+C")), "quotemeta doesn't terminate inner modifier (..\\F..\\Q..)");
is("aa\Ub+B\Qc+C\Ed+D", "aa" . uc("b+B" . quotemeta("c+C") . "d+D"), "\\E ends quotemeta (..\\U..\\Q..\\E..)");
is("aa\Lb+B\Qc+C\Ed+D", "aa" . lc("b+B" . quotemeta("c+C") . "d+D"), "\\E ends quotemeta (..\\L..\\Q..\\E..)");
is("aa\Fb+B\Qc+C\Ed+D", "aa" . fc("b+B" . quotemeta("c+C") . "d+D"), "\\E ends quotemeta (..\\F..\\Q..\\E..)");

# inner modifier ends quotemeta
is("aa\Ub+B\Qc+C\Ud+D", "aa" . uc("b+B" . quotemeta("c+C")) . uc("d+D"), "inner modifier ends quotemeta (..\\U..\\Q..\\U..)");
is("aa\Ub+B\Qc+C\Ld+D", "aa" . uc("b+B" . quotemeta("c+C")) . lc("d+D"), "inner modifier ends quotemeta (..\\U..\\Q..\\L..)");
is("aa\Ub+B\Qc+C\Fd+D", "aa" . uc("b+B" . quotemeta("c+C")) . fc("d+D"), "inner modifier ends quotemeta (..\\U..\\Q..\\F..)");
is("aa\Lb+B\Qc+C\Ud+D", "aa" . lc("b+B" . quotemeta("c+C")) . uc("d+D"), "inner modifier ends quotemeta (..\\L..\\Q..\\U..)");
is("aa\Lb+B\Qc+C\Ld+D", "aa" . lc("b+B" . quotemeta("c+C")) . lc("d+D"), "inner modifier ends quotemeta (..\\L..\\Q..\\L..)");
is("aa\Lb+B\Qc+C\Fd+D", "aa" . lc("b+B" . quotemeta("c+C")) . fc("d+D"), "inner modifier ends quotemeta (..\\L..\\Q..\\F..)");
is("aa\Fb+B\Qc+C\Ud+D", "aa" . fc("b+B" . quotemeta("c+C")) . uc("d+D"), "inner modifier ends quotemeta (..\\F..\\Q..\\U..)");
is("aa\Fb+B\Qc+C\Ld+D", "aa" . fc("b+B" . quotemeta("c+C")) . lc("d+D"), "inner modifier ends quotemeta (..\\F..\\Q..\\L..)");
is("aa\Fb+B\Qc+C\Fd+D", "aa" . fc("b+B" . quotemeta("c+C")) . fc("d+D"), "inner modifier ends quotemeta (..\\F..\\Q..\\F..)");

# empty var is not an error
my $foo = "";
is("aa\U$foo\UbB", "aa" . uc("bB"), "repeating modifier after empty var is not an error (..\\U\$foo\\U..)");
is("aa\U$foo\LbB", "aa" . lc("bB"), "repeating modifier after empty var is not an error (..\\U\$foo\\L..)");
is("aa\U$foo\FbB", "aa" . fc("bB"), "repeating modifier after empty var is not an error (..\\U\$foo\\F..)");
is("aa\L$foo\UbB", "aa" . uc("bB"), "repeating modifier after empty var is not an error (..\\L\$foo\\U..)");
is("aa\L$foo\LbB", "aa" . lc("bB"), "repeating modifier after empty var is not an error (..\\L\$foo\\L..)");
is("aa\L$foo\FbB", "aa" . fc("bB"), "repeating modifier after empty var is not an error (..\\L\$foo\\F..)");
is("aa\F$foo\UbB", "aa" . uc("bB"), "repeating modifier after empty var is not an error (..\\F\$foo\\U..)");
is("aa\F$foo\LbB", "aa" . lc("bB"), "repeating modifier after empty var is not an error (..\\F\$foo\\L..)");
is("aa\F$foo\FbB", "aa" . fc("bB"), "repeating modifier after empty var is not an error (..\\F\$foo\\F..)");

# outer case modifier immediately after inner case modifier not an error
is("aa\U\ubB", "aa" . uc("bB"), "outer modifier after inner modifier not an error (..\\U\\u..)");
is("aa\U\lbB", "aa" . lcfirst(uc("bB")), "outer modifier after inner modifier not an error (..\\U\\l..)"); # special, translated to \l\U
is("aa\L\uCc", "aa" . ucfirst(lc("Cc")), "outer modifier after inner modifier not an error (..\\L\\u..)"); # special, translated to \u\L
is("aa\L\lCc", "aa" . lc("Cc"), "outer modifier after inner modifier not an error (..\\L\\l..)");
is("aa\F\ubB", "aa" . fc("bB"), "outer modifier after inner modifier not an error (..\\F\\u..)");
is("aa\F\lbB", "aa" . fc("bB"), "outer modifier after inner modifier not an error (..\\F\\l..)");


done_testing();

[^1]: This can be tested/seen when testing with the LATIN SMALL LETTER DOTLESS I character:

$ perl -wle '
my $x = "\N{LATIN SMALL LETTER DOTLESS I}";
print "\Lfoo\u$x" eq "\Lfoo$x" ? "True" : "False";
print "\Lfoo\u$x" eq lc("foo$x") ? "True" : "False";
print "\Lfoo\u$x" eq lc("foo" . ucfirst($x)) ? "True" : "False";'
False
False
True

[^2]: Without looking at the code: what happens first is that \U\l is replaced with \l\U, so the string becomes \l\U\E\Ubar, which causes the \E to cancel the first \U and not the \l. This can also be seen in some error messages:

$ perl -e '"\U\l\Ubar"'
syntax error at -e line 1, near "\l\U\U"

=> The code was \U\l\U, but the error shows \l\U\U.
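The reordering described in this footnote can also be observed from inside a script. The sketch below assumes the literal is compiled with a string `eval` so the syntax error ends up in `$@`; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compile the string literal at run time so that the syntax error lands
# in $@ instead of aborting the whole script.
my $source = q{"\U\l\Ubar"};
my $result = eval $source;
my $error  = $@;

print "source: $source\n";
print "error:  $error";
print "mentions reordered sequence: ",
    ($error =~ /\\l\\U\\U/ ? "yes" : "no"), "\n";
```

The test script performs the same check with `like($@, qr/\\l\\U\\U/, ...)`.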

Aug 05 '22 08:08 bram-perl

This is amazing work. Thanks @bram-perl!

Aug 05 '22 09:08 demerphq

I think the decision to focus on double quoted strings is very smart. I do not think the toker should be involved with any of the \L \U or equivalent functions at all inside of regex quoting, and we should not do any toker level code generation from them. The regex parser can handle them all internally and mirror the behavior that we decide for double quoted strings while taking into account the subtleties of the regex engine interpreting escapes differently (or the same).

So a good example is \x{7c}. We define that there is no difference between the double quoted strings "|" and "\x{7c}". But in the regex engine the two are very different: /a|b/ means match "a" or match "b", whereas /a\x{7c}b/ means match the literal string "a|b". So we don't want /\L\x{41}\x{7c}\x{42}/ (which is "A|B" in \x{} notation) turning into /a|b/; it should turn into /a\x{7c}b/, and really the only sane place to make such a discrimination is the regex engine parser itself. \Q might be safe to leave to the toker, but given that the others almost certainly should move to the regex parser, we might as well move \Q too. In fact, moving any of them to the regex parser probably means we have to move all of them, given the way they interact in your doc.
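A small sketch of the string/regex difference described above. The anchored patterns and the sample inputs are illustrative choices; the sketch deliberately leaves out \L and friends inside the pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In a double-quoted string the escape is just another spelling of "|".
print '"|" eq "\x{7c}": ', ("|" eq "\x{7c}" ? "yes" : "no"), "\n";

# In a pattern, a bare | is alternation, while \x{7c} matches a literal "|".
my $alternation = qr/^(?:a|b)$/;
my $literal     = qr/^(?:a\x{7c}b)$/;

for my $input ("a", "b", "a|b") {
    printf "%-3s  /a|b/: %-3s  /a\\x{7c}b/: %s\n",
        $input,
        ($input =~ $alternation ? "yes" : "no"),
        ($input =~ $literal     ? "yes" : "no");
}
```

The two patterns accept disjoint inputs, which is why a case modifier that rewrote \x{7c} into a bare | would change what the regex means.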

So I agree with you that we should consider the issues of these escapes solely from the point of double quoted strings. Afterwards we can make the regex engine do the appropriately equivalent thing when it parses.

UPDATE: An interesting question is what this should match: /\l(A|B)/. Should it be equivalent to /(a|b)/, meaning "lowercase the first thing I match"; or to /(a|B)/, meaning "lowercase the first thing I can match"; or to /(A|B)/, meaning "lowercase the immediately following character" (which does nothing, because '(' has no case)? Should it warn if it is used on a regex metacharacter? In the regex parser we would have the option to do any of these; in the toker many wouldn't be possible at all.

Aug 05 '22 13:08 demerphq

An interesting question is what should this: (...)

It certainly is an interesting question (and it would need a clear answer before the regex implementation is changed) but I would try to leave it out of this issue/this discussion for the time being;

I think it would be better to first get a list of all the quirks, caveats, edge cases, ... that are involved in the double quoted string and then see if some of those can be deprecated (or discouraged or warned about) to make the rules simpler/more straight-forward.

(I do have some simplifications in mind but I'll leave those for a p5p discussion to see if there is an agreement on them)

Aug 05 '22 14:08 bram-perl

but I would try to leave it out of this issue/this discussion for the time being;

Sorry, I didn't mean that to sound like we should discuss that question in this ticket, it was intended as another example of how what is sensible in a double quoted string might not be the sensible thing in a regex, and that the questions for regexes are more complex, so focusing on double quoted strings as this ticket does is really very justified, and should happen before we address regexen.

I.e., I agree totally, let's NOT debate that one here. It's just an example of how the debate is different in the two contexts.

Aug 05 '22 14:08 demerphq

Just for the record the message @bram-perl mentions here has the message id: <CANgJU+Vr4NCpdxDdmcf2-6R2wF0BwrykgH3_ww9tUDqdnwo+mQ@mail.gmail.com>

Jan 29 '23 16:01 demerphq
