Clean interpolation for enumerated character classes
The regular expression syntax allows enumerated character classes, such as <[abc]> for the letters a, b, c, and <-[abc]> for any character other than a, b, or c. However, given a variable such as "my $letters='abc'", it does NOT allow a clean way to interpolate it: <[$letters]> is NOT the same as <[abc]>. Instead it is a class of the punctuation '$' and l, e, t, r, s, with a warning about the repeated "e" and "t".
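To make the current behaviour concrete, here is a small sketch of what the paragraph above describes; the test strings are made up for illustration. Compiling it should print the "Repeated character" warning, and the class matches the characters of the variable's name rather than its value.

```raku
my $letters = 'abc';

# Inside <[ ... ]> the text "$letters" is not interpolated: the class
# contains the characters $, l, e, t, r, s (e and t are repeated,
# which is what triggers the compile-time warning).
my $class = rx/ <[$letters]> /;

say so 'b' ~~ $class;   # False: 'b' comes from the value, not the name
say so '$' ~~ $class;   # True
say so 't' ~~ $class;   # True
```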
The documentation at http://docs.perl6.org/language/regexes#Enumerated_character_classes_and_ranges hints that enumerated character classes create a single-quoted-like context, treating metacharacters literally, but doesn't explicitly say so. There is already a doc issue open to clarify the documentation at https://github.com/perl6/doc/issues/2999
Regardless of the doc resolution, Perl 6 regexes should have a Perl 6-ish way to interpolate values into enumerated character classes.
I have a couple of ideas, and am presenting the simple case of a regex with just the one character class in it. These proposals should also allow + and - for union and difference.
Proposal q- Have quoting constructs as alternative enumerating brackets. An adverb :regex on the construct asks the result to be interpreted as a regex.
For example:
my $sample = 'a-d';
/ <[$sample]> /        # no change, this is a character class with $, s, a, m, p, l, e
/ <q'$sample'> /       # class with $, s, a, m, p, l, e; the "q" lets the user pick the delimiter
/ <qq"$sample"> /      # class with characters a, d, and hyphen -
/ qq:regex"$sample" /  # class with characters a, b, c, and d
/ <q:regex'$sample'> / # class with characters a, d, and hyphen -
Proposal 0: "no change" other than having the existing syntax "just work", if possible. I think not, but the idea is:
/ <$sample> /             # existing interpolation, matches literal string 'a-d'
/ < $sample > /           # class with a, d, and hyphen -. Doesn't work because < > here becomes the quoted-word-alternation construct.
/ < <$sample> > /         # class with a, b, c, and d
/ < $( 'a' ~ '-d' ) > /   # class with characters a, d, and hyphen -
/ < <{ 'a' ~ '-d' }> > /  # class with a, b, c, and d
The problem with Proposal 0 is with the outer angle brackets. I like this concept, just not the exact syntax.
What about:
my @letters = <a b c d e>; say ‘bah’ ~~ /@letters/ # 「b」
This works now.
Alternatively:
my $letters = ‘abcde’; say ‘bah’ ~~ /<{“<[$letters]>”}>/ # 「b」
my @letters = <a b c d e>; say ‘bah’ ~~ /@letters/
# 「b」
This is not a character class and in particular doesn't work with union/intersection to build up a character class, eg /-@letters/ is not the same as <-[a-e]>
my $letters = ‘abcde’; say ‘bah’ ~~ /<{“<[$letters]>”}>/
That's a workaround which is a little "dirty" - you'll have to backslash ']', and possibly '-' depending on how literal you want to interpret the contents - and also doesn't work cleanly for mixing literal and interpolated classes for union/intersection eg
"match what's in letters except c-m" = /<{"<[ " ~ $letters.subst(']','\]') ~ "] - [c-m]>"}>/
that workaround doesn't seem very clean to me.
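Until something cleaner exists, set operations can be approximated with constructs that work today: array interpolation (elements are matched literally, so ] and - need no escaping), alternation for union, and a negative lookahead for difference. This is only a sketch with made-up variable names, and the result is an ordinary regex group rather than a true character class.

```raku
my @letters  = 'ab]-'.comb;     # note the unescaped ']' and '-'
my @excluded = 'abc'.comb;
my @some     = 'abcxyz'.comb;

# Union: one of @letters, or a digit.
say 'x-9]y' ~~ / [ @letters | <[0..9]> ]+ /;        # 「-9]」

# Negation: any character that is not in @excluded.
say 'cabbage' ~~ / [ <!before @excluded> . ]+ /;    # 「ge」

# Difference: what's in @some, except c..m.
say 'cab' ~~ / [ <!before <[c..m]>> @some ]+ /;     # 「ab」
```

The lookahead form costs an extra check per character, so it is a stopgap rather than a replacement for real character class arithmetic.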
Since <+[abc]> is longhand for <[abc]> (as a case of the charclass addition/subtraction), then perhaps <+@foo> (and, naturally, <-@foo> for the inverse) would be most natural. Then you can do the usual charclass math on them too, e.g. <[a..z]+@punct-@excluded>.
@jnthn what if the array has something other than single characters? Like multi character strings or empty strings?
@AlexDaniel Probably the same as for <+foo> if foo matches more than one char: it has to match, but we only advance the cursor by 1 position however much it matches.
Ah, and the empty string always matches.
Have some thoughts about the <+@letters> proposal:
- There's a mismatch between the string in literal character classes, versus the list in the proposal. It feels like an inconsistency to me.
- Based on http://docs.perl6.org/language/regexes#Quoted_lists_are_LTM_matches specifically "Arrays can also be interpolated into a regex to achieve... the longest-match alternation of the list's elements" - I would expect the following to happen with <+@classes> or <-@classes> syntax:
my @consistent_case_hex='ABCDEF1234567890', 'abcdef1234567890';
say so 'dada' ~~ /^<+@consistent_case_hex>*$/; # True
say so 'DADA' ~~ /^<+@consistent_case_hex>*$/; # True
say so 'Dada' ~~ /^<+@consistent_case_hex>*$/; # False
- That proposal doesn't address how to allow interpreting literally vs as a regex, eg does a-f mean the three characters -, a, f or the six characters a b c d e f ... and how can the developer specify which intention?
There's a mismatch between the string in literal character classes, versus the list in the proposal. It feels like an inconsistency to me.
There is no string. More generally, I think various suggestions that have been made are based around a misunderstanding - namely, thinking that a regex is really much like a string, and adding extra things into it is a kind of string concatenation. This is actually true of regexes in most languages, where they are not a true first-class citizen. In Perl 6, they are; regexes are parsed and compiled in the very same pass as the rest of the code. Thus by runtime, when the interpolation we wish to achieve happens, the character class contents have been analyzed and compiled. Things like a..z no longer exist in the string-y sense, for example.
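The compile-time point can be seen with the interpolation forms that already exist outside character classes. A small sketch (the $pattern variable is made up for illustration): a plain $variable is matched as literal text, while <$variable> asks for the string to be compiled as a regex at match time.

```raku
my $pattern = '<[a..f]>+';

# Plain interpolation: looks for the literal text "<[a..f]>+".
say 'fabulous' ~~ / $pattern /;     # Nil

# Angle-bracket interpolation: the string is compiled as a regex at
# match time, so a..f is a range again.
say 'fabulous' ~~ / <$pattern> /;   # 「fab」
```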
my @consistent_case_hex='ABCDEF1234567890', 'abcdef1234567890';
The array elements should be individual graphemes.
That proposal doesn't address how to allow interpreting literally vs as a regex, eg does a-f mean the three characters -, a, f or the six characters a b c d e f ... and how can the developer specify which intention?
It didn't intend to address "as a regex", since ranges are easy in Perl 6 too: <[a..z]> (note Perl 6 regexes don't use - for this) would be <+@('a'..'z')> in the interpolating form (of course, you can build that array elsewhere in the code).
You've given me good ideas for a new proposal, starting from your <+ ... > or <- ... > suggestion. (Oh and thanks for the .. vs - reminder! Think-o there...)
Taking the "Principle of least surprise" - adhering to the goal of reducing special cases - the regex slang already parses the interior of <+ ... > or <- ... > as a character class. (Similar for <:uniprop + ... >)
So I propose that simply, after a character class "+/-" set union or difference op, $identifier be interpreted as a literal list of characters to insert (or remove) - with no need to backslash-escape > or ] or anything else, every character literally interpreted.
In other words:
my $good-letters = 'abcdef';
say "I'm...fabulous" ~~ / <+ $good-letters > + /; # fab
Which is pretty close to what I expected to "just work" when I started down this path a month ago.
Part of me wants 'a..f' to be a special case in the above, but reducing special cases is a goal, and the regex slang already suggests a syntax for that, <+ <$identifier> >:
my $dot-letters = 'a..f';
say "I'm...fabulous" ~~ / <+ <$dot-letters> > + /; # fab
# The next line warns,
# Potential difficulties: Repeated character (.) unexpectedly found in character class- did you mean + <$dot-letters>?
say "I'm...fabulous" ~~ / <+ $dot-letters > + /; # ...fa
The above two modifications make a lot of sense to me. I can also see an orthogonal case for code interpolation in the character class slang: code in $(code) interpolates its result with all characters literal, while <{code}> interpolation creates ranges from "..".
my $dot-lower = 'a..f';
my $hex-rx = rx/ <[0..9] + <{ $dot-lower.uc }> + <$dot-lower> > /; # same as <xdigit>
my $unihex = rx/ <:De + <{ $dot-lower.uc }> + <$dot-lower> > /; # ALL digits and <[A..Fa..f]>
my $other-tx = rx/ <[0..9] + $( $dot-lower.uc ) > /; # warns, same as <[0..9AF.]>
How about that? It keeps conventions from the existing regex slang and seems compatible- those forms currently don't compile.
Modifying my most recent proposal- treating double-dots as a repeated period causing a warning seems confusing. Thus -
my $dot-letters = 'a..f';
say "I'm...fabulous" ~~ / <+ $dot-letters > + /; # fab
and
my $dot-lower = 'a..f';
my $hex-rx = rx/ <[0..9] + $( $dot-lower.uc ) + $dot-lower > /; # same as <xdigit>
This opens an interesting possibility for <+ <$interperet-me> > and <+ <{ ...code here ...}> > - they could now interpret expressions, eg
# contrived example for <[0..9a..f]> from data
my $lower-hex-as-string = '<xdigit> - [A..F]';
my $hex-lower = rx/ <+ <$lower-hex-as-string> > /;
my $hex-digit = rx/ <+ <{ $lower-hex-as-string.uc }> + $hex-lower > /;
Greetings,
How would a new double angle-bracket "< < ... > >" construct with a "+" or "-" embedded within the left two angles affect French users, who might routinely use "« ... »" quotes ("French quotes")?
Would French users find such a construct awkward to use, or awkward to edit?
Thank you.
@jubilatious1 it shouldn't affect anybody. If someone doesn't have < or > on their layout, or if these are too hard to type, they'll have much bigger problems.
@AlexDaniel the French quotes already show up as a construct in the p6doc index:
https://docs.perl6.org/language/regexes#index-entry-regex__%3C%3C-regex_%3E%3E-regex_%C2%AB-regex_%C2%BB
Left and right word boundary:
"<<" matches a left word boundary. ... . ">>" matches a right word boundary. ... . These are both zero-width regex elements. You can also use the variants "«" and "»" .
I'd like to say the "<+<" or "<-<" constructs sound like a nice resolution to the issues Yary has uncovered, but it doesn't feel like a natural progression. In the one case "<< ... >>" denotes a word boundary, while in the second case--addition of a single character to "<+< ... >>" denotes a character class.
I feel that people reading complicated regexes would often run into comprehension issues, especially if they're reading it quickly, or if they are unfamiliar with one (or the other) construct.
Best Regards.
The syntax I gravitate towards is the existing regex interpolation syntax, adapted to work inside the enumerated character classes. Perhaps I should limit this to the first, simplest existing regex interpolation, <+ $interpolate-me>, and ignore the other variations <+ <( ...code here...)> >, <+ <$interperet-me> >, <+ <{ ...code here ...}> >. This is the same as before:
my $dot-letters = 'a..f';
say "I'm...fabulous" ~~ / <+ $dot-letters > + /; # fab
The others are left as room to expand...
So I've been playing around a bit with the last two lines of code you just posted, and it appears that wrapping a properly-formed range in @(...) is sufficient to get the desired behavior (below from the Perl6/Raku REPL):
> put 'a..f';
a..f
> put 'a'..'f';
a b c d e f
> # Create $dot-letters2 since $dot-letters appears malformed
> my $dot-letters2 = 'a'..'f';
"a".."f"
> put $dot-letters2.elems;
6
> put $dot-letters2.^name;
Range
> say "I'm...fabulous" ~~ / @($dot-letters2)+ /; # 「fab」
「fab」
What I'm wondering is whether-or-not you believe there should be a short-cut function for transforming ("enumerating") a collection of characters into a character class. Something that is dramatically simpler than the clever "matching_chars" workaround you posted from Sept. 1st:
https://www.nntp.perl.org/group/perl.perl6.users/2019/09/msg6965.html
Also see the code you and @AlexDaniel posted above on September 4th for similar/other ideas. Maybe something along the lines of renaming your "matching_chars" function as ø (latin-small-letter-o-with-stroke) or ⌀ (diameter sign), to transform the complicated bracketing in the first line of code below into the second?
> my $letters = 'abcde'; say 'balrog' ~~ / <{"<[$letters]>"}>+ /; # 「ba」
「ba」
> my $letters = 'abcde'; say 'balrog' ~~ / ø($letters)+ /; # want 「ba」 but returns Nil
Nil
I agree that @(...) is a viable workaround for a simple case. But @(...) isn't a character class and doesn't work with character class set operations. In fact the initial use I had for this was in the form of <- $letter > - in other words, a character class of everything except for what is in $letter.
Also @(...) is a distraction from the issue that variable interpolation, as documented, has no documented function inside character classes. I'm working towards documenting reasonable semantics for variable interpolation for character classes.
Having other functions to generate a character class regular expression is interesting. It may be useful if they allowed the same set union, difference operations as literal character classes. Regardless, it seems a separate issue from interpolating variables within a character class definition.
How easy is it to modify the Regex grammar? I haven't played around with it, but if it's more or less the same as monkeying around with Raku's grammar, then it should be very possible (I'm not going to say easy, since I haven't looked at it) to create a mock-up of the syntax and initially push it out as a module. If it works well and doesn't cause any problems, then it could be incorporated into core.
Also, to me, if the +@array syntax is used, there should be only one of two interpretations: first grapheme of each element (@array».substr(0,1).unique), or all unique graphemes (eg. @array.join.comb.unique).
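The two readings differ as soon as an element holds more than one grapheme. A quick sketch with a made-up array:

```raku
my @array = 'ab', 'bc', 'a';

# First grapheme of each element, de-duplicated.
say @array».substr(0, 1).unique;   # (a b)

# Every grapheme of every element, de-duplicated.
say @array.join.comb.unique;       # (a b c)
```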
Where does this issue stand?
Wondering how to combine the <:ASCII> designation with other (select, Unicode) characters to produce a custom character class. Hoping to advance beyond @codesections blog post examples here:
https://www.codesections.com/blog/raku-unicode/
EDIT: my attempt at answering a SO post without this issue being resolved (see last code block at the post below):
https://unix.stackexchange.com/a/758525/227738
Circling back to this again...
@alabamenhu wrote:
Also, to me, if the +@array syntax is used, there should be only one of two interpretations: first grapheme of each element (@array».substr(0,1).unique), or all unique graphemes (eg. @array.join.comb.unique).
Lots of suggestions for using @-sigiled coercion, with-or-without a + sign preceding.
But does that even make sense? A custom character class is by definition a Set, with only unique values allowed (i.e. no duplicates). So technically we're looking for the infix (|), infix ∪ Union of values, which as a meta operator would be written:
[∪]( ... )
or:
[(|)]( ... )
(The second one looks bizarre. But hey, it's Raku).
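As a rough illustration of the Set idea with what exists today (the sample strings and variable names are made up), the union can be computed outside the regex and its keys interpolated as an array. That gives alternation over literal characters, not a real character class:

```raku
# Union of the characters of two strings, as a Set.
my $class = [∪] 'abc'.comb, 'cde'.comb;
say $class.keys.sort.join;        # abcde

# Interpolate the set's members as a literal alternation.
my @chars = $class.keys;
say 'xxbadxx' ~~ / @chars+ /;     # 「bad」
```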
I suggested in an earlier post using either the ø (latin-small-letter-o-with-stroke) or ⌀ (diameter sign) as a new function within the Regex matcher. But maybe (going with the Set idea here) the Empty Set symbol term ∅ might be better?
We already have symbols with special meanings within Regex matchers, so term ∅ (i.e. the Empty Set) could designate a new function to be used only within Regexes. Why this symbol? The reasoning is as follows: Raku disallows null regexes. So an Empty Set can't exist within Raku Regexes. So the symbol term ∅ can be used for something different, like @fecundf's "matching-characters" function. In other words,
Write:
/ ∅($letters)+ /;
Instead of:
/ <{"<[$letters]>"}>+ /;
You could think of a prefix ∅ as "trying to empty the set that follows".
@jnthn
Wait, before we dive too deeply into proposed changes to syntax, I'd like to clarify what behavior the current syntax is supposed to allow.
Earlier, @jnthn was asked how a regex would interpret the proposed <+@foo> syntax if @foo had elements with more than one character; he replied:
Probably the same as for <+foo> if foo matches more than one char: it has to match, but we only advance the cursor by 1 position however much it matches.
But that's not how <+foo> is treated right now (though I haven't checked if it was when @jnthn said that in 2019). Look:
grammar G {
token TOP { <+foo +[A..Z]>}
token foo { foo };
}
say G.subparse('fooBAR'); # OUTPUT: «「foo」»
And changing TOP to { <+[A..Z] +foo> } also matches 「foo」. It looks like foo runs an LTM and returns the full token, not just one character. Is it supposed to just return one? (Is <+foo> even intentionally legal syntax, for that matter? Using tokens within character ranges isn't documented and, if you leave off the supposed-to-be-optional + before foo, Rakudo won't even parse the range.)
If the <+foo> syntax is supposed to produce one-character matches, then I think fixing that bug and documenting that you can use tokens inside of character ranges would go a long way towards resolving this issue without adding any syntax. Instead of the proposed <[a..z] + @punct - @excluded> (with the arrays defined earlier) we'd have <[a..z] + punct - excluded> (with the tokens defined earlier). Is the fact that we don't have that behavior right now just a bug?
@codesections
~ % raku
Welcome to Rakudo™ v2023.05.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2023.05.
To exit type 'exit' or '^D'
[0] > say $/ if "The cat in the hat" ~~ m:g/ <[ a a ]> /;
Potential difficulties:
Repeated character (a) unexpectedly found in character class
------> say $/ if "The cat in the hat" ~~ m:g/ <⏏[ a a ]> /;
(「a」 「a」)
[0] >
@jubilatious1 I'm talking about slightly different syntax: with the token outside the […].
say $/ if "The cat in the hat" ~~ m:g/ <+ident +[a]> /;
(「The」 「cat」 「in」 「the」 「hat」)
(Though note that it doesn't currently work with user-defined tokens outside of a grammar; again, I'm not sure if that's by design or due to a bug. Though if it's by design, the error message sure is LTA.)
@codesections
I appreciate the update but I'm not sure how useful it is to use ident as an example, especially for casual users of Raku who only care about alpha, alnum, digits and _ underscore. There seems even to be an inconsistency here regardless, note the differences below:
[0] > say $/ if "The cat in the hat" ~~ m:g/ <+ ident +[\s]> /;
(「The」 「 」 「cat」 「 」 「in」 「 」 「the」 「 」 「hat」)
[0] >
[0] > say $/ if "The cat in the hat" ~~ m:g/ <+ alpha +[\s]> /;
(「T」 「h」 「e」 「 」 「c」 「a」 「t」 「 」 「i」 「n」 「 」 「t」 「h」 「e」 「 」 「h」 「a」 「t」)
[0] >
How can ident be considered a character-class if no individual characters are ever identified by its use?
@fecundf @2colours
I'm not sure how useful it is to use ident as an example
How can ident be considered a character-class if no individual characters are ever identified by its use?
That's the point of using ident as an example: it's the only (built-in) token that triggers the bug. All of the other tokens match a single character and thus (correctly) consume only one character.
And (per jnthn's 2019 comment), the intended behavior for <+ident> is that it should also only consume one character/advance the match cursor by one place. But that's not what currently happens; currently, <+ident> is consuming all characters that match against <ident> (without the +).
Or, to put it in code, <+ident> should behave like [ . <?ident> ] but currently behaves like <.ident>. Here's an example based on your previous one that shows the difference.
# Current behavior
say $/ if "The cat in the hat" ~~ m:g/ <+ ident > /;
(「The」 「cat」 「in」 「the」 「hat」)
# Correct behavior, as jnthn described
say $/ if "The cat in the hat" ~~ m:g/ <+ ident > /;
(「T」 「c」 「i」 「t」 「h」)
After spending some time in Roast, I've concluded that, unsurprisingly, jnthn was correct. (Though of course this exact behavior isn't tested in Roast, or the bug would've been caught). Roast frequently tests that <+some-regex> works, most clearly in S05-metasyntax/charset and S05-metasyntax/longest-alternative. The only problem is that all of the regexes Roast tests match against a single character so this bug isn't triggered. All of Raku's built-in regexes/character classes also match with just one character, except for <ident>, which is why I used it for this example. But, of course, the syntax can be used with user-defined regexes, which frequently match only against multiple characters. And that's when the bug is triggered.
I'm convinced enough that the current behavior is a bug and that the correct behavior is undocumented that I'm going to open Rakudo, Roast, and Docs issues.
@codesections
While you're about it:
jnthn wrote:
Ah, and the empty string always matches.
What does that mean?
Hmm. Let's assume a character class expression <+foo +bar>.
If foo isn't null, but bar is, then it's the same as foo, so it makes sense that bar matches. But if both foo and bar are null, then that ends up matching a single character. That feels off. Off the top of my head (always dangerous) I think it would be best that, if the overall expression matches zero characters, then the overall character class fails to match, rather than matching some arbitrary character.
What about <+foo -bar>?
It feels like a null string in a character class should just be ignored rather than matching or not matching.
I think I'd read "the empty string always matches" to mean that it matches but not necessarily that it consumes a character. That seems to be the current behavior: '42' ~~ /<+ws + alpha>/ matches but returns an empty match. That seems about right to me.
So maybe it'd be better to say that a successful character class consumes at most one character? (Aside: the Roast tests are in charset.t; I'm starting to wonder if the docs would be better off using the term "character set" rather than "character class": set gets at the semantics better, and "class" is confusingly overloaded.)
Probably a different issue, but it would be nice if the <.alpha> "leading-dot" format could be respected in custom (i.e. enumerated) character classes (currently throws an error, see last example below):
[18] > "The cat in the hat".match(/ <.alpha>+ /);
「The」
[19] > "The cat in the hat".match(/ <alpha>+ /);
「The」
alpha => 「T」
alpha => 「h」
alpha => 「e」
[20] > "The cat in the hat".match(/ <+ alpha + [\s]>+ /);
「The cat in the hat」
[21] > "The cat in the hat".match(/ <+ .alpha + [\s]>+ /);
===SORRY!===
Unrecognized regex metacharacter < (must be quoted to match literally)
------> "The cat in the hat".match(/ <+⏏ .alpha + [\s]>+ /);
Unrecognized regex metacharacter + (must be quoted to match literally)
------> "The cat in the hat".match(/ <+⏏ .alpha + [\s]>+ /);
Unable to parse regex; couldn't find final '/'
------> "The cat in the hat".match(/ <+⏏ .alpha + [\s]>+ /);
[21] >
@raiph wrote:
It feels like a null string in a character class should just be ignored rather than matching or not matching.
It just seems like there should be tests in Roast to confirm the following (are there already?):
<alnum> ~~ <+ alpha + digit>
<alpha> ~~ <+ alnum - digit>
<digit> ~~ <+ alnum - alpha>
#____
<graph> ~~ <+ alnum + punct>
<alnum> ~~ <+ graph - punct>
<punct> ~~ <+ graph - alnum>
#____
<print> ~~ <+ graph + space>
<graph> ~~ <+ print - space>
<space> ~~ <+ print - graph>
#____
etc.
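One way such checks could be written, as a rough sketch rather than actual Roast code: compare two classes codepoint by codepoint and list where they disagree. The disagreements helper and its ASCII-only default range are assumptions made for illustration. No expected output is claimed here, since whether each identity holds is exactly what the tests would establish.

```raku
# Hypothetical helper: return the characters in the given codepoint range
# that one regex matches and the other does not.
sub disagreements(Regex $lhs, Regex $rhs, Range $codepoints = 0..0x7F) {
    $codepoints.map(*.chr).grep: -> $c {
        ($c ~~ $lhs).so != ($c ~~ $rhs).so
    }
}

# Print the Unicode names of any characters on which the two sides differ.
say disagreements(rx/<alnum>/, rx/<+ alpha + digit>/).map(*.uniname);
say disagreements(rx/<graph>/, rx/<+ alnum + punct>/).map(*.uniname);
say disagreements(rx/<print>/, rx/<+ graph + space>/).map(*.uniname);
```

An empty list for a pair means the identity holds over the tested range; widening the range beyond ASCII would be needed before drawing broader conclusions.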