problem-solving
The fate of P5 regexes in a RakuAST world
We're getting to a point where we need to decide as a community, or as the RSC, or as the pumpkin, what to do with the :P5 adverb on regexes.
The Perl to Raku guide states:
Please note that the Perl regular expression syntax dates from many years ago and may lack features that have been added since the beginning of the Raku project. Two such features not yet implemented in Raku for the P5 syntax are the Perl Unicode property matchers \p{} and \P{}.
I'm pretty sure that comment is already a few years old and more features have been added to Perl regexes since then.
Almost 4 years ago it was already suggested to remove them altogether.
Now that we are moving to RakuAST, we need to make a significant effort to support :P5 just to provide at least the limited capabilities it has now. A further complicating factor may be that the :P5 slang is actually part of NQP.
As already established, there is not a lot of code in the ecosystem that is actually using the :P5 adverb. And if it's used, it should be relatively straightforward to provide patches to make them true Raku regexes (as the :P5 regexes are a subset of capabilities offered by Raku regexes).
On the other hand, we see developments like PolyglotRegexen, which attempts to provide other regex slangs in Raku.
So it feels like it would make more sense to spend time on creating a public API for regex slangs to register as a quoting language slang, to remove the :P5 tests from roast, and to leave porting of the Perl regexen to the community (which could then possibly provide a quoting slang depending on Inline::Perl5 and thus become on par with Perl itself with respect to regexes).
I agree with the direction suggested in your last paragraph.
Long story short, I also lean towards giving up on this and letting user space handle it.
I have admittedly not spent much time researching the parts of a regex system. Anyway, a reminder: P5 regexes are supposedly not only about a different syntax. First, some tokens have subtly different meanings. Second, and even more importantly, capture groups are completely flat and capture only one occurrence instead of producing lists when necessary.
I don't know if P5 regexes ever worked on those principles but even if they did, I see two cases: it was either just a slow wrapper around the Raku semantics, or it really needed some custom implementation. In the former case, a module can do that just as much; in the latter case, is it really worth spending development cycles on that, if it is really barely used?
(On the other hand, there used to be crippling backlash regarding any sort of change to the language that isn't essentially an extension. It would be good to know what has changed since, or how to mitigate this tension.)
So as the Polyglot::Regexen dev, I should be able to create a Perl slang (no need to call it 5 at this point, especially since 7 is likely looming) without too much trouble. One of the primary issues might be /e or /ee, but I think that's a fairly understandable limitation (or support it with Raku in the code block). The difference in captures (e.g. matching 'abcdef' with /(..)*/ results in 'ef' for $1, not ('ab', 'cd', 'ef')) is fairly easy to handle (I already handle it for ECMA262), and perhaps could be made even more so if it were easy to generate a custom Match class (maybe a role Matchable?).
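To make the capture difference concrete, here is a minimal sketch in plain Raku. It shows only the native Raku behaviour and one possible way to derive the Perl-style flat value from it; it is not how Polyglot::Regexen does it internally. Note that Raku numbers captures from $0, where Perl starts at $1.

```raku
my $m = 'abcdef' ~~ / (..)* /;

# Raku semantics: a quantified capture holds one Match per iteration.
my @all = $m[0].map(*.Str);
say @all;          # [ab cd ef]

# Perl-style flat view: keep only the final iteration of the group.
my $last = $m[0].tail.Str;
say $last;         # ef
```

A slang that wants Perl semantics could apply this kind of flattening when it builds its match result, which is where a custom Match class would come in handy.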
OTOH, it might not be bad to include PCRE as an equivalent mode. This could be made to generically support the most common flavor of regex. One thing I think @codesections mentioned at my TPRC talk was that slangs, while powerful, can make syntax highlighting complicated. The reality is that PCRE regexen are so dense that syntax highlighting is really useful. I've thought of a few solutions that might work for a more involved editor like Comma, but I'm not sure yet how it would be handled in other highlighters. (Obligatory remark from Larry's roundtable session that good languages don't need highlighting.)
If a pluggable system is enabled for regex in core, I think it'd be good to try to similarly make other DSLs more pluggable (most of the work would be the same, after all).
Glad to see it, so that the Rakudo core can be a language without any historical burden to maintain. If really needed, a module can be used to support these, so newcomers need to care about the modern features only.
There's plenty of torture left for the implementors without P5 remaining in core. Let's distribute our limited resources on more impactful and popular features (slangs in the general case and macros come to mind).
That said, a PCRE mode in core could be a useful feature, if it comes cheap. But a "sort of but not really" funhouse mirror of Perl's regex engine seems beyond superfluous at this point.
That said, a PCRE mode in core could be a useful feature, if it comes cheap. But a "sort of but not really" funhouse mirror of Perl's regex engine seems beyond superfluous at this point.
I think that can be done. There's nothing that PCRE can do that Raku can't, AFAICT. The only real differences are capture counting and internal regex references (where you insert a named/numbered match as the regex, instead of the result of that one), which aren't insurmountable, although they may be a bit slow because of some additional code blocks in a fairly naïve implementation. (And they may be something we'd want to look at adding to Raku regexes.)
a "sort of but not really" funhouse mirror of Perl's regex engine seems beyond superfluous at this point.
I agree.
With apologies if the following is a thread hijack, but I've been pondering and staying quiet about closely related issues for over a decade. Please forgive me.
Since shortly after 2000 I've been pondering possible Raku killer apps once Raku(do) became sufficiently mature. By the time of the first official release about 7 years ago I tentatively settled on the idea of a Rakudo that's bundled with support for multiple regex engines as follows.
First, the implementation and language model would be the one established by the Inlines.
From an implementation perspective it's important to know that Stefan Seifert had his first version of what became Inline::Perl5 working in less than 24 hours, despite it being his first ever Raku program (the link is to one of my favorite lightning talks ever, just 3 minutes long). As he shows, if there's a C library version of a language implementation then it's possible to get things working with about a dozen lines of Raku code. The same will be true of regex engines -- and there's a C library version of all of them.
From a language perspective it's important to know that Stefan Seifert just started simple and then made things ever smoother. He did so by letting usage by users (including himself) drive the process, and leveraging (and developing) Raku(do)'s nature as a language gluing platform.
I may be oversimplifying, but I would have thought that, especially once RakuAST becomes sufficiently mature (it may already be), it would be relatively simple for someone to create a module that introduces a new :re adverb for the m/.../ statement. Maybe a very simple first cut/pretotype couldn't be done in a day like Stefan pulled off -- or maybe it could by someone with the right skills? Do you see what I see or am I just imagining things without a reasonable basis in reality?
Let me briefly sketch out what I'm hoping could be done in a day or three:
- Establish a "hello world" test. Let's say it's: $_ = 'hello world'; m :re('pcre') /.*\s.*/; say $/; # ｢hello world｣
- Write Raku code that allows an installed PCRE to be called. (Do the simplest thing that works. Look at what Stefan did in his video I linked above.)
- Create a slang that adds a :re adverb that arranges for the RakuAST generated by the m :re('pcre') /.*\s.*/; to be whatever is needed to use a compatible regular expressions engine, with the following sketching out how I think that could work. First, hardcode as if $_ was bound to the C string 'hello world'. Second, hardcode as if the regex code was the C string '.*\s.*'. Third, invoke an installed PCRE, passing the two C strings. Fourth, marshal the C string returned by the PCRE engine call into a Raku Str. Fifth, generate a Match object with that Str as its capture. Sixth, bind that to Raku's $/ variable.
- Debug until the "hello world" test works.
- Let Rakoons know what you've done.
I'm confident some Rakoons know how to do this, and I'd be delighted to help anyone who's friendly. I consider my strengths to be things like writing technical doc, testing, and marketing.
If this thing comes to life it would need a name. I was tempted to not mention one because I don't want any discussion of this idea to be name bike shed painted into oblivion. But I can't ignore the risk. So instead I'll ask that anyone else proposing a name leaves that until the end of any comment they write that will hopefully be about something of substance before the name. And to establish a strawdog proposal (a likely poor name that any suggestion ought to aspire to be better than), I'll suggest credo, a portmanteau of a cre acronym for "compatible regular expressions", and the do bit of Rakudo.
I may be oversimplifying, but I would have thought that, especially once RakuAST becomes sufficiently mature (it may already be), it would be relatively simple for someone to create a module that introduces a new :re adverb for the m/.../ statement. Maybe a very simple first cut/pretotype couldn't be done in a day like Stefan pulled off -- or maybe it could by someone with the right skills? Do you see what I see or am I just imagining things without a reasonable basis in reality?
Very reasonable. See Polyglot::Regexen. This is basically already available for ECMA262/JavaScript, but I've begun work on Perl as of this weekend. The JS stuff already even works within grammars.
I haven't yet tried plugging in with an adverb, but it shouldn't be too difficult to do if that's the preferred access manner long term. We could even make the adverb :flavor<Perl> or similar so we don't overly consume potential short names.
See Polyglot::Regexen.
I took a look, but that seems to be analogous to Tobias Leich's approach with his v5 work rather than to Stefan Seifert's approach with his Inline work. Tobias worked extremely hard on v5 for a couple of years and made substantial progress, but it ultimately became clear that Stefan's approach was the way to go as the primary strategy.
I'm hopeful you're open to at least thinking about switching strategy to lean on Stefan's approach as the primary one, and Tobias's as the secondary approach when the primary approach isn't viable (because the given regex engine isn't installable on a system).
My interpretation of Stefan's work is that it fills a different niche in the Raku ecosystem.
Sure, their work was yet another nail in the coffin of the use v5 work, but that project was doomed for other reasons as well:
- Perl being a moving target: which version of Perl to support?
- Perl is more than syntax: it is also a large library of core and CPAN modules
- Most Perl code isn't portable anyway, because use v5 would never support XS, and most, if not all, Perl programs are (in)directly dependent on XS
The second point I tried to work on with the CPAN Butterfly Plan, but that never got much traction.
To come back to the point: I see Tobias's approach and Matthew's approach as being as valuable as Stefan's approach. They're on different axes, and all of them are extremely valuable.