prql
prql copied to clipboard
Regex implementation
An infix operator is probably worthwhile given how often regex matching is used.
Most implementations use a normal function like REGEXP or REGEXP_CONTAINS.
Postgres uses ~=, and !~ for negation. It also allows a trailing * for case insensitivity, like ~*.
Given we use == & != for equality, we could use =~ & !~ for regex? Or possibly the "close but not the same" as postgres is confusing there.
This is the first operator we'll have which will have very different implementations by dialect. I think we can probably do that with a regex(pattern, expr) method on the DialectHandler.
Funny how we are discussing a similar thing in https://github.com/PRQL/prql/issues/1123.
When I first needed LIKE I literally tried company_name =~ "Amzo" and that didn't work. Just like you said, given how equality works in PRQL support for regexp through the =~, !~ and ~* operators makes total sense to me.
Someone brought this up on Discord. I'm planning to implement this soon (unless someone happens to want to take this, in which case lmk...).
It'll need to be in the compiler since we can't discriminate between dialects in the stdlib, and I'll have to find out which DBs use which functions. (The new integration tests will be really helpful for testing!)
I'm planning to implement it as "regex search" rather than "regex match" so "bar" =~ "foobarbaz" passes, and it would be necessary to pass "^bar$" =~ "foobarbaz" to get the equivalent match.
I'll close #1123 in favor of this — once this has been implemented for a while, we can assess whether we need something additional to this such as LIKE.
Update: thanks to @max-sixty we are in process of merging #2458, implementing regex match operator:
from tracks
filter (title ~= 'Love')
This looks very simple, to the point where one could naively assume ~= is only "does_string_contain" operator. This could lead them to try to do this:
from tracks
filter (title ~= 'Love (2)')
# search for tracks that have 2 in parenthesis in the title
... which would (of course) be interpreted as regex and would match Eternal Love 2 but not Eternal Love (2).
In its essence, the problem is that the pattern string contains an internal language with all of its complexities, but looks like plain string-to-string comparison on the surface.
I'm not saying that this is a deal breaker for the operator, but I want a confirmation that we are ok with this possible source of frustration for people matching text.
If instead we had function std.regex_match, that would be much more explicit and would offer more explanation into what's happening under the hood.
PS: SQL has similar problem with LIKE. It has an embedded language with much simpler syntax (and much less expressiveness), but could still confuse a naive user when searching for things like LIKE '5 * 6', since that would match all of these:
'5 * 6''5 / 6''5 + 12 / 6'
Yes I think that's my central issue with the LIKE syntax — it's not sufficiently powerful, nor is it totally simple; it mostly gets just used with %foo%
I agree with the filter (title ~= 'Love (2)') example — it would also be useful to have an in / contains (and maybe a startswith & endswith function).
Do you think it would be safer to start with a function rather than an operator? I generally would think so, though my prior was that regex was general enough that it's worth an operator.
If multiple slightly different pattern matches are implemented in the system, does it matter that it is hard to know what ~= translates to?
For example, DuckDB has LIKE, SIMILAR TO, regexp_full_match, regexp_matches...
The complication is that SIMILAR TO is slightly different from a regular expression.
The SIMILAR TO operator returns true or false depending on whether its pattern matches the given string. It is similar to LIKE, except that it interprets the pattern using a regular expression. Like LIKE, the SIMILAR TO operator succeeds only if its pattern matches the entire string; this is unlike common regular expression behavior where the pattern can match any part of the string.
https://duckdb.org/docs/sql/functions/patternmatching.html#similar-to
Ours is a regex search, which is the case they describe as "common regular expression behavior where the pattern can match any part of the string."
We can get the same behavior as a regex match with a regex search by anchoring the start & end — i.e. ^foo$ rather than foo...