zed icon indicating copy to clipboard operation
zed copied to clipboard

SQL: Postgres-compatible REGEXP_REPLACE function

Open philrz opened this issue 1 year ago • 0 comments

The REGEXP_REPLACE function is called the same way in ClickBench query 28 with Postgres, DuckDB, and ClickHouse. However, if we attempt the same call with SuperDB's REGEXP_REPLACE function it currently causes a parse error and significant adjustments need to be made to get it to return the same query result.

Strictly speaking, it appears REGEXP_REPLACE is not a formal part of the SQL spec. However, we've got a general goal to make SuperDB as Postgres-compatible as we can so users can see it as a viable drop-in replacement, such as for running their existing BI queries. I can see from the Postgres docs that their parameters cover even more ground than what's used by this query. Below I'll share some details about my experience as a user getting this one to work.

Details

Repro is with super commit 1d783cc.

To simplify, I'll use the attached referer.csv test data:

Referer
https://go.mail/folder-1/online/ru-en/#lingvo/#1О 50000&price_ashka/rav4/page=/check.xml
http://tambov.irr.ru/cgi-bin/news/?page.aspx
http://state=19945206/foto-4/login=Ivan_contazhigulevaolgdruzhy
http://www.circles/3973131509006588934

Here's DuckDB returning the expected result using the same REGEXP_REPLACE call as in the ClickBench query 28.

$ duckdb --version
v1.1.3 19864453f7

$ duckdb -c "SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') FROM 'referer.csv'"
┌──────────────────────────────────────────────────────────────────┐
│ regexp_replace(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') │
│                             varchar                              │
├──────────────────────────────────────────────────────────────────┤
│ go.mail                                                          │
│ tambov.irr.ru                                                    │
│ state=19945206                                                   │
│ circles                                                          │
└──────────────────────────────────────────────────────────────────┘

And here's the parse error from SuperDB if I attempt to execute the same query.

$ super -version
Version: v1.18.0-215-g1d783cc2

$ super -c "SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') FROM 'referer.csv' FORMAT csv"
parse error at line 1, column 50:
SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') FROM 'referer.csv' FORMAT csv
                                             === ^ ===

One way I got it to work is by surrounding the regexp in /, but this required:

  1. Adding a \ before each attempt to match a literal / inside the regexp
  2. Changing the reference to the substring match of the the parenthesized subexpression from \1 to \$1
$ super -c "SELECT REGEXP_REPLACE(Referer, /^https?:\/\/(?:www\.)?([^\/]+)\/.*$/, '\$1') FROM 'referer.csv' FORMAT csv"
{regexp_replace:"go.mail"}
{regexp_replace:"tambov.irr.ru"}
{regexp_replace:"state=19945206"}
{regexp_replace:"circles"}

I also was able to get it working by keeping the regexp as a string, but this required adding a \\ escape at the location of the parse error in addition to still making the \1 to \$1 change.

$ super -c "SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\\\.)?([^/]+)/.*$', '\$1') FROM 'referer.csv' FORMAT csv"
{regexp_replace:"go.mail"}
{regexp_replace:"tambov.irr.ru"}
{regexp_replace:"state=19945206"}
{regexp_replace:"circles"}

I don't claim perfect knowledge as to the history of the use of / as a regexp delimeter, but Wikipedia notes its common use with sed, Perl, and I'm also familiar with it from JavaScript. Meanwhile, it seems like all the major SQL implementations I can spot show just single-quoted strings. Likewise, it seems like the \1 syntax is consistently favored for referencing a match of a parenthesized subexpression. Therefore I expect we'd want to adopt these conventions also/instead in the pursuit of SQL compatibility.

philrz avatar Jan 09 '25 23:01 philrz