embucket-labs icon indicating copy to clipboard operation
embucket-labs copied to clipboard

REGEXP_REPLACE replacement ignores backreference \2 (backslash is stripped), only $2 works

Open YevheniiNiestierov opened this issue 2 months ago • 1 comments

Not sure calling this a "bug" is correct, but the behavior differs from Snowflake. In Embucket, backreferences in the replacement argument of REGEXP_REPLACE are not recognized when written as \n. The backslash appears to be consumed by SQL string parsing before reaching the regex engine, so '\2' turns into the literal character '2' (ASCII 50). Using $2 does substitute the second capturing group correctly

How to reproduce:

#basic two-group match
SELECT
  REGEXP_REPLACE('ab', '(a)(b)', '\2') AS got_backslash,
  REGEXP_REPLACE('ab', '(a)(b)', '$2') AS got_dollar;

Expected got_backslash = 'b' (substitute group 2), got_dollar = 'b'.

Actual (Embucket) got_backslash = '2' (literally the digit), got_dollar = 'b'.

#confirm the actual character produced by '\2'
SELECT ASCII(REGEXP_REPLACE('ab', '(a)(b)', '\2')) AS codepoint_backslash;

Actual (Embucket) codepoint_backslash = 50 (digit '2'), indicating the backslash was stripped before regex replacement

#real-world style pattern (domain extraction)
SELECT
  REGEXP_REPLACE('http://example.com/path',
                 '^https?://(www\\.)?([^/]+)/.*$',
                 '\2') AS got_backslash,
  REGEXP_REPLACE('http://example.com/path',
                 '^https?://(www\\.)?([^/]+)/.*$',
                 '$2') AS got_dollar;

Expected: \x02 $2 Actual: 2 example.com

Impact on clickbench results: Ported queries that rely on \n backreferences (for example query 29 in clickbench produce incorrect grouping keys (e.g., many rows collapse under '2' instead of the intended capture), corrupting aggregates.

YevheniiNiestierov avatar Oct 19 '25 14:10 YevheniiNiestierov

Connected to #1504 & #1569

DanCodedThis avatar Oct 20 '25 12:10 DanCodedThis