REGEXP_REPLACE replacement ignores backreference \2 (backslash is stripped), only $2 works
Not sure calling this a "bug" is correct, but the behavior differs from Snowflake. In Embucket, backreferences in the replacement argument of REGEXP_REPLACE are not recognized when written as \n. The backslash appears to be consumed by SQL string parsing before reaching the regex engine, so '\2' turns into the literal character '2' (ASCII 50). Using $2 does substitute the second capturing group correctly
How to reproduce:
#basic two-group match
SELECT
REGEXP_REPLACE('ab', '(a)(b)', '\2') AS got_backslash,
REGEXP_REPLACE('ab', '(a)(b)', '$2') AS got_dollar;
Expected got_backslash = 'b' (substitute group 2), got_dollar = 'b'.
Actual (Embucket) got_backslash = '2' (literally the digit), got_dollar = 'b'.
#confirm the actual character produced by '\2'
SELECT ASCII(REGEXP_REPLACE('ab', '(a)(b)', '\2')) AS codepoint_backslash;
Actual (Embucket) codepoint_backslash = 50 (digit '2'), indicating the backslash was stripped before regex replacement
#real-world style pattern (domain extraction)
SELECT
REGEXP_REPLACE('http://example.com/path',
'^https?://(www\\.)?([^/]+)/.*$',
'\2') AS got_backslash,
REGEXP_REPLACE('http://example.com/path',
'^https?://(www\\.)?([^/]+)/.*$',
'$2') AS got_dollar;
Expected:
\x02 $2
Actual:
2 example.com
Impact on clickbench results: Ported queries that rely on \n backreferences (for example query 29 in clickbench produce incorrect grouping keys (e.g., many rows collapse under '2' instead of the intended capture), corrupting aggregates.
Connected to #1504 & #1569