[VL] Results are mismatch with vanilla Spark when use regexp_replace('a{bc', '\\{', '\\[')
Backend
VL (Velox)
Bug description
spark-sql> set spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConstantFolding;
spark-sql> select regexp_replace('a{bc', '\\{', '\\[');
vanilla get
a[bc
gluten get
abc
Spark version
Spark-3.2.x
Spark configurations
No response
System information
Velox System Info v0.0.2 Commit: 3a459abad5af1a8cc268c349bc3813249511768c CMake Version: 3.22.0 System: Linux-3.10.0-862.mt20190308.130.el7.x86_64 Arch: x86_64 CPU Name: Model name: Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz C++ Compiler: /opt/rh/devtoolset-9/root/usr/bin/c++ C++ Compiler Version: 9.3.1 C Compiler: /opt/rh/devtoolset-9/root/usr/bin/cc C Compiler Version: 9.3.1 CMake Prefix Path: /usr/local;/usr;/;/usr/local/cmake;/usr/local;/usr/X11R6;/usr/pkg;/opt
Relevant logs
No response
thanks @kecookier do you will help fix the issue ?
thanks @kecookier do you will help fix the issue ?
Yes, I will work on this issue.
@kecookier I also encountered this issue, the reason is differences in handling \\
present situation
--- Query
with tb as (
select
name
FROM Values
('12234') people (name)
)
select regexp_replace(name, '22', '\\|') c1, 'gluten-spark' c2 from (select /*+ repartition(2) */ * from tb)
union
select regexp_replace(name, '22', '\\|') c1, 'vanilla-spark' c2 from tb
-- Result
c1 c2
1|34 vanilla-spark
134 gluten-spark
Gluten Velox RE2
1. code
#include <iostream>
#include <re2/re2.h>
int main() {
std::string input = "12234";
RE2 pattern("22");
std::string replacement = "\\|";
int replacements = RE2::GlobalReplace(&input, pattern, replacement);
std::cout << "Modified string: " << input << std::endl;
return 0;
}
2. output:
re2/re2.cc:1059: invalid rewrite pattern: \|
Modified string: 134
3. compile:
g++ -o example example.cpp -L/usr/local/lib -lre2 -Wl,-rpath,/usr/local/lib
Spark java.util.regex
1. code
import java.util.regex.Matcher
import java.util.regex.Pattern
val input = "12234"
val pattern: Pattern = Pattern.compile("22")
val replacement = "\\|"
val matcher: Matcher = pattern.matcher(input)
val result = matcher.replaceAll(replacement)
println(s"Modified string: $result")
2. output:
Modified string: 1|34
@wang-zhun Sorry for the late reply, and thank you for the information. I have a work-in-progress PR that will fix this issue.
Posting code where re2 error happens for replacement string with "\\": https://github.com/google/re2/blob/6dcd83d60f7944926bfd308cc13979fc53dd69ca/re2/re2.cc#L1053.
While java.util.regex.Matcher views characters after "\\" as literal, i.e., using what it is after "\\".
Can we just revise the replacement string to let Re2 do the correct replacement?
@PHILO-HE Apologies for the delayed response. This PR attempts to fix the mismatch problems in regexp_replace: https://github.com/facebookincubator/velox/pull/10981
https://github.com/facebookincubator/velox/pull/10981 Already merged