velox
velox copied to clipboard
Add pattern and replacement preprocessing to the Spark regexp_replace function
We need to add pattern and replacement preprocessing to the regexp_replace function to ensure compatibility with vanilla Spark's native regular expression handling.
For patterns, convert (?<name>regex) to (?P<name>regex) as with Presto.
For replacements:
- Convert group name capture to group index capture.
- Convert group index capture
$Nto\N. - RE2 only supports '\' followed by a digit or '\', while
java.util.regexwill ignore '\' in replacements, so we need to unescape it.
Deploy Preview for meta-velox canceled.
| Name | Link |
|---|---|
| Latest commit | f9a7dea8dd3e9eeb81e8007abec23dc5af00ca25 |
| Latest deploy log | https://app.netlify.com/sites/meta-velox/deploys/67428dc79afcea0008d8c127 |
@rui-mo Can you help review this PR?
@rui-mo I've update the code to resolve all comments. Please review again.
@rui-mo All comments updated.
@kecookier, does presto also use java.util.regex like Spark?
@kecookier, does presto also use java.util.regex like Spark?
@PHILO-HE Sorry, I have not thoroughly investigated Presto's Java code. It seems that Presto supports both JONI and RE2J, and according to the documentation, the syntax of the regex library is mostly the same as Java's Pattern.
The documant: https://prestodb.io/docs/current/functions/regexp.html
@kecookier, I see. The pr looks good. Thanks!
@rui-mo It seems that regexp_replace is in the skip list of either ExpressionFuzzerTest or SparkExpressionFuzzerTest. Do you have any ideas?
@rui-mo It seems that
regexp_replaceis in the skip list of either ExpressionFuzzerTest or SparkExpressionFuzzerTest. Do you have any ideas?
I see. It looks like the fuzzer test is skipped due to https://github.com/facebookincubator/velox/issues/8438.
Would you update the documentation for both Presto and Spark? The others look good to me. Thanks.
https://github.com/facebookincubator/velox/blob/main/velox/docs/functions/presto/regexp.rst https://github.com/facebookincubator/velox/blob/main/velox/docs/functions/spark/regexp.rst
@rui-mo The documents have been updated. Please review them again.
cc: @mbasmanova @pedroerp If you have any comment. Thanks!
@kgpai could you help check if this won't affect any of our Presto queries?
Thanks for the change @kecookier . I verified some of these changes against Presto java and it looks good.
@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi @kecookier, could you please rebase this PR onto the latest main? It is required for merging the PR. Thanks!
@kagamiori I've rebased onto the latest main, but one of the CI checks failed. It looks like the error is unrelated to this PR.
E20241124 03:25:55.815338 80204 Exceptions.h:66] Line: /home/runner/work/velox/velox/velox/velox/runner/LocalRunner.cpp:124, Function:waitForCompletion, Expression: LocalRunner did not finish within 5000 us, Source: RUNTIME, ErrorCode: INVALID_STATE
unknown file: Failure
C++ exception with description "Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: LocalRunner did not finish within 5000 us
Retriable: False
Function: waitForCompletion
File: /home/runner/work/velox/velox/velox/velox/runner/LocalRunner.cpp
Line: 124
Stack trace:
# 0 _ZN8facebook5velox7process10StackTraceC1Ei
# 1 _ZN8facebook5velox14VeloxException5State4makeIZNS1_C4EPKcmS5_St17basic_string_viewIcSt11char_traitsIcEES9_S9_S9_bNS1_4TypeES9_EUlRT_E_EESt10shared_ptrIKS2_ESA_SB_
# 2 _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 3 _ZN8facebook5velox17VeloxRuntimeErrorC2EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bS7_
# 4 _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
# 5 _ZN8facebook5velox6runner11LocalRunner17waitForCompletionEi
# 6 _ZN26LocalRunnerTest_error_Test8TestBodyEv
# 7 _ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc
# 8 _ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc
# 9 _ZN7testing4Test3RunEv
# 10 _ZN7testing8TestInfo3RunEv
# 11 _ZN7testing9TestSuite3RunEv
# 12 _ZN7testing8internal12UnitTestImpl11RunAllTestsEv
# 13 _ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc
# 14 _ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc
# 15 _ZN7testing8UnitTest3RunEv
# 16 _Z13RUN_ALL_TESTSv
# 17 main
# 18 0x0000000000029d8f
# 19 __libc_start_main
# 20 _start
" thrown in the test body.
[ FAILED ] LocalRunnerTest.error (1130 ms)
[ RUN ] LocalRunnerTest.scan
[ OK ] LocalRunnerTest.scan (1063 ms)
[----------] 3 tests from LocalRunnerTest (5038 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (5559 ms total)
[ PASSED ] 2 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] LocalRunnerTest.error
1 FAILED TEST
@kagamiori I've rebased onto the latest main, but one of the CI checks failed. It looks like the error is unrelated to this PR.
Hi @kecookier, thank you for rebasing! This failure is unrelated (https://github.com/facebookincubator/velox/issues/11619) and won't block your PR.
@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@kagamiori merged this pull request in facebookincubator/velox@30009811365e1020c59c8d47039d67fc72805b6b.
Conbench analyzed the 1 benchmark run on commit 30009811.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details.