velox icon indicating copy to clipboard operation
velox copied to clipboard

Add pattern and replacement preprocessing to the Spark regexp_replace function

Open kecookier opened this issue 1 year ago • 7 comments
trafficstars

We need to add pattern and replacement preprocessing to the regexp_replace function to ensure compatibility with vanilla Spark's native regular expression handling.

For patterns, convert (?<name>regex) to (?P<name>regex) as with Presto.

For replacements:

  • Convert group name capture to group index capture.
  • Convert group index capture $N to \N.
  • RE2 only supports '\' followed by a digit or '\', while java.util.regex will ignore '\' in replacements, so we need to unescape it.

kecookier avatar Sep 12 '24 03:09 kecookier

Deploy Preview for meta-velox canceled.

Name Link
Latest commit f9a7dea8dd3e9eeb81e8007abec23dc5af00ca25
Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67428dc79afcea0008d8c127

netlify[bot] avatar Sep 12 '24 03:09 netlify[bot]

@rui-mo Can you help review this PR?

kecookier avatar Sep 12 '24 08:09 kecookier

@rui-mo I've update the code to resolve all comments. Please review again.

kecookier avatar Sep 18 '24 06:09 kecookier

@rui-mo All comments updated.

kecookier avatar Oct 22 '24 06:10 kecookier

@kecookier, does presto also use java.util.regex like Spark?

PHILO-HE avatar Oct 25 '24 08:10 PHILO-HE

@kecookier, does presto also use java.util.regex like Spark?

@PHILO-HE Sorry, I have not thoroughly investigated Presto's Java code. It seems that Presto supports both JONI and RE2J, and according to the documentation, the syntax of the regex library is mostly the same as Java's Pattern.

The documant: https://prestodb.io/docs/current/functions/regexp.html

kecookier avatar Oct 25 '24 09:10 kecookier

@kecookier, I see. The pr looks good. Thanks!

PHILO-HE avatar Oct 25 '24 11:10 PHILO-HE

@rui-mo It seems that regexp_replace is in the skip list of either ExpressionFuzzerTest or SparkExpressionFuzzerTest. Do you have any ideas?

kecookier avatar Nov 13 '24 07:11 kecookier

@rui-mo It seems that regexp_replace is in the skip list of either ExpressionFuzzerTest or SparkExpressionFuzzerTest. Do you have any ideas?

I see. It looks like the fuzzer test is skipped due to https://github.com/facebookincubator/velox/issues/8438.

rui-mo avatar Nov 13 '24 08:11 rui-mo

Would you update the documentation for both Presto and Spark? The others look good to me. Thanks.

https://github.com/facebookincubator/velox/blob/main/velox/docs/functions/presto/regexp.rst https://github.com/facebookincubator/velox/blob/main/velox/docs/functions/spark/regexp.rst

@rui-mo The documents have been updated. Please review them again.

kecookier avatar Nov 17 '24 14:11 kecookier

cc: @mbasmanova @pedroerp If you have any comment. Thanks!

rui-mo avatar Nov 19 '24 08:11 rui-mo

@kgpai could you help check if this won't affect any of our Presto queries?

pedroerp avatar Nov 22 '24 05:11 pedroerp

Thanks for the change @kecookier . I verified some of these changes against Presto java and it looks good.

kgpai avatar Nov 22 '24 18:11 kgpai

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Nov 22 '24 18:11 facebook-github-bot

Hi @kecookier, could you please rebase this PR onto the latest main? It is required for merging the PR. Thanks!

kagamiori avatar Nov 22 '24 20:11 kagamiori

@kagamiori I've rebased onto the latest main, but one of the CI checks failed. It looks like the error is unrelated to this PR.

E20241124 03:25:55.815338 80204 Exceptions.h:66] Line: /home/runner/work/velox/velox/velox/velox/runner/LocalRunner.cpp:124, Function:waitForCompletion, Expression:  LocalRunner did not finish within 5000 us, Source: RUNTIME, ErrorCode: INVALID_STATE
unknown file: Failure
C++ exception with description "Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: LocalRunner did not finish within 5000 us
Retriable: False
Function: waitForCompletion
File: /home/runner/work/velox/velox/velox/velox/runner/LocalRunner.cpp
Line: 124
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxException5State4makeIZNS1_C4EPKcmS5_St17basic_string_viewIcSt11char_traitsIcEES9_S9_S9_bNS1_4TypeES9_EUlRT_E_EESt10shared_ptrIKS2_ESA_SB_
# 2  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 3  _ZN8facebook5velox17VeloxRuntimeErrorC2EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bS7_
# 4  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
# 5  _ZN8facebook5velox6runner11LocalRunner17waitForCompletionEi
# 6  _ZN26LocalRunnerTest_error_Test8TestBodyEv
# 7  _ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc
# 8  _ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc
# 9  _ZN7testing4Test3RunEv
# 10 _ZN7testing8TestInfo3RunEv
# 11 _ZN7testing9TestSuite3RunEv
# 12 _ZN7testing8internal12UnitTestImpl11RunAllTestsEv
# 13 _ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc
# 14 _ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc
# 15 _ZN7testing8UnitTest3RunEv
# 16 _Z13RUN_ALL_TESTSv
# 17 main
# 18 0x0000000000029d8f
# 19 __libc_start_main
# 20 _start
" thrown in the test body.
[  FAILED  ] LocalRunnerTest.error (1130 ms)
[ RUN      ] LocalRunnerTest.scan
[       OK ] LocalRunnerTest.scan (1063 ms)
[----------] 3 tests from LocalRunnerTest (5038 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (5559 ms total)
[  PASSED  ] 2 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] LocalRunnerTest.error

 1 FAILED TEST

kecookier avatar Nov 24 '24 04:11 kecookier

@kagamiori I've rebased onto the latest main, but one of the CI checks failed. It looks like the error is unrelated to this PR.

Hi @kecookier, thank you for rebasing! This failure is unrelated (https://github.com/facebookincubator/velox/issues/11619) and won't block your PR.

kagamiori avatar Nov 25 '24 18:11 kagamiori

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Nov 25 '24 18:11 facebook-github-bot

@kagamiori merged this pull request in facebookincubator/velox@30009811365e1020c59c8d47039d67fc72805b6b.

facebook-github-bot avatar Nov 26 '24 04:11 facebook-github-bot

Conbench analyzed the 1 benchmark run on commit 30009811.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

conbench-facebook[bot] avatar Nov 26 '24 05:11 conbench-facebook[bot]