velox
velox copied to clipboard
Support regex delimiter and limit argument for Spark split function
Refactor spark split function to align with vinalia spark split function
- Support split using any regular expression (constant or non-constant)
- Support
limitparameter , limit can also<=0
https://spark.apache.org/docs/latest/api/sql/#split
@rui-mo / @PHILO-HE for help review.
Deploy Preview for meta-velox canceled.
| Name | Link |
|---|---|
| Latest commit | 2a399d72d9fd095efd1f464cacfb1d2d8df75152 |
| Latest deploy log | https://app.netlify.com/sites/meta-velox/deploys/66b32e174ee1b7000855e17a |
@gaoyangxiaozhu Thanks for your work. There is an existing PR supporting regex in split function https://github.com/facebookincubator/velox/pull/6155, and maybe we need to confirm its status first. cc: @PHILO-HE
@gaoyangxiaozhu Thanks for your work. There is an existing PR supporting regex in split function #6155, and maybe we need to confirm its status first. cc: @PHILO-HE
oh, i see, i didn't go through the PRs to check if already people have touch this work. sorry.
@gaoyangxiaozhu Thanks for your work. There is an existing PR supporting regex in split function #6155, and maybe we need to confirm its status first. cc: @PHILO-HE
btw, @rui-mo / @PHILO-HE . looks https://github.com/facebookincubator/velox/pull/6155 only make support for constant pattern, not cover for non-constant pattern or limit parameter. Will confirm in the PR comments
@rui-mo / @PHILO-HE / @FelixYBW since the PR https://github.com/facebookincubator/velox/pull/6155 only consider constant pattern support and has no update long time, could you start review my Pr for split improvement to let's help make this part finalize asap ?
@rui-mo / @PHILO-HE / @FelixYBW since the PR #6155 only consider constant pattern support and has no update long time, could you start review my Pr for
splitimprovement to let's help make this part finalize asap ?
@jackylee-ch, what do you think? I guess you took over the work of 6155.
@rui-mo / @PHILO-HE / @FelixYBW since the PR #6155 only consider constant pattern support and has no update long time, could you start review my Pr for
splitimprovement to let's help make this part finalize asap ?
Sorry for late response. It's ok to move on this PR since it covers #6155. Just concern about the performance impact of this PR. @gaoyangxiaozhu, maybe create a benchmark in gluten and make sure it won't cause negative performance gains?
@rui-mo / @PHILO-HE / @FelixYBW since the PR #6155 only consider constant pattern support and has no update long time, could you start review my Pr for
splitimprovement to let's help make this part finalize asap ?Sorry for late response. It's ok to move on this PR since it covers #6155. Just concern about the performance impact of this PR. @gaoyangxiaozhu, maybe create a benchmark in gluten and make sure it won't cause negative performance gains?
sure, will do. @rui-mo / @PHILO-HE so please help review.
@gaoyangxiaozhu Could you check the corner cases and corresponding tests from my PR first? Thanks. https://github.com/facebookincubator/velox/pull/8825/files#diff-ebc054da2ecc11ac3cf20d47cbc112f9c3b9cf3d5f1784b2149fb0f4e997a2d8R127-R132
updated and added ut to cover - https://github.com/gayangya/velox/blob/eddfb43831ac083a53ad17fcdca28bbd0c4b1e37/velox/functions/sparksql/tests/SplitFunctionsTest.cpp#L250
Could you run the fuzzer test for 5 minute using a command like below?
./velox/expression/fuzzer/spark_expression_fuzzer_test \ --seed ${RANDOM} \ --duration_sec 300 \ --minloglevel=0 \ --stderrthreshold=2 \ --only split
thanks @rui-mo , update a new commit to fix one regex match issue if the delimiter is not empty but has empty size match, and also locally run fuzz test 5 minutes and it passed.
./velox/expression/fuzzer/spark_expression_fuzzer_test --seed 1 --velox_fuzzer_enable_complex_types --duration_sec 300 --minloglevel=0 --stderrthreshold=2 --only split
I also run again locally for fuzz test 5 minutes and works fine
@rui-mo any new comments
kindly ping @rui-mo for any new comments, thank you!
Thanks. Will mark it if no other comments from @PHILO-HE @jinchengchenghh.
cool! thank you @rui-mo, then @PHILO-HE / @jinchengchenghh please help update if you have any new comments
please help review and let's make sure if can make this PR ready this week @PHILO-HE / @jinchengchenghh , thank you!
Only one @PHILO-HE 's question, I think we should check string is flat for strings->data() optimization, but we also have a else branch that not check string only check delimiter and limit is constant, because both dictionary vector and lazy vector is common. We should not let it move to non-constant delimiter and non-constant limit branch.
i got your point , but it is actually optional only for another optmization. Let me do in following patch.
@PHILO-HE any new comments ?
but it is actually optional only for another optmization. Let me do in following patch.
@gaoyangxiaozhu It make sense to me to add current fast path. Perhaps we can take it as a follow-up to use a benchmark to determine if further optimization is needed.
Would you address this comment https://github.com/facebookincubator/velox/pull/10248#discussion_r1677375306 ?
but it is actually optional only for another optmization. Let me do in following patch.
@gaoyangxiaozhu It make sense to me to add current fast path. Perhaps we can take it as a follow-up to use a benchmark to determine if further optimization is needed.
Would you address this comment #10248 (comment) ?
done
@mbasmanova for any new comments ?
@mbasmanova ping again
could we help speed up the review process, thank you
LGTM Thanks!
ping @mbasmanova
kindly ping @mbasmanova
@xiaoxmeng Can you merge the PR?
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@xiaoxmeng merged this pull request in facebookincubator/velox@b35bd6169c5f1aa8f0f3cfe75a89d5b8733be65d.
Conbench analyzed the 1 benchmark run on commit b35bd616.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details.