velox
Enhance split Spark function to support regex
Description
The split function currently supports splitting a string only by a single character.
This PR adds support for splitting by a regex string.
split(str, regex)
Splits `str` around occurrences that match `regex`, as many times as possible, and returns the resulting array of strings.
Arguments:
* str - a string expression to split.
* regex - a string representing a regular expression. The pattern must be one that RE2 supports.
Examples:
SELECT split('oneAtwoBthreeC', '[ABC]'); -- ["one","two","three",""]
SELECT split('one', ''); -- ["o", "n", "e", ""]
SELECT split('one', '1'); -- ["one"]
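As noted later in this thread, Spark evaluates `split` with the JDK regex engine, so the expected results above can be cross-checked in plain Java. The following is a minimal sketch under stated assumptions, not the Velox implementation: the class name `SplitSketch` is hypothetical, a limit of `-1` is assumed so that trailing empty strings are kept, and Java 8 or later is assumed (earlier versions emit a leading empty string for a zero-width match at position 0).

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Velox or Spark.
class SplitSketch {
    // Assumption: limit = -1 applies the pattern as many times as possible
    // and keeps trailing empty strings, matching the examples above.
    static String[] split(String str, String regex) {
        return Pattern.compile(regex).split(str, -1);
    }

    public static void main(String[] args) {
        // Delimiter at the very end yields a trailing empty string.
        System.out.println(Arrays.toString(split("oneAtwoBthreeC", "[ABC]"))); // [one, two, three, ]
        // Empty pattern splits between every character.
        System.out.println(Arrays.toString(split("one", "")));                 // [o, n, e, ]
        // No match returns the input as a single element.
        System.out.println(Arrays.toString(split("one", "1")));                // [one]
    }
}
```

These outputs describe the JDK behavior only; patterns that RE2 parses differently may give different results in Velox.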
Notice:
There are some semantic differences between Java regex and RE2 regex. In particular, RE2 does not support lookahead/lookbehind patterns. Using one raises a runtime exception (pattern compilation error). For example:
- Positive Lookahead: (?=regex)
- Negative Lookahead: (?!regex)
- Positive Lookbehind: (?<=regex)
- Negative Lookbehind: (?<!regex)
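To make the difference concrete, here is a small Java sketch. It shows a lookahead split that the JDK engine accepts, plus a naive pre-check for the four constructs listed above. The class `LookaroundCheck` and the method `usesLookaround` are hypothetical names introduced for illustration. The check is a plain substring scan, so it can report false positives (for example on an escaped `\(?=`), and it is not how Velox or RE2 detects these patterns.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Velox, Spark, or RE2.
class LookaroundCheck {
    // The four lookaround prefixes listed in the notice above.
    private static final String[] LOOKAROUND_PREFIXES = {"(?=", "(?!", "(?<=", "(?<!"};

    // Naive substring scan: ignores escaping and character classes,
    // so treat a positive result as a hint rather than a verdict.
    static boolean usesLookaround(String regex) {
        for (String prefix : LOOKAROUND_PREFIXES) {
            if (regex.contains(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String regex = "(?=[A-Z])";
        // The JDK engine accepts the lookahead and splits before each capital.
        System.out.println(Arrays.toString(Pattern.compile(regex).split("oneTwoThree", -1))); // [one, Two, Three]
        System.out.println(usesLookaround(regex));   // true
        System.out.println(usesLookaround("[ABC]")); // false
    }
}
```

A query that relies on such a pattern in Spark would therefore need to be rewritten before it can run through an RE2-backed `split`.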
#6017
Deploy Preview for meta-velox canceled.
Name | Link |
---|---|
Latest commit | 5b627d65a6c1ec8e59d22885880ed886de2f4528 |
Latest deploy log | https://app.netlify.com/sites/meta-velox/deploys/662f06ebc31a830008b357bd |
Hi @mbasmanova @beroyfb, could you help review this PR? Thank you very much.
Hi @mbasmanova @PHILO-HE @jackylee-ch, can you review this PR? Thank you.
Hi @mbasmanova, let me help clarify. The original version only implements the basic split version in which the pattern is a single character. So this PR will address the limitations.
Let's update PR description to clarify this.
I just found spark uses JDK's regex.
I see. Do you know whether it uses Joni, RE2 or some other regex engine?
Neither Joni nor RE2. It looks like the JDK's own regex engine. See link.
Interesting. Isn't it very slow (especially compared with RE2)?
Yes, I would also guess it is very slow. RE2 claims as much here.
Hi @mbasmanova, could you merge this PR if there are no other problems?
@rui-mo Rui, can you help review this PR?
Hi @rui-mo, can you review this PR again? Thanks~
Is the function still tracked?
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!
cc @PHILO-HE @rui-mo
@jackylee-ch, I meant we can update only the doc for this function without the `limit` arg, since I note the `limit` arg is not considered in this PR. Right? Apart from this, the PR looks good! Thanks!
The `limit` argument of split is not supported in the current main branch, so we need to remove the split doc that mentions `limit`.
@PHILO-HE is the PR related to Cody's re2 support?