Enhance split Spark function to support regex

Open unigof opened this issue 1 year ago • 12 comments

Description

The split function currently only supports splitting a string by a single character. This PR adds support for splitting by a regex string.

split(str, regex)
    Splits `str` around occurrences that match `regex`, as many times as possible, and returns an array of the resulting substrings.
    Arguments:
      * str - a string expression to split.
      * regex - a string representing a regular expression. The pattern must be supported by RE2.

Examples:

 SELECT split('oneAtwoBthreeC', '[ABC]'); -- ["one","two","three",""]
 SELECT split('one', ''); -- ["o", "n", "e", ""]
 SELECT split('one', '1'); -- ["one"]
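
For comparison, the expected results above can be reproduced with `java.util.regex` by splitting with a negative limit so that trailing empty strings are kept. This is only a reference sketch, assuming that Spark's two-argument `split` behaves like `String.split(regex, -1)`; the class and method names are hypothetical and are not part of this PR.

```java
import java.util.Arrays;

// Hypothetical reference helper for comparing results; not Velox code.
final class SplitReference {
    // A negative limit applies the pattern as many times as possible
    // and keeps trailing empty strings in the result.
    static String[] split(String str, String regex) {
        return str.split(regex, -1);
    }

    public static void main(String[] args) {
        // The three examples from the PR description.
        System.out.println(Arrays.toString(split("oneAtwoBthreeC", "[ABC]")));
        System.out.println(Arrays.toString(split("one", "")));
        System.out.println(Arrays.toString(split("one", "1")));
    }
}
```

The trailing empty string in the first two examples comes from the negative limit; with a limit of 0, Java would drop it.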

Notice:

There are some semantic differences between Java regex and RE2 regex. In particular, lookahead and lookbehind patterns are not supported by RE2; using one raises a runtime exception (pattern compile error). For example:

  • Positive Lookahead: (?=regex)
  • Negative Lookahead: (?!regex)
  • Positive Lookbehind: (?<=regex)
  • Negative Lookbehind: (?<!regex)
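
To illustrate the difference, here is a minimal sketch in Java, where these constructs are accepted by `java.util.regex`. It splits on a positive lookahead and includes a naive textual check for the four constructs listed above. The class name, the helper `usesLookaround`, and the idea of pre-checking patterns are assumptions for illustration only; nothing like this is claimed to exist in this PR.

```java
// Hypothetical helper; a naive sketch, not part of Velox or Spark.
final class LookaroundCheck {
    private static final String[] LOOKAROUND_PREFIXES = {"(?=", "(?!", "(?<=", "(?<!"};

    // Plain substring check for the four lookaround openers. It does not
    // parse the regex, so an escaped parenthesis such as "\\(?=" is
    // reported as a false positive.
    static boolean usesLookaround(String regex) {
        for (String prefix : LOOKAROUND_PREFIXES) {
            if (regex.contains(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String pattern = "(?=\\d)"; // zero-width match before each digit
        System.out.println(usesLookaround(pattern));
        // java.util.regex accepts the lookahead and splits before each digit.
        System.out.println(String.join("|", "a1b2".split(pattern, -1)));
    }
}
```

A pattern such as `(?=\d)` is valid under the JDK engine, so a query that relies on it would need a different pattern with an RE2-based implementation.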

#6017

unigof avatar Aug 18 '23 12:08 unigof

Deploy Preview for meta-velox canceled.

Name Link
Latest commit 5b627d65a6c1ec8e59d22885880ed886de2f4528
Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/662f06ebc31a830008b357bd

netlify[bot] avatar Aug 18 '23 12:08 netlify[bot]

Hi @mbasmanova @beroyfb, could you help review this PR? Thank you very much.

unigof avatar Aug 25 '23 06:08 unigof

Hi @mbasmanova @PHILO-HE @jackylee-ch , can you review this pr? Thank you

unigof avatar Sep 14 '23 11:09 unigof

> Hi @mbasmanova, let me help clarify. The original version only implements the basic split version in which the pattern is a single character. So this PR will address the limitations.

Let's update PR description to clarify this.

> I just found spark uses JDK's regex.

I see. Do you know whether it uses Joni, RE2 or some other regex engine?

mbasmanova avatar Sep 15 '23 11:09 mbasmanova

> > I just found spark uses JDK's regex.
>
> I see. Do you know whether it uses Joni, RE2 or some other regex engine?

Neither Joni nor RE2. It looks like JDK's own regex engine. See link.

PHILO-HE avatar Sep 15 '23 12:09 PHILO-HE

> Neither Joni nor RE2. It looks like JDK's own regex engine.

Interesting. Isn't it very slow (especially compared with RE2)?

mbasmanova avatar Sep 15 '23 12:09 mbasmanova

> > Neither Joni nor RE2. It looks like JDK's own regex engine.
>
> Interesting. Isn't it very slow (especially compared with RE2)?

Yes, I would also guess it is very slow, and RE2 claims as much here.

PHILO-HE avatar Sep 15 '23 13:09 PHILO-HE

Hi @mbasmanova, could you merge this PR if there are no other problems?

unigof avatar Sep 20 '23 00:09 unigof

@rui-mo Rui, can you help review this PR?

mbasmanova avatar Sep 20 '23 00:09 mbasmanova

hi @rui-mo , can you review this pr again? thanks~

unigof avatar Sep 26 '23 05:09 unigof

Is the function still tracked?

FelixYBW avatar Dec 23 '23 02:12 FelixYBW

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

stale[bot] avatar Mar 22 '24 13:03 stale[bot]

cc @PHILO-HE @rui-mo

jackylee-ch avatar Apr 11 '24 12:04 jackylee-ch

cc @PHILO-HE @rui-mo

jackylee-ch avatar Apr 16 '24 09:04 jackylee-ch

> @jackylee-ch, I meant we can only update the doc for this function without the limit arg, as I noted the limit arg is not considered in this PR. Right? Other than this, the PR looks good! Thanks!

The limit argument of split is not supported in the current main branch, so we need to remove the part of the split doc that mentions limit.

jackylee-ch avatar Apr 19 '24 04:04 jackylee-ch

@PHILO-HE is the PR related to Cody's re2 support?

FelixYBW avatar Apr 19 '24 06:04 FelixYBW