spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-49968][SQL] The split function produces incorrect results with an empty regex and a limit

Open DenineLu opened this issue 1 year ago • 3 comments

What changes were proposed in this pull request?

After SPARK-40194, the current behavior of the split function is as follows:

select split('hello', 'h', 1) // result is ["hello"]
select split('hello', '-', 1) // result is ["hello"]
select split('hello', '', 1)  // result is ["h"]

select split('1A2A3A4', 'A', 3) // result is ["1","2","3A4"]
select split('1A2A3A4', '', 3)  // result is ["1","A","2"]

However, according to the function's description, when the limit is greater than zero, the last element of the split result should contain the remaining part of the input string.

Arguments:
      * str - a string expression to split.
      * regex - a string representing a regular expression. The regex string should be a Java regular expression.
      * limit - an integer expression which controls the number of times the regex is applied.
          * limit > 0: The resulting array's length will not be more than `limit`, and the resulting array's last entry will contain all input beyond the last matched regex.
          * limit <= 0: `regex` will be applied as many times as possible, and the resulting array can be of any size. 

So, the split function produces incorrect results with an empty regex and a limit. The correct result should be:

select split('hello', '', 1)    // result is ["hello"]

select split('1A2A3A4', '', 3)  // result is ["1","A","2A3A4"]

Why are the changes needed?

Fix correctness issue.

Does this PR introduce any user-facing change?

Yes. When the empty regex parameter is provided along with a limit parameter greater than 0, the output of the split function changes. Before this patch

select split('hello', '', 1)          // result is ["h"]
select split('1A2A3A4', '', 3)  // result is ["1","A","2"]

After this patch

select split('hello', '', 1)          // result is ["hello"]
select split('1A2A3A4', '', 3)  // result is ["1","A","2A3A4"]

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

DenineLu avatar Oct 15 '24 05:10 DenineLu

cc @wangyum @cloud-fan

TongWei1105 avatar Oct 15 '24 06:10 TongWei1105

cc @vitaliili-db

wangyum avatar Oct 15 '24 07:10 wangyum

thanks for making this change - however, please add collation-related tests as well, see:

test("StringSplit expression with collated strings")

in CollationRegexpExpressionsSuite.scala

Thank you for your guidance. The relevant tests have been added.

DenineLu avatar Oct 17 '24 08:10 DenineLu

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Feb 01 '25 00:02 github-actions[bot]

What's the behavior of other mainstream databases?

beliefer avatar Feb 02 '25 07:02 beliefer

What's the behavior of other mainstream databases?

As far as I know, the split function of HIVE and StarRocks does not support the limit parameter, and the behavior of presto is consistent with the revised behavior of this PR.

DenineLu avatar Feb 06 '25 02:02 DenineLu

Please investigate more databases, then we make the decision which is the more suitable behavior.

beliefer avatar Feb 06 '25 08:02 beliefer

Please investigate more databases, then we make the decision which is the more suitable behavior.

For the split(str, regex, limit) function, the limit parameter controls the number of times regex is applied. I checked the related functions of mainstream databases. Except for presto and trino, there are basically no similar functions.

Database/SQLEngine Behavior of this function
PostgreSQL There is no similar function
MySQL There is no similar function
Oracle There is no similar function
MariaDB There is no similar function
Microsoft SQL Server There is no similar function
Hive There is no similar function
PrestoDB The behavior is consistent with this PR, and the functionality is implemented earlier than Spark
Trino The behavior is consistent with this PR, and the functionality is implemented earlier than Spark

DenineLu avatar Feb 07 '25 03:02 DenineLu

@cloud-fan This is a behavior change since Spark 3.4.

  SELECT str_to_map('p9999.c1.m9999.l5352', '\.', '')['p']; SELECT split('p9999', '', 2);
Spark <= 3.3 9999 ["p","9999"]
Spark > 3.4 9 ["p","9"]

wangyum avatar Jul 18 '25 15:07 wangyum

Yea I think we should fix this, but given the behavior change is already shipped, we need a legacy config to protect this fix.

cloud-fan avatar Jul 22 '25 14:07 cloud-fan

I added a configuration to fall back to the current behavior, can you review it again? @wangyum @cloud-fan

DenineLu avatar Jul 23 '25 08:07 DenineLu

thanks, merging to master!

cloud-fan avatar Jul 25 '25 09:07 cloud-fan