spark [SPARK-49968][SQL] The split function produces incorrect results with an empty regex and a limit

What changes were proposed in this pull request?

After SPARK-40194, the current behavior of the split function is as follows:

select split('hello', 'h', 1) // result is ["hello"]
select split('hello', '-', 1) // result is ["hello"]
select split('hello', '', 1)  // result is ["h"]

select split('1A2A3A4', 'A', 3) // result is ["1","2","3A4"]
select split('1A2A3A4', '', 3)  // result is ["1","A","2"]

However, according to the function's description, when the limit is greater than zero, the last element of the split result should contain the remaining part of the input string.

Arguments:
      * str - a string expression to split.
      * regex - a string representing a regular expression. The regex string should be a Java regular expression.
      * limit - an integer expression which controls the number of times the regex is applied.
          * limit > 0: The resulting array's length will not be more than `limit`, and the resulting array's last entry will contain all input beyond the last matched regex.
          * limit <= 0: `regex` will be applied as many times as possible, and the resulting array can be of any size.

So, the split function produces incorrect results with an empty regex and a limit. The correct result should be:

select split('hello', '', 1)    // result is ["hello"]

select split('1A2A3A4', '', 3)  // result is ["1","A","2A3A4"]

Why are the changes needed?

Fix correctness issue.

Does this PR introduce any user-facing change?

Yes. When the empty regex parameter is provided along with a limit parameter greater than 0, the output of the split function changes. Before this patch

select split('hello', '', 1)          // result is ["h"]
select split('1A2A3A4', '', 3)  // result is ["1","A","2"]

After this patch

select split('hello', '', 1)          // result is ["hello"]
select split('1A2A3A4', '', 3)  // result is ["1","A","2A3A4"]

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

Oct 15 '24 05:10 DenineLu

cc @wangyum @cloud-fan

Oct 15 '24 06:10 TongWei1105

cc @vitaliili-db

Oct 15 '24 07:10 wangyum

thanks for making this change - however, please add collation-related tests as well, see:
test("StringSplit expression with collated strings")
in CollationRegexpExpressionsSuite.scala

Thank you for your guidance. The relevant tests have been added.

Oct 17 '24 08:10 DenineLu

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Feb 01 '25 00:02 github-actions[bot]

What's the behavior of other mainstream databases?

Feb 02 '25 07:02 beliefer

What's the behavior of other mainstream databases?

As far as I know, the split function of HIVE and StarRocks does not support the limit parameter, and the behavior of presto is consistent with the revised behavior of this PR.

Feb 06 '25 02:02 DenineLu

Please investigate more databases, then we make the decision which is the more suitable behavior.

Feb 06 '25 08:02 beliefer

Please investigate more databases, then we make the decision which is the more suitable behavior.

For the split(str, regex, limit) function, the limit parameter controls the number of times regex is applied. I checked the related functions of mainstream databases. Except for presto and trino, there are basically no similar functions.

Database/SQLEngine	Behavior of this function
PostgreSQL	There is no similar function
MySQL	There is no similar function
Oracle	There is no similar function
MariaDB	There is no similar function
Microsoft SQL Server	There is no similar function
Hive	There is no similar function
PrestoDB	The behavior is consistent with this PR, and the functionality is implemented earlier than Spark
Trino	The behavior is consistent with this PR, and the functionality is implemented earlier than Spark

Feb 07 '25 03:02 DenineLu

@cloud-fan This is a behavior change since Spark 3.4.

	SELECT str_to_map('p9999.c1.m9999.l5352', '\.', '')['p'];	SELECT split('p9999', '', 2);
Spark <= 3.3	9999	["p","9999"]
Spark > 3.4	9	["p","9"]

Jul 18 '25 15:07 wangyum

Yea I think we should fix this, but given the behavior change is already shipped, we need a legacy config to protect this fix.

Jul 22 '25 14:07 cloud-fan

I added a configuration to fall back to the current behavior, can you review it again? @wangyum @cloud-fan

Jul 23 '25 08:07 DenineLu

thanks, merging to master!

Jul 25 '25 09:07 cloud-fan

spark spark copied to clipboard

[SPARK-49968][SQL] The split function produces incorrect results with an empty regex and a limit

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

spark
spark copied to clipboard