[SPARK-49968][SQL] The split function produces incorrect results with an empty regex and a limit
### What changes were proposed in this pull request?
After SPARK-40194, the current behavior of the split function is as follows:

```sql
select split('hello', 'h', 1);     -- result is ["hello"]
select split('hello', '-', 1);     -- result is ["hello"]
select split('hello', '', 1);      -- result is ["h"]
select split('1A2A3A4', 'A', 3);   -- result is ["1","2","3A4"]
select split('1A2A3A4', '', 3);    -- result is ["1","A","2"]
```
However, according to the function's description, when the limit is greater than zero, the last element of the split result should contain the remaining part of the input string.
Arguments:
* `str` - a string expression to split.
* `regex` - a string representing a regular expression. The regex string should be a Java regular expression.
* `limit` - an integer expression which controls the number of times the regex is applied.
  * `limit > 0`: the resulting array's length will not be more than `limit`, and the resulting array's last entry will contain all input beyond the last matched regex.
  * `limit <= 0`: `regex` will be applied as many times as possible, and the resulting array can be of any size.
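Since `regex` is a Java regular expression, the documented limit semantics can be sanity-checked directly against Java's `String.split(regex, limit)`, which follows the same contract (a quick standalone check, not Spark code):

```java
import java.util.Arrays;

// Quick check of the documented limit contract using plain Java
// String.split, which follows the same rules for positive,
// negative, and zero limits.
public class SplitLimitCheck {
    public static void main(String[] args) {
        // limit > 0: at most `limit` entries; the last entry keeps
        // all input beyond the last matched regex.
        System.out.println(Arrays.toString("1A2A3A4".split("A", 3)));
        // prints: [1, 2, 3A4]

        // limit < 0: the regex is applied as many times as possible;
        // trailing empty strings are kept (note the trailing empty entry).
        System.out.println(Arrays.toString("1A2A3A4A".split("A", -1)));
        // prints: [1, 2, 3, 4, ]

        // limit == 0: like a negative limit, but trailing empty
        // strings are removed from the result.
        System.out.println(Arrays.toString("1A2A3A4A".split("A", 0)));
        // prints: [1, 2, 3, 4]
    }
}
```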
So, the split function produces incorrect results with an empty regex and a positive limit. The correct results should be:

```sql
select split('hello', '', 1);      -- result is ["hello"]
select split('1A2A3A4', '', 3);    -- result is ["1","A","2A3A4"]
```
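For reference, plain Java `String.split` (on Java 8+) already produces exactly these proposed results for an empty regex, which supports reading them as the intended semantics:

```java
import java.util.Arrays;

// Java 8+ String.split with an empty regex and a positive limit:
// the zero-width match at position 0 is skipped, each applied match
// splits off one character, and the last entry keeps the remaining
// unsplit input.
public class EmptyRegexCheck {
    public static void main(String[] args) {
        // limit 1: the pattern is applied zero times, so the whole
        // input comes back as a single element.
        System.out.println(Arrays.toString("hello".split("", 1)));
        // prints: [hello]

        // limit 3: two splits happen; the remainder stays in the
        // last entry.
        System.out.println(Arrays.toString("1A2A3A4".split("", 3)));
        // prints: [1, A, 2A3A4]
    }
}
```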
### Why are the changes needed?
Fix correctness issue.
### Does this PR introduce any user-facing change?
Yes. When an empty regex is provided along with a limit greater than 0, the output of the split function changes.

Before this patch:

```sql
select split('hello', '', 1);      -- result is ["h"]
select split('1A2A3A4', '', 3);    -- result is ["1","A","2"]
```

After this patch:

```sql
select split('hello', '', 1);      -- result is ["hello"]
select split('1A2A3A4', '', 3);    -- result is ["1","A","2A3A4"]
```
### How was this patch tested?
Unit tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
cc @wangyum @cloud-fan
cc @vitaliili-db
Thanks for making this change - however, please add collation-related tests as well; see `test("StringSplit expression with collated strings")` in `CollationRegexpExpressionsSuite.scala`.
Thank you for your guidance. The relevant tests have been added.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
What's the behavior of other mainstream databases?
As far as I know, the split functions of Hive and StarRocks do not support the limit parameter, and the behavior of Presto is consistent with the revised behavior in this PR.
Please investigate more databases; then we can decide which behavior is more suitable.
For the `split(str, regex, limit)` function, the limit parameter controls the number of times the regex is applied. I checked the related functions of mainstream databases; except for Presto and Trino, there are basically no similar functions.
| Database/SQLEngine | Behavior of this function |
|---|---|
| PostgreSQL | There is no similar function |
| MySQL | There is no similar function |
| Oracle | There is no similar function |
| MariaDB | There is no similar function |
| Microsoft SQL Server | There is no similar function |
| Hive | There is no similar function |
| PrestoDB | The behavior is consistent with this PR, and the functionality was implemented earlier than in Spark |
| Trino | The behavior is consistent with this PR, and the functionality was implemented earlier than in Spark |
@cloud-fan This is a behavior change since Spark 3.4.
|  | SELECT str_to_map('p9999.c1.m9999.l5352', '\.', '')['p'] | SELECT split('p9999', '', 2) |
|---|---|---|
| Spark <= 3.3 | 9999 | ["p","9999"] |
| Spark >= 3.4 | 9 | ["p","9"] |
Yeah, I think we should fix this, but given that the behavior change has already shipped, we need a legacy config to protect this fix.
I added a configuration to fall back to the current behavior; can you review it again? @wangyum @cloud-fan
thanks, merging to master!