flink
flink copied to clipboard
[FLINK-32706][table] Add built-in SPLIT_STRING function
What is the purpose of the change
This is an implementation of SPLIT
Brief change log
Splits a string into an array of substrings based on a delimiter. If the delimiter is not found, then the original string is returned as the only element in the array. If the delimiter is empty, then all characters in the string are split. If either, string or delimiter, are NULL, then a NULL value is returned. If the delimiter is found at the beginning or end of the string, or there are contiguous delimiters, then an empty space is added to the array.
-
Syntax
SPLIT(string, delimiter)
-
Arguments string: The string need to be split delimiter: Splits a string into an array of substrings based on a delimiter
-
Returns If the delimiter is not found, then the original string is returned as the only element in the array. If the delimiter is empty, then all characters in the string are split. If either, string or delimiter, are NULL, then a NULL value is returned.
-
Examples
SELECT SPLIT('abcdefg', 'c');
Result: ['ab', 'defg']
- See also
-
ksqlDB Split function ksqlDB provides a scalar function named SPLIT which splits a string into an array of substrings based on a delimiter. Syntax: SPLIT(string, delimiter) For example: SPLIT('a,b,c', ',') will return ['a', 'b', 'c']. https://docs.ksqldb.io/en/0.8.1-ksqldb/developer-guide/ksqldb-reference/scalar-functions/#split
-
Apache Hive Split function Hive offers a function named split which splits a string around a specified delimiter and returns an array of strings. Syntax: array
split(string str, string pat) For example: split('a,b,c', ',') will return ["a", "b", "c"]. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -
Spark SQL Split function Spark SQL also offers a function named split, similar to the one in Hive. Syntax: split(str, pattern[, limit]) Here, limit is an optional parameter to specify the maximum length of the returned array. For example: split('oneAtwoBthreeC', '[ABC]', 2) will return ["one", "twoBthreeC"]. https://spark.apache.org/docs/latest/api/sql/index.html#split
-
Presto Split function Presto offers a function named split which splits a string around a regular expression and returns an array of strings. Syntax: array
split(string str, string regex) For example: split('a.b.c', '.') will return ["a", "b", "c"]. https://prestodb.io/docs/current/functions/string.html
Verifying this change
This change added tests in CollectionFunctionsITCase.
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (yes / no)
- The public API, i.e., is any changed class annotated with
@Public(Evolving)
: (yes / no) - The serializers: (yes / no / don't know)
- The runtime per-record code paths (performance sensitive): (yes / no / don't know)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
- The S3 file system connector: (yes / no / don't know)
Documentation
- Does this pull request introduce a new feature? (yes / no)
- If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
CI report:
- 6a2701523499b40ed80e5eb4e7d62c2da134cc80 Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:-
@flinkbot run azure
re-run the last Azure build
If the delimiter is empty, every character in the string is split
Is that a requirement? As far as I can tell ksqlDB is the only engine that has that behaviour. Moreover after switching to splitByWholeSeparatorPreserveAllTokens
you do not guarantee this behaviour anymore.
@MartijnVisser What do you think should be the behaviour for an empty delimiter?
@dawidwys @MartijnVisser I think we can change the description, If the delimiter is empty, we will return entire string.
Now we use splitByWholeSeparatorPreserveAllTokens
, when delimiter is empty, it will return entire string.
Now we use splitByWholeSeparatorPreserveAllTokens, when delimiter is empty, it will return entire string.
Actually it splits by whitespaces: https://github.com/apache/flink/blob/8e5220b288e49c99333a4bc8ef7e3d5d27193921/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/data/binary/BinaryStringDataUtil.java#L107
I think we can change the description,
Let's first agree what is the desired behaviour.
@MartijnVisser What do you think should be the behaviour for an empty delimiter?
I've taken a tour along the various databases:
- Postgres doesn't have a SPLIT function, but leverages
regexp_split_to_table
orregexp_split_to_array
. When an empty delimiter is provided, it throws an error. https://www.postgresql.org/docs/current/functions-string.html - MySQL doesn't have a SPLIT function either, but has
SUBSTRING_INDEX
. - MSSQL Server has
STRING_SPLIT
https://learn.microsoft.com/en-us/sql/t-sql/functions/string-split-transact-sql?view=sql-server-ver16 which also doesn't accept empty delimiters - Spark has a
SPLIT
function and accepts empty delimiters, having the same behavior as ksqlDB https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.split.html - Presto has a
SPLIT
function and accepts empty delimiters, same behavior as ksqlDB https://prestodb.io/docs/current/functions/string.html - Clickhouse actually has a wide variety with
splitByChar
,splitByString
,splitByRegexp
etc https://clickhouse.com/docs/en/sql-reference/functions/splitting-merging-functions.splitByString
appears to have the same behavior as Spark and Presto
All in all, I could find 3 implementations of SPLIT
in ksqlDB, Spark, and Presto. All 3 accept empty delimiters. I would think then we should follow the same behavior, since all others have different function names for comparable/similar features
@MartijnVisser To confirm, you suggest to support an empty delimiter and in that case split all characters, meaning:
SPLIT('abcde', '') = ['a', 'b', 'c', 'd', 'e']
SPLIT('It is', '')=['I', 't', ' ', 'i', 's']
To confirm, you suggest to support an empty delimiter and in that case split all characters
Yes
@dawidwys So we can use splitByWholeSeparatorPreserveAllTokens
when the delimiter
is not empty. When the delimiter
is empty, do we need to handle this situation specially?
When the delimiter is empty, do we need to handle this situation specially?/ Yes, we do. Do you mind updating the PR?
When the delimiter is empty, do we need to handle this situation specially?/ Yes, we do. Do you mind updating the PR?
I have already updated the PR. https://github.com/apache/flink/pull/24365/files#r1606922419