flink icon indicating copy to clipboard operation
flink copied to clipboard

[FLINK-32706][table] Add built-in SPLIT_STRING function

Open hanyuzheng7 opened this issue 1 year ago • 7 comments

What is the purpose of the change

This is an implementation of SPLIT

Brief change log

Splits a string into an array of substrings based on a delimiter. If the delimiter is not found, then the original string is returned as the only element in the array. If the delimiter is empty, then all characters in the string are split. If either, string or delimiter, are NULL, then a NULL value is returned. If the delimiter is found at the beginning or end of the string, or there are contiguous delimiters, then an empty space is added to the array.

  • Syntax SPLIT(string, delimiter)

  • Arguments string: The string need to be split delimiter: Splits a string into an array of substrings based on a delimiter

  • Returns If the delimiter is not found, then the original string is returned as the only element in the array. If the delimiter is empty, then all characters in the string are split. If either, string or delimiter, are NULL, then a NULL value is returned.

  • Examples

SELECT SPLIT('abcdefg', 'c');
Result: ['ab', 'defg']
  • See also
  1. ksqlDB Split function ksqlDB provides a scalar function named SPLIT which splits a string into an array of substrings based on a delimiter. Syntax: SPLIT(string, delimiter) For example: SPLIT('a,b,c', ',') will return ['a', 'b', 'c']. https://docs.ksqldb.io/en/0.8.1-ksqldb/developer-guide/ksqldb-reference/scalar-functions/#split

  2. Apache Hive Split function Hive offers a function named split which splits a string around a specified delimiter and returns an array of strings. Syntax: array split(string str, string pat) For example: split('a,b,c', ',') will return ["a", "b", "c"]. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

  3. Spark SQL Split function Spark SQL also offers a function named split, similar to the one in Hive. Syntax: split(str, pattern[, limit]) Here, limit is an optional parameter to specify the maximum length of the returned array. For example: split('oneAtwoBthreeC', '[ABC]', 2) will return ["one", "twoBthreeC"]. https://spark.apache.org/docs/latest/api/sql/index.html#split

  4. Presto Split function Presto offers a function named split which splits a string around a regular expression and returns an array of strings. Syntax: array split(string str, string regex) For example: split('a.b.c', '.') will return ["a", "b", "c"]. https://prestodb.io/docs/current/functions/string.html

Verifying this change

This change added tests in CollectionFunctionsITCase.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

hanyuzheng7 avatar Feb 22 '24 03:02 hanyuzheng7

CI report:

  • 6a2701523499b40ed80e5eb4e7d62c2da134cc80 Azure: SUCCESS
Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

flinkbot avatar Feb 22 '24 03:02 flinkbot

If the delimiter is empty, every character in the string is split

Is that a requirement? As far as I can tell ksqlDB is the only engine that has that behaviour. Moreover after switching to splitByWholeSeparatorPreserveAllTokens you do not guarantee this behaviour anymore.

@MartijnVisser What do you think should be the behaviour for an empty delimiter?

dawidwys avatar May 06 '24 13:05 dawidwys

@dawidwys @MartijnVisser I think we can change the description, If the delimiter is empty, we will return entire string. Now we use splitByWholeSeparatorPreserveAllTokens, when delimiter is empty, it will return entire string.

hanyuzheng7 avatar May 06 '24 14:05 hanyuzheng7

Now we use splitByWholeSeparatorPreserveAllTokens, when delimiter is empty, it will return entire string.

Actually it splits by whitespaces: https://github.com/apache/flink/blob/8e5220b288e49c99333a4bc8ef7e3d5d27193921/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/data/binary/BinaryStringDataUtil.java#L107

I think we can change the description,

Let's first agree what is the desired behaviour.

dawidwys avatar May 07 '24 09:05 dawidwys

@MartijnVisser What do you think should be the behaviour for an empty delimiter?

I've taken a tour along the various databases:

  • Postgres doesn't have a SPLIT function, but leverages regexp_split_to_table or regexp_split_to_array. When an empty delimiter is provided, it throws an error. https://www.postgresql.org/docs/current/functions-string.html
  • MySQL doesn't have a SPLIT function either, but has SUBSTRING_INDEX.
  • MSSQL Server has STRING_SPLIT https://learn.microsoft.com/en-us/sql/t-sql/functions/string-split-transact-sql?view=sql-server-ver16 which also doesn't accept empty delimiters
  • Spark has a SPLIT function and accepts empty delimiters, having the same behavior as ksqlDB https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.split.html
  • Presto has a SPLIT function and accepts empty delimiters, same behavior as ksqlDB https://prestodb.io/docs/current/functions/string.html
  • Clickhouse actually has a wide variety with splitByChar, splitByString, splitByRegexp etc https://clickhouse.com/docs/en/sql-reference/functions/splitting-merging-functions. splitByString appears to have the same behavior as Spark and Presto

All in all, I could find 3 implementations of SPLIT in ksqlDB, Spark, and Presto. All 3 accept empty delimiters. I would think then we should follow the same behavior, since all others have different function names for comparable/similar features

MartijnVisser avatar May 16 '24 11:05 MartijnVisser

@MartijnVisser To confirm, you suggest to support an empty delimiter and in that case split all characters, meaning:

SPLIT('abcde', '') = ['a', 'b', 'c', 'd', 'e']
SPLIT('It is', '')=['I', 't', ' ', 'i', 's']

dawidwys avatar May 16 '24 12:05 dawidwys

To confirm, you suggest to support an empty delimiter and in that case split all characters

Yes

MartijnVisser avatar May 16 '24 16:05 MartijnVisser

@dawidwys So we can use splitByWholeSeparatorPreserveAllTokens when the delimiter is not empty. When the delimiter is empty, do we need to handle this situation specially?

hanyuzheng7 avatar May 17 '24 02:05 hanyuzheng7

When the delimiter is empty, do we need to handle this situation specially?/ Yes, we do. Do you mind updating the PR?

dawidwys avatar May 20 '24 07:05 dawidwys

When the delimiter is empty, do we need to handle this situation specially?/ Yes, we do. Do you mind updating the PR?

I have already updated the PR. https://github.com/apache/flink/pull/24365/files#r1606922419

hanyuzheng7 avatar May 20 '24 15:05 hanyuzheng7