
[SPARK-48906][SQL] Introduce `SHOW COLLATIONS LIKE ...` syntax to show all collations

Open panbingkun opened this issue 1 year ago • 16 comments

What changes were proposed in this pull request?

This PR aims to introduce SHOW COLLATIONS LIKE ... syntax to show all collations.

Why are the changes needed?

End users will be able to obtain the collations currently supported by Spark through SQL. Other databases, such as MySQL, have similar syntax; ref: https://dev.mysql.com/doc/refman/9.0/en/show-collation.html

PostgreSQL: https://database.guide/how-to-return-a-list-of-available-collations-in-postgresql/

Does this PR introduce any user-facing change?

Yes, end users will be able to obtain the collations currently supported by Spark through commands similar to the following:

spark-sql (default)> SHOW COLLATIONS;
name	provider	version	binaryEquality	binaryOrdering	lowercaseEquality
UTF8_BINARY	spark	1.0	true	true	false
UTF8_LCASE	spark	1.0	false	false	true
ff_Adlm	icu	153.120.0.0	false	false	false
ff_Adlm_CI	icu	153.120.0.0	false	false	false
ff_Adlm_AI	icu	153.120.0.0	false	false	false
ff_Adlm_CI_AI	icu	153.120.0.0	false	false	false
...

spark-sql (default)> SHOW COLLATIONS LIKE '*UTF8_BINARY*';
UTF8_BINARY	spark	1.0	true	true	false
Time taken: 0.043 seconds, Fetched 1 row(s)

How was this patch tested?

Add new UT.
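For illustration, here is a hypothetical shape such a test could take; the suite style, helpers, and column layout are assumptions based on the example output above, not the PR's actual test.

// Hypothetical sketch, assuming a QueryTest-style suite where `sql` returns a DataFrame.
// The column layout follows the example output in the description (name is the first column).
test("SHOW COLLATIONS LIKE returns the matching collation") {
  val rows = sql("SHOW COLLATIONS LIKE '*UTF8_BINARY*'").collect()
  assert(rows.length === 1)
  assert(rows.head.getString(0) === "UTF8_BINARY")
}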

Was this patch authored or co-authored using generative AI tooling?

No.

panbingkun avatar Jul 16 '24 03:07 panbingkun

cc @cloud-fan @mihailom-db

panbingkun avatar Jul 16 '24 03:07 panbingkun

Currently it only shows the normalized collation name.

panbingkun avatar Jul 17 '24 02:07 panbingkun

If necessary, we can also show the columns CaseSensitivity and AccentSensitivity.

panbingkun avatar Jul 17 '24 02:07 panbingkun

Another option:

  • when executing SHOW COLLATIONS ..., only show the columns: name, provider, and version
  • when executing DESCRIBE COLLATIONS ..., display the columns: name, provider, version, binary_equality, binary_ordering, lowercase_equality, case_sensitivity, and accent_sensitivity.

panbingkun avatar Jul 17 '24 03:07 panbingkun

Hi @panbingkun, thanks for taking the initiative to push this work forward. The design of the table was discussed previously, and the agreed-upon structure should take a slightly different format. Let me list the things we need to include:

  • COLLATION_CATALOG (important for udf collations, for now it should be SYSTEM)
  • COLLATION_SCHEMA (important for udf collations, for now it should be BUILTIN)
  • COLLATION_NAME (Full name, with all identifiers, just like you did) ✅
  • LANGUAGE (Name of the language that corresponds to the locale of given collation. Null if there is no backing language (e.g. for UTF8_* family of collations))
  • COUNTRY (Name of the country that corresponds to the locale of given collation. Null if there is no backing country (e.g. for UTF8_* family of collations))
  • ACCENT_SENSITIVITY (ACCENT_SENSITIVE/ACCENT_INSENSITIVE)
  • CASE_SENSITIVITY (CASE_SENSITIVE/CASE_INSENSITIVE)
  • PAD_ATTRIBUTE (Affects whether leading or trailing spaces are significant in string comparisons. Currently always NO_PAD)
  • ICU_VERSION (Null if not an ICU collation) (✅, partially done, just switch the UTF8_* family to null)

All fields should be of string type, and only language, country, and version should be nullable (see the schema sketch below).
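For illustration, a minimal Scala sketch of the output schema this list describes; the field names and nullability come from the list above, not necessarily the exact attribute definitions the PR ends up with.

import org.apache.spark.sql.types.{StringType, StructType}

// Sketch of the SHOW COLLATIONS output schema described above: all fields are
// strings, and only LANGUAGE, COUNTRY and ICU_VERSION are nullable.
val showCollationsSchema: StructType = new StructType()
  .add("COLLATION_CATALOG", StringType, nullable = false)
  .add("COLLATION_SCHEMA", StringType, nullable = false)
  .add("COLLATION_NAME", StringType, nullable = false)
  .add("LANGUAGE", StringType, nullable = true)
  .add("COUNTRY", StringType, nullable = true)
  .add("ACCENT_SENSITIVITY", StringType, nullable = false)
  .add("CASE_SENSITIVITY", StringType, nullable = false)
  .add("PAD_ATTRIBUTE", StringType, nullable = false)
  .add("ICU_VERSION", StringType, nullable = true)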

Apart from SQL API, we need to support other APIs as well, which should be used by calling Session.catalog.collation. Because of this, your approach might need to be reworked a bit.

Please let me know if you have any additional questions and we can work through this PR together.

mihailom-db avatar Jul 22 '24 07:07 mihailom-db

Apart from SQL API, we need to support other APIs as well, which should be used by calling Session.catalog.collation. Because of this, your approach might need to be reworked a bit.

@mihailom-db Thank you very much for your reply (very detailed). Okay, let's work together to complete it. I only have one small question: do we use a single command to display all the fields in the above list,

1. SHOW COLLATIONS LIKE ...

COLLATION_CATALOG COLLATION_SCHEMA COLLATION_NAME LANGUAGE COUNTRY ACCENT_SENSITIVITY CASE_SENSITIVITY PAD_ATTRIBUTE ICU_VERSION
SYSTEM BUILTIN UTF8_BINARY NULL NULL ACCENT_SENSITIVE CASE_SENSITIVE NO_PAD NULL
SYSTEM BUILTIN UTF8_LCASE NULL NULL ACCENT_SENSITIVE CASE_INSENSITIVE NO_PAD NULL
... ... ... ... ... ... ... ... ...

2. Or do we use two commands, e.g.:

A. SHOW COLLATIONS LIKE ...

COLLATION_NAME
UTF8_BINARY
UTF8_LCASE
...

B. DESCRIBE COLLATION UTF8_BINARY

COLLATION_CATALOG COLLATION_SCHEMA COLLATION_NAME LANGUAGE COUNTRY ACCENT_SENSITIVITY CASE_SENSITIVITY PAD_ATTRIBUTE ICU_VERSION
SYSTEM BUILTIN UTF8_BINARY NULL NULL ACCENT_SENSITIVE CASE_SENSITIVE NO_PAD NULL

Which of the above is more suitable?

panbingkun avatar Jul 22 '24 08:07 panbingkun

I believe for now we agreed to have only SHOW COLLATION(S) as a command, and then add support for both LIKE and ILIKE operators for searching. But it is enough to have LIKE as a start, and ILIKE can be done as an addition later.

mihailom-db avatar Jul 22 '24 08:07 mihailom-db

I believe for now we agreed to have only SHOW COLLATION(S) as a command, and then add support for both LIKE and ILIKE operators for searching. But it is enough to have LIKE as a start, and ILIKE can be done as an addition later.

Okay.

panbingkun avatar Jul 22 '24 08:07 panbingkun

@mihailom-db All suggestions have been addressed and verified locally.

panbingkun avatar Jul 23 '24 03:07 panbingkun

Hi @panbingkun, this is starting to look like what we want to get as a result. Thanks for taking the initiative. Apart from the SQL command for SHOW COLLATIONS, we need to get the other APIs working. Usually we list tables and databases for other APIs through the use of catalog information. This is the point where we need to make sure we get the initial design right. I would argue catalog and schema information should not be stored per collation, but only inferred for the purpose of listing here. This becomes crucially important later when we introduce user-defined collations, as these collations will not be available in the list of ICU locales. I will take one more look into how this could be done completely and will get back to you.

mihailom-db avatar Jul 23 '24 06:07 mihailom-db

Hi @panbingkun, this is starting to look like what we want to get as a result. Thanks for taking the initiative. Apart from the SQL command for SHOW COLLATIONS, we need to get the other APIs working. Usually we list tables and databases for other APIs through the use of catalog information. This is the point where we need to make sure we get the initial design right. I would argue catalog and schema information should not be stored per collation, but only inferred for the purpose of listing here. This becomes crucially important later when we introduce user-defined collations, as these collations will not be available in the list of ICU locales. I will take one more look into how this could be done completely and will get back to you.

@mihailom-db Thank you very much for your detailed explanation and response! Looking forward to the results of your further look!

panbingkun avatar Jul 23 '24 07:07 panbingkun

Hi @panbingkun I left some comments for you, sorry for the delay. I believe you are really close to getting to the point of what we want from the SHOW COLLATIONS command, and if you happen to have any questions feel free to ping me. Thanks again for taking this initiative.

mihailom-db avatar Aug 01 '24 06:08 mihailom-db

Hi @panbingkun I left some comments for you, sorry for the delay. I believe you are really close to getting to the point of what we want from the SHOW COLLATIONS command, and if you happen to have any questions feel free to ping me. Thanks again for taking this initiative.

Thank you very much for your patient review; let me update based on your suggestions. (Also, this is not the longest-running PR I have worked on; I remember one that spanned 1 year and 3 months. Joking, 😄)

panbingkun avatar Aug 01 '24 08:08 panbingkun

Hi @mihailom-db, I have a few questions:

  • Does the CATALOG here correspond to Spark's catalog? (The default value should then be spark_catalog, not SYSTEM.) Meanwhile, does SCHEMA correspond to Spark's namespace (or schema)? (The default value should then be default, not BUILTIN.)
  • If collations are scoped to the namespace (or schema), should we implement SHOW COLLATIONS ((FROM | IN) namespace=multipartIdentifier)? (LIKE? pattern)?; instead of SHOW COLLATIONS (LIKE? pattern)?;?

panbingkun avatar Aug 05 '24 08:08 panbingkun

Hi, @panbingkun. Sorry for the delay in responding. As adding this collation-listing functionality makes other changes necessary, I believe it is best to keep this PR to only adding the SHOW COLLATIONS syntax, and then think about how to add support for different catalogs and schemas when we actually introduce collations with these features (user-defined collations). I will add some more comments soon, so you can rework this PR and we can merge it soon.

mihailom-db avatar Aug 28 '24 06:08 mihailom-db

Hi, @panbingkun. Sorry for the delay in responding. As adding this collation-listing functionality makes other changes necessary, I believe it is best to keep this PR to only adding the SHOW COLLATIONS syntax, and then think about how to add support for different catalogs and schemas when we actually introduce collations with these features (user-defined collations). I will add some more comments soon, so you can rework this PR and we can merge it soon.

Don't mention it, let's keep in sync at all times 😌

panbingkun avatar Aug 28 '24 09:08 panbingkun

@panbingkun Could you clarify, please, why you implemented this new command using the DataSource V1 API and not V2? This just requires unnecessary effort for porting it to the DataSource V2 API.

I'm not sure if I need to add an interface like CollationCatalog. Of course, if we don't add such an interface for now, it seems possible to have the method exist in V2SessionCatalog first. However, based on my understanding, it seems more reasonable to add the interface CollationCatalog. Is that correct? @MaxGekk

panbingkun avatar Sep 09 '24 07:09 panbingkun

However, based on my understanding, it seems more reasonable to add the interface CollationCatalog. Is that correct?

@panbingkun @mihailom-db I think since you have almost finished the V1 implementation, let's continue with it, and open a task for migrating this command to V2 here: https://issues.apache.org/jira/browse/SPARK-33392

MaxGekk avatar Sep 09 '24 08:09 MaxGekk

I think since you have almost finished the V1 implementation, let's continue with it, and open a task for migrating this command to V2 here

Okay, let me do it. V2: https://issues.apache.org/jira/browse/SPARK-49543 (I will do it after this one.)

panbingkun avatar Sep 09 '24 08:09 panbingkun

The current SQL command results are as follows:

(base) ➜  spark-community git:(show_collation_syntax) ✗ sh bin/spark-sql
WARNING: Using incubator modules: jdk.incubator.vector
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark Web UI available at http://172.26.2.19:4040
Spark master: local[*], Application Id: local-1725938017441
spark-sql (default)> SHOW system collations;
SYSTEM	BUILTIN	UTF8_BINARY	NULL	NULL	ACCENT_SENSITIVE	CASE_SENSITIVE	NO_PAD	NULL
SYSTEM	BUILTIN	UTF8_LCASE	NULL	NULL	ACCENT_SENSITIVE	CASE_INSENSITIVE	NO_PAD	NULL
SYSTEM	BUILTIN	UNICODE			ACCENT_SENSITIVE	CASE_SENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	UNICODE_AI			ACCENT_SENSITIVE	CASE_INSENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	UNICODE_CI			ACCENT_INSENSITIVE	CASE_SENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	UNICODE_CI_AI			ACCENT_INSENSITIVE	CASE_INSENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	af	Afrikaans		ACCENT_SENSITIVE	CASE_SENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	af_AI	Afrikaans		ACCENT_SENSITIVE	CASE_INSENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	af_CI	Afrikaans		ACCENT_INSENSITIVE	CASE_SENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	af_CI_AI	Afrikaans		ACCENT_INSENSITIVE	CASE_INSENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	am	Amharic		ACCENT_SENSITIVE	CASE_SENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	am_AI	Amharic		ACCENT_SENSITIVE	CASE_INSENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	am_CI	Amharic		ACCENT_INSENSITIVE	CASE_SENSITIVE	NO_PAD	75.1.0.0
...
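For readers unfamiliar with V1 commands, here is a minimal hypothetical sketch of the shape such a command takes; the class name, the hard-coded rows, and the pattern filtering are illustrative assumptions, not the PR's actual code.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
import org.apache.spark.sql.catalyst.util.StringUtils
import org.apache.spark.sql.execution.command.LeafRunnableCommand
import org.apache.spark.sql.types.StringType

// Hypothetical sketch of a V1 SHOW COLLATIONS command.
case class ShowCollationsCommand(pattern: Option[String]) extends LeafRunnableCommand {

  // Output attributes: all string-typed, matching the agreed column list.
  override val output: Seq[Attribute] = Seq(
    "COLLATION_CATALOG", "COLLATION_SCHEMA", "COLLATION_NAME", "LANGUAGE",
    "COUNTRY", "ACCENT_SENSITIVITY", "CASE_SENSITIVITY", "PAD_ATTRIBUTE",
    "ICU_VERSION").map(name => AttributeReference(name, StringType)())

  override def run(sparkSession: SparkSession): Seq[Row] = {
    // Illustrative placeholder rows; a real implementation would enumerate the
    // builtin collations instead of hard-coding them.
    val allCollations: Seq[Row] = Seq(
      Row("SYSTEM", "BUILTIN", "UTF8_BINARY", null, null,
        "ACCENT_SENSITIVE", "CASE_SENSITIVE", "NO_PAD", null),
      Row("SYSTEM", "BUILTIN", "UTF8_LCASE", null, null,
        "ACCENT_SENSITIVE", "CASE_INSENSITIVE", "NO_PAD", null))
    // Apply the optional LIKE-style pattern against the collation name column.
    pattern match {
      case Some(p) => allCollations.filter { row =>
        StringUtils.filterPattern(Seq(row.getString(2)), p).nonEmpty
      }
      case None => allCollations
    }
  }
}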

panbingkun avatar Sep 10 '24 03:09 panbingkun

@mihailom-db @MaxGekk all done!

panbingkun avatar Sep 10 '24 08:09 panbingkun

LGTM apart from the minor version issue I posted before.

It has been updated. Thank you very much for your patient review! ❤️

panbingkun avatar Sep 10 '24 09:09 panbingkun

@cloud-fan @MaxGekk could you take a look at this PR?

mihailom-db avatar Sep 11 '24 04:09 mihailom-db

@panbingkun Could you resolve conflicts, please.

Done, thanks!

panbingkun avatar Sep 11 '24 13:09 panbingkun

@panbingkun It seems the test failure is related to your changes:

[info] - SPARK-43119: Get SQL Keywords *** FAILED *** (11 milliseconds)
[info]   "...LLATE,COLLATION,COLL[ATIONS,COLLECTION],COLUMN,COLUMNS,COMM..." did not equal "...LLATE,COLLATION,COLL[ECTION,COLLECTIONS],COLUMN,COLUMNS,COMM..." (ThriftServerWithSparkContextSuite.scala:217)
[info]   Analysis:
[info]   "...LLATE,COLLATION,COLL[ATIONS,COLLECTION],COLUMN,COLUMNS,COMM..." -> "...LLATE,COLLATION,COLL[ECTION,COLLECTIONS],COLUMN,COLUMNS,COMM..."

Could you fix it, please.

Sorry, that was my mistake. I made a hasty correction, but this time it should be correct. Let's wait for GA; thank you very much!

panbingkun avatar Sep 11 '24 22:09 panbingkun

+1, LGTM. Merging to master. Thank you, @panbingkun and @mihailom-db @uros-db for review.

MaxGekk avatar Sep 12 '24 00:09 MaxGekk

+1, LGTM. Merging to master. Thank you, @panbingkun and @mihailom-db @uros-db for review.

Thanks all @MaxGekk @mihailom-db @uros-db ❤️

panbingkun avatar Sep 12 '24 00:09 panbingkun

Hi all, sorry for the late review; I've been struggling with the user-facing API. I know we have a lot of SHOW commands already, but there are known issues:

  • It's not standard, but we don't have a choice, as Spark doesn't have an information schema yet.
  • The SHOW ... LIKE ... semantics differ from the LIKE operator's. They also differ from the Hive behavior that we copied from.
  • No corresponding DataFrame API.
  • We can't perform transformations on the output; something like SELECT ... FROM (SHOW ...) is not allowed.

Building an information schema is a big effort, but for this SHOW COLLATIONS feature, can we add a builtin TVF like SELECT ... FROM all_collations()? Also cc @srielau

cloud-fan avatar Sep 12 '24 03:09 cloud-fan

BTW we already have a TVF to get all the SQL keywords:

@ExpressionDescription(
  usage = """_FUNC_() - Get Spark SQL keywords""",
  examples = """
    Examples:
      > SELECT * FROM _FUNC_() LIMIT 2;
       ADD  false
       AFTER  false
  """,
  since = "3.5.0",
  group = "generator_funcs")
case class SQLKeywords() extends LeafExpression with Generator with CodegenFallback {
  override def elementSchema: StructType = new StructType()
    .add("keyword", StringType, nullable = false)
    .add("reserved", BooleanType, nullable = false)

  override def eval(input: InternalRow): IterableOnce[InternalRow] = {
    val reservedList = getReservedList()
    keywords.zip(reservedList).map { case (keyword, isReserved) =>
      InternalRow(UTF8String.fromString(keyword), isReserved)
    }
  }

  override def prettyName: String = "sql_keywords"
}
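By analogy, a hypothetical all_collations() generator could follow the same pattern; in the sketch below the collation-enumerating helper is a hard-coded placeholder for illustration, not an existing Spark API.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Generator, LeafExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{StringType, StructType}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical sketch only: an all_collations() TVF modeled on SQLKeywords above.
case class AllCollations() extends LeafExpression with Generator with CodegenFallback {
  override def elementSchema: StructType = new StructType()
    .add("name", StringType, nullable = false)
    .add("provider", StringType, nullable = false)

  // Placeholder for the sketch: a real implementation would enumerate the
  // builtin collations instead of hard-coding a few entries.
  private def listCollations(): Seq[(String, String)] =
    Seq(("UTF8_BINARY", "spark"), ("UTF8_LCASE", "spark"), ("UNICODE", "icu"))

  override def eval(input: InternalRow): IterableOnce[InternalRow] = {
    listCollations().map { case (name, provider) =>
      InternalRow(UTF8String.fromString(name), UTF8String.fromString(provider))
    }
  }

  override def prettyName: String = "all_collations"
}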

cloud-fan avatar Sep 12 '24 03:09 cloud-fan

Hi all, sorry for the late review; I've been struggling with the user-facing API. I know we have a lot of SHOW commands already, but there are known issues:

  • It's not standard, but we don't have a choice, as Spark doesn't have an information schema yet.
  • The SHOW ... LIKE ... semantics differ from the LIKE operator's. They also differ from the Hive behavior that we copied from.
  • No corresponding DataFrame API.
  • We can't perform transformations on the output; something like SELECT ... FROM (SHOW ...) is not allowed.

Building an information schema is a big effort, but for this SHOW COLLATIONS feature, can we add a builtin TVF like SELECT ... FROM all_collations()? Also cc @srielau

Okay, let's investigate carefully first.

panbingkun avatar Sep 12 '24 10:09 panbingkun