
[SPARK-48906][SQL] Introduce `SHOW COLLATIONS LIKE ...` syntax to show all collations

Open panbingkun opened this issue 1 year ago • 16 comments

What changes were proposed in this pull request?

This PR aims to introduce SHOW COLLATIONS LIKE ... syntax to show all collations.

Why are the changes needed?

End users will be able to obtain the collations currently supported by Spark through SQL. Other databases, such as MySQL, have similar syntax; ref: https://dev.mysql.com/doc/refman/9.0/en/show-collation.html

PostgreSQL: https://database.guide/how-to-return-a-list-of-available-collations-in-postgresql/

Does this PR introduce any user-facing change?

Yes, end users will be able to obtain the collations currently supported by Spark through commands similar to the following:

spark-sql (default)> SHOW COLLATIONS;
name	provider	version	binaryEquality	binaryOrdering	lowercaseEquality
UTF8_BINARY	spark	1.0	true	true	false
UTF8_LCASE	spark	1.0	false	false	true
ff_Adlm	icu	153.120.0.0	false	false	false
ff_Adlm_CI	icu	153.120.0.0	false	false	false
ff_Adlm_AI	icu	153.120.0.0	false	false	false
ff_Adlm_CI_AI	icu	153.120.0.0	false	false	false
...

spark-sql (default)> SHOW COLLATIONS LIKE '*UTF8_BINARY*';
UTF8_BINARY	spark	1.0	true	true	false
Time taken: 0.043 seconds, Fetched 1 row(s)

How was this patch tested?

Add new UT.
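For illustration, here is a hypothetical shape such a test could take; the suite style, helpers, and column layout are assumptions based on the example output above, not the PR's actual test.

// Hypothetical sketch, assuming a QueryTest-style suite where `sql` returns a DataFrame.
// The column layout follows the example output in the description (name is the first column).
test("SHOW COLLATIONS LIKE returns the matching collation") {
  val rows = sql("SHOW COLLATIONS LIKE '*UTF8_BINARY*'").collect()
  assert(rows.length === 1)
  assert(rows.head.getString(0) === "UTF8_BINARY")
}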

Was this patch authored or co-authored using generative AI tooling?

No.

panbingkun avatar Jul 16 '24 03:07 panbingkun

cc @cloud-fan @mihailom-db

panbingkun avatar Jul 16 '24 03:07 panbingkun

Currently it only shows the normalized collation name.

panbingkun avatar Jul 17 '24 02:07 panbingkun

If necessary, we can also show the columns CaseSensitivity and AccentSensitivity.

panbingkun avatar Jul 17 '24 02:07 panbingkun

Another option:

  • when executing SHOW COLLATIONS ..., only show the columns: name, provider, and version
  • when executing DESCRIBE COLLATIONS ..., display the columns: name, provider, version, binary_equality, binary_ordering, lowercase_equality, case_sensitivity, and accent_sensitivity.

panbingkun avatar Jul 17 '24 03:07 panbingkun

Hi @panbingkun, thanks for taking the initiative to push this work forward. The design of the table was discussed previously, and the agreed-upon structure should take a slightly different format. Let me list the things we need to include:

  • COLLATION_CATALOG (important for udf collations, for now it should be SYSTEM)
  • COLLATION_SCHEMA (important for udf collations, for now it should be BUILTIN)
  • COLLATION_NAME (Full name, with all identifiers, just like you did) ✅
  • LANGUAGE (Name of the language that corresponds to the locale of given collation. Null if there is no backing language (e.g. for UTF8_* family of collations))
  • COUNTRY (Name of the country that corresponds to the locale of given collation. Null if there is no backing country (e.g. for UTF8_* family of collations))
  • ACCENT_SENSITIVITY (ACCENT_SENSITIVE/ACCENT_INSENSITIVE)
  • CASE_SENSITIVITY (CASE_SENSITIVE/CASE_INSENSITIVE)
  • PAD_ATTRIBUTE (Affects whether leading or trailing spaces are significant in string comparisons. Currently always NO_PAD)
  • ICU_VERSION (Null if not an ICU collation) (✅, partially done, just switch the UTF8_* family to null)

All fields should be of string type, and only language, country, and version should be nullable (see the schema sketch below).
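For illustration, a minimal Scala sketch of the output schema this list describes; the field names and nullability come from the list above, not necessarily the exact attribute definitions the PR ends up with.

import org.apache.spark.sql.types.{StringType, StructType}

// Sketch of the SHOW COLLATIONS output schema described above: all fields are
// strings, and only LANGUAGE, COUNTRY and ICU_VERSION are nullable.
val showCollationsSchema: StructType = new StructType()
  .add("COLLATION_CATALOG", StringType, nullable = false)
  .add("COLLATION_SCHEMA", StringType, nullable = false)
  .add("COLLATION_NAME", StringType, nullable = false)
  .add("LANGUAGE", StringType, nullable = true)
  .add("COUNTRY", StringType, nullable = true)
  .add("ACCENT_SENSITIVITY", StringType, nullable = false)
  .add("CASE_SENSITIVITY", StringType, nullable = false)
  .add("PAD_ATTRIBUTE", StringType, nullable = false)
  .add("ICU_VERSION", StringType, nullable = true)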

Apart from SQL API, we need to support other APIs as well, which should be used by calling Session.catalog.collation. Because of this, your approach might need to be reworked a bit.

Please let me know if you have any additional questions and we can work through this PR together.

mihailom-db avatar Jul 22 '24 07:07 mihailom-db

Apart from SQL API, we need to support other APIs as well, which should be used by calling Session.catalog.collation. Because of this, your approach might need to be reworked a bit.

@mihailom-db Thank you very much for your reply (very detailed). Okay, let's work together to complete it. I only have one small question: do we use a single command to display all the fields in the above list,

1. SHOW COLLATIONS LIKE ...

COLLATION_CATALOG COLLATION_SCHEMA COLLATION_NAME LANGUAGE COUNTRY ACCENT_SENSITIVITY CASE_SENSITIVITY PAD_ATTRIBUTE ICU_VERSION
SYSTEM BUILTIN UTF8_BINARY NULL NULL ACCENT_SENSITIVE CASE_SENSITIVE NO_PAD NULL
SYSTEM BUILTIN UTF8_LCASE NULL NULL ACCENT_SENSITIVE CASE_INSENSITIVE NO_PAD NULL
... ... ... ... ... ... ... ... ...

2. Or do we use two commands, e.g.:

A. SHOW COLLATIONS LIKE ...

COLLATION_NAME
UTF8_BINARY
UTF8_LCASE
...

B. DESCRIBE COLLATION UTF8_BINARY

COLLATION_CATALOG COLLATION_SCHEMA COLLATION_NAME LANGUAGE COUNTRY ACCENT_SENSITIVITY CASE_SENSITIVITY PAD_ATTRIBUTE ICU_VERSION
SYSTEM BUILTIN UTF8_BINARY NULL NULL ACCENT_SENSITIVE CASE_SENSITIVE NO_PAD NULL

Which of the above is more suitable?

panbingkun avatar Jul 22 '24 08:07 panbingkun

I believe for now we agreed to have only SHOW COLLATION(S) as a command, and then add support for both LIKE and ILIKE operators for searching. But it is enough to have LIKE as a start, and ILIKE can be done as an addition later.

mihailom-db avatar Jul 22 '24 08:07 mihailom-db

I believe for now we agreed to have only SHOW COLLATION(S) as a command, and then add support for both LIKE and ILIKE operators for searching. But it is enough to have LIKE as a start, and ILIKE can be done as an addition later.

Okay.

panbingkun avatar Jul 22 '24 08:07 panbingkun

@mihailom-db All suggestions have been addressed and verified locally.

panbingkun avatar Jul 23 '24 03:07 panbingkun

Hi @panbingkun, this is starting to look like what we want to get as a result. Thanks for taking the initiative. Apart from the SQL command for SHOW COLLATIONS, we need to get the other APIs working. Usually we list tables and databases for other APIs through the use of catalog information. This is the point where we need to make sure we get the initial design right. I would argue catalog and schema information should not be stored per collation, but only inferred for the purpose of listing here. This becomes crucially important later when we introduce user-defined collations, as these collations will not be available in the list of ICU locales. I will take one more look into how this could be done completely and will get back to you.

mihailom-db avatar Jul 23 '24 06:07 mihailom-db

Hi @panbingkun, this is starting to look like what we want to get as a result. Thanks for taking the initiative. Apart from the SQL command for SHOW COLLATIONS, we need to get the other APIs working. Usually we list tables and databases for other APIs through the use of catalog information. This is the point where we need to make sure we get the initial design right. I would argue catalog and schema information should not be stored per collation, but only inferred for the purpose of listing here. This becomes crucially important later when we introduce user-defined collations, as these collations will not be available in the list of ICU locales. I will take one more look into how this could be done completely and will get back to you.

@mihailom-db Thank you very much for your detailed explanation and response! Looking forward to the results of your further look!

panbingkun avatar Jul 23 '24 07:07 panbingkun

Hi @panbingkun I left some comments for you, sorry for the delay. I believe you are really close to getting to the point of what we want from the SHOW COLLATIONS command, and if you happen to have any questions feel free to ping me. Thanks again for taking this initiative.

mihailom-db avatar Aug 01 '24 06:08 mihailom-db

Hi @panbingkun I left some comments for you, sorry for the delay. I believe you are really close to getting to the point of what we want from the SHOW COLLATIONS command, and if you happen to have any questions feel free to ping me. Thanks again for taking this initiative.

Thank you very much for your patient review; let me update based on your suggestions. (Also, this is not the longest-running PR I have worked on; I remember one that spanned 1 year and 3 months. Joking, 😄)

panbingkun avatar Aug 01 '24 08:08 panbingkun

Hi @mihailom-db, I have a few questions:

  • Does the CATALOG here correspond to Spark's catalog? (The default value should then be spark_catalog, not SYSTEM.) Meanwhile, does SCHEMA correspond to Spark's namespace (or schema)? (The default value should then be default, not BUILTIN.)
  • If collations are scoped to the namespace (or schema), should we implement SHOW COLLATIONS ((FROM | IN) namespace=multipartIdentifier)? (LIKE? pattern)?; instead of SHOW COLLATIONS (LIKE? pattern)?;?

panbingkun avatar Aug 05 '24 08:08 panbingkun

Hi, @panbingkun. Sorry for the delay in responding. As adding this collation-listing functionality makes other changes necessary, I believe it is best to keep this PR to only adding the SHOW COLLATIONS syntax, and then think about how to add support for different catalogs and schemas when we actually introduce collations with these features (user-defined collations). I will add some more comments soon, so you can rework this PR and we can merge it soon.

mihailom-db avatar Aug 28 '24 06:08 mihailom-db

Hi, @panbingkun. Sorry for the delay in responding. As adding this collation-listing functionality makes other changes necessary, I believe it is best to keep this PR to only adding the SHOW COLLATIONS syntax, and then think about how to add support for different catalogs and schemas when we actually introduce collations with these features (user-defined collations). I will add some more comments soon, so you can rework this PR and we can merge it soon.

Don't mention it, let's keep in sync at all times 😌

panbingkun avatar Aug 28 '24 09:08 panbingkun

@panbingkun Could you clarify, please, why you implemented this new command using the DataSource V1 API and not V2? This just requires unnecessary effort for porting it to the DataSource V2 API.

I'm not sure if I need to add an interface like CollationCatalog. Of course, if we don't add such an interface for now, it seems possible to have the method exist in V2SessionCatalog first. However, based on my understanding, it seems more reasonable to add the interface CollationCatalog. Is that correct? @MaxGekk

panbingkun avatar Sep 09 '24 07:09 panbingkun

However, based on my understanding, it seems more reasonable to add the interface CollationCatalog. Is that correct?

@panbingkun @mihailom-db I think since you have almost finished the V1 implementation, let's continue with it, and open a task for migrating this command to V2 here: https://issues.apache.org/jira/browse/SPARK-33392

MaxGekk avatar Sep 09 '24 08:09 MaxGekk

I think since you have almost finished the V1 implementation, let's continue with it, and open a task for migrating this command to V2 here

Okay, let me do it. V2: https://issues.apache.org/jira/browse/SPARK-49543 (I will do it after this one.)

panbingkun avatar Sep 09 '24 08:09 panbingkun

The current SQL command results are as follows:

(base) ➜  spark-community git:(show_collation_syntax) ✗ sh bin/spark-sql
WARNING: Using incubator modules: jdk.incubator.vector
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark Web UI available at http://172.26.2.19:4040
Spark master: local[*], Application Id: local-1725938017441
spark-sql (default)> SHOW system collations;
SYSTEM	BUILTIN	UTF8_BINARY	NULL	NULL	ACCENT_SENSITIVE	CASE_SENSITIVE	NO_PAD	NULL
SYSTEM	BUILTIN	UTF8_LCASE	NULL	NULL	ACCENT_SENSITIVE	CASE_INSENSITIVE	NO_PAD	NULL
SYSTEM	BUILTIN	UNICODE			ACCENT_SENSITIVE	CASE_SENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	UNICODE_AI			ACCENT_SENSITIVE	CASE_INSENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	UNICODE_CI			ACCENT_INSENSITIVE	CASE_SENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	UNICODE_CI_AI			ACCENT_INSENSITIVE	CASE_INSENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	af	Afrikaans		ACCENT_SENSITIVE	CASE_SENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	af_AI	Afrikaans		ACCENT_SENSITIVE	CASE_INSENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	af_CI	Afrikaans		ACCENT_INSENSITIVE	CASE_SENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	af_CI_AI	Afrikaans		ACCENT_INSENSITIVE	CASE_INSENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	am	Amharic		ACCENT_SENSITIVE	CASE_SENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	am_AI	Amharic		ACCENT_SENSITIVE	CASE_INSENSITIVE	NO_PAD	75.1.0.0
SYSTEM	BUILTIN	am_CI	Amharic		ACCENT_INSENSITIVE	CASE_SENSITIVE	NO_PAD	75.1.0.0
...
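For readers unfamiliar with V1 commands, here is a minimal hypothetical sketch of the shape such a command takes; the class name, the hard-coded rows, and the pattern filtering are illustrative assumptions, not the PR's actual code.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
import org.apache.spark.sql.catalyst.util.StringUtils
import org.apache.spark.sql.execution.command.LeafRunnableCommand
import org.apache.spark.sql.types.StringType

// Hypothetical sketch of a V1 SHOW COLLATIONS command.
case class ShowCollationsCommand(pattern: Option[String]) extends LeafRunnableCommand {

  // Output attributes: all string-typed, matching the agreed column list.
  override val output: Seq[Attribute] = Seq(
    "COLLATION_CATALOG", "COLLATION_SCHEMA", "COLLATION_NAME", "LANGUAGE",
    "COUNTRY", "ACCENT_SENSITIVITY", "CASE_SENSITIVITY", "PAD_ATTRIBUTE",
    "ICU_VERSION").map(name => AttributeReference(name, StringType)())

  override def run(sparkSession: SparkSession): Seq[Row] = {
    // Illustrative placeholder rows; a real implementation would enumerate the
    // builtin collations instead of hard-coding them.
    val allCollations: Seq[Row] = Seq(
      Row("SYSTEM", "BUILTIN", "UTF8_BINARY", null, null,
        "ACCENT_SENSITIVE", "CASE_SENSITIVE", "NO_PAD", null),
      Row("SYSTEM", "BUILTIN", "UTF8_LCASE", null, null,
        "ACCENT_SENSITIVE", "CASE_INSENSITIVE", "NO_PAD", null))
    // Apply the optional LIKE-style pattern against the collation name column.
    pattern match {
      case Some(p) => allCollations.filter { row =>
        StringUtils.filterPattern(Seq(row.getString(2)), p).nonEmpty
      }
      case None => allCollations
    }
  }
}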

panbingkun avatar Sep 10 '24 03:09 panbingkun

@mihailom-db @MaxGekk all done!

panbingkun avatar Sep 10 '24 08:09 panbingkun

LGTM apart from the minor version issue I posted before.

It has been updated. Thank you very much for your patient review! ❤️

panbingkun avatar Sep 10 '24 09:09 panbingkun

@cloud-fan @MaxGekk could you take a look at this PR?

mihailom-db avatar Sep 11 '24 04:09 mihailom-db

@panbingkun Could you resolve conflicts, please.

Done, thanks!

panbingkun avatar Sep 11 '24 13:09 panbingkun

@panbingkun It seems the test failure is related to your changes:

[info] - SPARK-43119: Get SQL Keywords *** FAILED *** (11 milliseconds)
[info]   "...LLATE,COLLATION,COLL[ATIONS,COLLECTION],COLUMN,COLUMNS,COMM..." did not equal "...LLATE,COLLATION,COLL[ECTION,COLLECTIONS],COLUMN,COLUMNS,COMM..." (ThriftServerWithSparkContextSuite.scala:217)
[info]   Analysis:
[info]   "...LLATE,COLLATION,COLL[ATIONS,COLLECTION],COLUMN,COLUMNS,COMM..." -> "...LLATE,COLLATION,COLL[ECTION,COLLECTIONS],COLUMN,COLUMNS,COMM..."

Could you fix it, please.

Sorry, that was my mistake. I made a hasty correction, but this time it should be correct. Let's wait for GA; thank you very much!

panbingkun avatar Sep 11 '24 22:09 panbingkun

+1, LGTM. Merging to master. Thank you, @panbingkun and @mihailom-db @uros-db for review.

MaxGekk avatar Sep 12 '24 00:09 MaxGekk

+1, LGTM. Merging to master. Thank you, @panbingkun and @mihailom-db @uros-db for review.

Thanks all @MaxGekk @mihailom-db @uros-db ❤️

panbingkun avatar Sep 12 '24 00:09 panbingkun

Hi all, sorry for the late review; I've been struggling with the user-facing API. I know we have a lot of SHOW commands already, but there are known issues:

  • It's not standard, but we don't have a choice, as Spark doesn't have an information schema yet.
  • The SHOW ... LIKE ... semantics differ from the LIKE operator's. They also differ from the Hive behavior that we copied from.
  • No corresponding DataFrame API.
  • We can't perform transformations on the output; something like SELECT ... FROM (SHOW ...) is not allowed.

Building an information schema is a big effort, but for this SHOW COLLATIONS feature, can we add a builtin TVF like SELECT ... FROM all_collations()? Also cc @srielau

cloud-fan avatar Sep 12 '24 03:09 cloud-fan

BTW we already have a TVF to get all the SQL keywords:

@ExpressionDescription(
  usage = """_FUNC_() - Get Spark SQL keywords""",
  examples = """
    Examples:
      > SELECT * FROM _FUNC_() LIMIT 2;
       ADD  false
       AFTER  false
  """,
  since = "3.5.0",
  group = "generator_funcs")
case class SQLKeywords() extends LeafExpression with Generator with CodegenFallback {
  override def elementSchema: StructType = new StructType()
    .add("keyword", StringType, nullable = false)
    .add("reserved", BooleanType, nullable = false)

  override def eval(input: InternalRow): IterableOnce[InternalRow] = {
    val reservedList = getReservedList()
    keywords.zip(reservedList).map { case (keyword, isReserved) =>
      InternalRow(UTF8String.fromString(keyword), isReserved)
    }
  }

  override def prettyName: String = "sql_keywords"
}
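By analogy, a hypothetical all_collations() generator could follow the same pattern; in the sketch below the collation-enumerating helper is a hard-coded placeholder for illustration, not an existing Spark API.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Generator, LeafExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{StringType, StructType}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical sketch only: an all_collations() TVF modeled on SQLKeywords above.
case class AllCollations() extends LeafExpression with Generator with CodegenFallback {
  override def elementSchema: StructType = new StructType()
    .add("name", StringType, nullable = false)
    .add("provider", StringType, nullable = false)

  // Placeholder for the sketch: a real implementation would enumerate the
  // builtin collations instead of hard-coding a few entries.
  private def listCollations(): Seq[(String, String)] =
    Seq(("UTF8_BINARY", "spark"), ("UTF8_LCASE", "spark"), ("UNICODE", "icu"))

  override def eval(input: InternalRow): IterableOnce[InternalRow] = {
    listCollations().map { case (name, provider) =>
      InternalRow(UTF8String.fromString(name), UTF8String.fromString(provider))
    }
  }

  override def prettyName: String = "all_collations"
}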

cloud-fan avatar Sep 12 '24 03:09 cloud-fan

Hi all, sorry for the late review; I've been struggling with the user-facing API. I know we have a lot of SHOW commands already, but there are known issues:

  • It's not standard, but we don't have a choice, as Spark doesn't have an information schema yet.
  • The SHOW ... LIKE ... semantics differ from the LIKE operator's. They also differ from the Hive behavior that we copied from.
  • No corresponding DataFrame API.
  • We can't perform transformations on the output; something like SELECT ... FROM (SHOW ...) is not allowed.

Building an information schema is a big effort, but for this SHOW COLLATIONS feature, can we add a builtin TVF like SELECT ... FROM all_collations()? Also cc @srielau

Okay, let's investigate carefully first.

panbingkun avatar Sep 12 '24 10:09 panbingkun