[SPARK-48906][SQL] Introduce `SHOW COLLATIONS LIKE ...` syntax to show all collations
What changes were proposed in this pull request?
This PR aims to introduce the `SHOW COLLATIONS LIKE ...` syntax to show all collations.
Why are the changes needed?
End-users will be able to obtain the collations currently supported by Spark through SQL.
Other databases, such as MySQL, also have similar syntax,
ref: https://dev.mysql.com/doc/refman/9.0/en/show-collation.html
postgresql: https://database.guide/how-to-return-a-list-of-available-collations-in-postgresql/
Does this PR introduce any user-facing change?
Yes, end-users will be able to obtain the collations currently supported by Spark through commands similar to the following:
spark-sql (default)> SHOW COLLATIONS;

| name | provider | version | binaryEquality | binaryOrdering | lowercaseEquality |
|---|---|---|---|---|---|
| UTF8_BINARY | spark | 1.0 | true | true | false |
| UTF8_LCASE | spark | 1.0 | false | false | true |
| ff_Adlm | icu | 153.120.0.0 | false | false | false |
| ff_Adlm_CI | icu | 153.120.0.0 | false | false | false |
| ff_Adlm_AI | icu | 153.120.0.0 | false | false | false |
| ff_Adlm_CI_AI | icu | 153.120.0.0 | false | false | false |
| ... | ... | ... | ... | ... | ... |

spark-sql (default)> SHOW COLLATIONS LIKE '*UTF8_BINARY*';
UTF8_BINARY	spark	1.0	true	true	false
Time taken: 0.043 seconds, Fetched 1 row(s)
How was this patch tested?
Add new UT.
Was this patch authored or co-authored using generative AI tooling?
No.
cc @cloud-fan @mihailom-db
Currently only the normalized collation name is shown.
If necessary, we can also show the columns `CaseSensitivity` and `AccentSensitivity`.
Another option:
- when executing `SHOW COLLATIONS ...`, only show the columns `name`, `provider` and `version`
- when executing `DESCRIBE COLLATIONS ...`, display the columns `name`, `provider`, `version`, `binary_equality`, `binary_ordering`, `lowercase_equality`, `case_sensitivity` and `accent_sensitivity`.
Hi @panbingkun, thanks for taking initiative to push this work forward. The design of the table was discussed previously and the structure that was agreed upon should take a slightly different format. Let me list out things we need to include:
- COLLATION_CATALOG (important for udf collations, for now it should be SYSTEM)
- COLLATION_SCHEMA (important for udf collations, for now it should be BUILTIN)
- COLLATION_NAME (Full name, with all identifiers, just like you did) ✅
- LANGUAGE (Name of the language that corresponds to the locale of given collation. Null if there is no backing language (e.g. for UTF8_* family of collations))
- COUNTRY (Name of the country that corresponds to the locale of given collation. Null if there is no backing country (e.g. for UTF8_* family of collations))
- ACCENT_SENSITIVITY (ACCENT_SENSITIVE/ACCENT_INSENSITIVE)
- CASE_SENSITIVITY (CASE_SENSITIVE/CASE_INSENSITIVE)
- PAD_ATTRIBUTE (Attribute affects whether leading or trailing spaces are significant in string comparisons. Currently always NO_PAD)
- ICU_VERSION (Null if not icu collation) (✅, partially done, just switch UTF8_* family to null)
All fields should be of string type and only language, country and version should be nullable.
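To make the agreed shape concrete, here is a minimal plain-Scala sketch of one output row. The `CollationInfo` case class and the sample values are illustrative only (nullable fields are modeled with `Option`, per the nullability rule above); this is not Spark's actual internal type:

```scala
// Illustrative stand-in for one row of the proposed SHOW COLLATIONS output.
// All fields are strings; language, country and the ICU version are the only
// nullable ones, modeled here with Option.
case class CollationInfo(
    catalog: String,            // for now always "SYSTEM"
    schema: String,             // for now always "BUILTIN"
    name: String,               // full collation name, e.g. "ff_Adlm_CI_AI"
    language: Option[String],   // None for the UTF8_* family
    country: Option[String],    // None for the UTF8_* family
    accentSensitivity: String,  // "ACCENT_SENSITIVE" | "ACCENT_INSENSITIVE"
    caseSensitivity: String,    // "CASE_SENSITIVE" | "CASE_INSENSITIVE"
    padAttribute: String,       // currently always "NO_PAD"
    icuVersion: Option[String]) // None for non-ICU collations

val utf8Binary = CollationInfo(
  "SYSTEM", "BUILTIN", "UTF8_BINARY", None, None,
  "ACCENT_SENSITIVE", "CASE_SENSITIVE", "NO_PAD", None)
```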
Apart from the SQL API, we need to support other APIs as well, which should be usable by calling `Session.catalog.collation`. Because of this, your approach might need to be reworked a bit.
Please let me know if you have any additional questions and we can work through this PR together.
@mihailom-db
Thank you very much for your reply (very detailed). Okay, let's work together to complete it.
I only have a small question: do we use only one command to display all the fields in the list above, e.g.
1. `SHOW COLLATIONS LIKE ...`
| COLLATION_CATALOG | COLLATION_SCHEMA | COLLATION_NAME | LANGUAGE | COUNTRY | ACCENT_SENSITIVITY | CASE_SENSITIVITY | PAD_ATTRIBUTE | ICU_VERSION |
|---|---|---|---|---|---|---|---|---|
| SYSTEM | BUILTIN | UTF8_BINARY | NULL | NULL | ACCENT_SENSITIVE | CASE_SENSITIVE | NO_PAD | NULL |
| SYSTEM | BUILTIN | UTF8_LCASE | NULL | NULL | ACCENT_SENSITIVE | CASE_INSENSITIVE | NO_PAD | NULL |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
2. or do we use two commands, e.g.:
A. `SHOW COLLATIONS LIKE ...`
| COLLATION_NAME |
|---|
| UTF8_BINARY |
| UTF8_LCASE |
| ... |
B. `DESCRIBE COLLATION UTF8_BINARY`
| COLLATION_CATALOG | COLLATION_SCHEMA | COLLATION_NAME | LANGUAGE | COUNTRY | ACCENT_SENSITIVITY | CASE_SENSITIVITY | PAD_ATTRIBUTE | ICU_VERSION |
|---|---|---|---|---|---|---|---|---|
| SYSTEM | BUILTIN | UTF8_BINARY | NULL | NULL | ACCENT_SENSITIVE | CASE_SENSITIVE | NO_PAD | NULL |
Which of the above is more suitable?
I believe for now we agreed to have only SHOW COLLATION(S) as a command, and then add support for both LIKE and ILIKE operators for searching. But it is enough to have LIKE as a start, and ILIKE can be done as an addition later.
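For reference, Spark's `SHOW ... LIKE` commands filter names with a glob-like pattern in which `*` matches any sequence of characters and `|` separates alternative patterns, matching case-insensitively. The following self-contained sketch re-implements that semantic for illustration (modeled on catalyst's `StringUtils.filterPattern`; not the actual Spark code):

```scala
// Glob-style filtering as used by Spark's SHOW ... LIKE commands:
// '*' matches any character sequence, '|' separates alternatives,
// and matching is case-insensitive over the whole name.
def filterPattern(names: Seq[String], pattern: String): Seq[String] = {
  val regexes = pattern.trim.split("\\|").toSeq.map { sub =>
    // Translate the glob into a case-insensitive regex.
    ("(?i)" + sub.trim.replaceAll("\\*", ".*")).r
  }
  names.filter(name => regexes.exists(_.pattern.matcher(name).matches()))
}

val collations = Seq("UTF8_BINARY", "UTF8_LCASE", "UNICODE", "UNICODE_CI")
filterPattern(collations, "*UTF8_BINARY*")   // keeps only UTF8_BINARY
filterPattern(collations, "UNICODE*|*LCASE") // alternatives via '|'
```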
Okay.
@mihailom-db All suggestions have been updated and verified locally as follows:
Hi @panbingkun this is starting to look like what we want to get as a result. Thanks for taking the initiative. Apart from the SQL command for SHOW COLLATIONS we need to get the other APIs working. Usually how we list tables and databases for other APIs is through use of catalog information. This is the point where we need to make sure we make the initial design properly. I would argue catalog and schema information should not be stored per collation, but actually only inferred for the purpose of listing here. This becomes crucially important later when we introduce user-defined collations, as these collations will not be available in the list of ICU locales. I will take one more look into how this could be done completely and will get back to you.
@mihailom-db
Thank you very much for your detailed explanation and response!
Looking forward to the results of your further look!
Hi @panbingkun I left some comments for you, sorry for the delay. I believe you are really close to the point of what we want from the SHOW COLLATIONS command, and if you happen to have any questions feel free to ping me. Thanks again for taking this initiative.
Thank you very much for your patient review; let me update based on your suggestions. (Also, this is not the longest-running PR I have worked on. I remember another PR that spanned 1 year and 3 months. Joking, 😄)
Hi, @mihailom-db, I have a few questions to ask:
- Does the `CATALOG` here correspond to Spark's `catalog`? (the default value should be `spark_catalog`, not `SYSTEM`) Meanwhile, does `SCHEMA` correspond to Spark's `namespace` (or `schema`)? (the default value should be `default`, not `BUILTIN`)
- If the dimension of our collation is the `namespace` (or `schema`), should we implement `SHOW COLLATIONS ((FROM | IN) namespace=multipartIdentifier)? (LIKE? pattern)?;` instead of `SHOW COLLATIONS (LIKE? pattern)?;`?
Hi, @panbingkun. Sorry for the delay in responding. As adding this collation-listing functionality makes other changes necessary, I believe it might be good to limit this PR to only adding the SHOW COLLATIONS syntax, and then think about how to add support for different catalogs and schemas when we actually introduce collations with these features (user-defined collations). I will add some more comments soon, so you can rework this PR and we can merge it soon.
Don't mention it, let's keep in sync at all times 😌
@panbingkun Could you clarify, please, why you implement this new command using the V1 DataSource rather than V2? This just requires unnecessary extra effort for porting it to the DataSource V2 API later.
I'm not sure if I need to add an interface like `CollationCatalog`.
Of course, if we don't add such an interface for now, it seems possible to have the method live in `V2SessionCatalog` first. However, based on my understanding, it seems more reasonable to add the interface `CollationCatalog`. Is that correct? @MaxGekk
@panbingkun @mihailom-db I think since you have almost done the V1 implementation, let's continue with it, and open a task for migrating this command to V2 here: https://issues.apache.org/jira/browse/SPARK-33392
Okay, let me do it. V2: https://issues.apache.org/jira/browse/SPARK-49543 (I will do it after this one).
The current sql command results are as follows:
(base) ➜ spark-community git:(show_collation_syntax) ✗ sh bin/spark-sql
WARNING: Using incubator modules: jdk.incubator.vector
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark Web UI available at http://172.26.2.19:4040
Spark master: local[*], Application Id: local-1725938017441
spark-sql (default)> SHOW system collations;
SYSTEM BUILTIN UTF8_BINARY NULL NULL ACCENT_SENSITIVE CASE_SENSITIVE NO_PAD NULL
SYSTEM BUILTIN UTF8_LCASE NULL NULL ACCENT_SENSITIVE CASE_INSENSITIVE NO_PAD NULL
SYSTEM BUILTIN UNICODE ACCENT_SENSITIVE CASE_SENSITIVE NO_PAD 75.1.0.0
SYSTEM BUILTIN UNICODE_AI ACCENT_SENSITIVE CASE_INSENSITIVE NO_PAD 75.1.0.0
SYSTEM BUILTIN UNICODE_CI ACCENT_INSENSITIVE CASE_SENSITIVE NO_PAD 75.1.0.0
SYSTEM BUILTIN UNICODE_CI_AI ACCENT_INSENSITIVE CASE_INSENSITIVE NO_PAD 75.1.0.0
SYSTEM BUILTIN af Afrikaans ACCENT_SENSITIVE CASE_SENSITIVE NO_PAD 75.1.0.0
SYSTEM BUILTIN af_AI Afrikaans ACCENT_SENSITIVE CASE_INSENSITIVE NO_PAD 75.1.0.0
SYSTEM BUILTIN af_CI Afrikaans ACCENT_INSENSITIVE CASE_SENSITIVE NO_PAD 75.1.0.0
SYSTEM BUILTIN af_CI_AI Afrikaans ACCENT_INSENSITIVE CASE_INSENSITIVE NO_PAD 75.1.0.0
SYSTEM BUILTIN am Amharic ACCENT_SENSITIVE CASE_SENSITIVE NO_PAD 75.1.0.0
SYSTEM BUILTIN am_AI Amharic ACCENT_SENSITIVE CASE_INSENSITIVE NO_PAD 75.1.0.0
SYSTEM BUILTIN am_CI Amharic ACCENT_INSENSITIVE CASE_SENSITIVE NO_PAD 75.1.0.0
...
@mihailom-db @MaxGekk all done!
LGTM apart from the minor version issue I posted before.
It has been updated. Thank you for your patient review, thank you very much! ❤️
@cloud-fan @MaxGekk could you take a look at this PR?
@panbingkun Could you resolve the conflicts, please?
Done, thanks!
@panbingkun It seems the test failure is related to your changes:
[info] - SPARK-43119: Get SQL Keywords *** FAILED *** (11 milliseconds)
[info]   "...LLATE,COLLATION,COLL[ATIONS,COLLECTION],COLUMN,COLUMNS,COMM..." did not equal "...LLATE,COLLATION,COLL[ECTION,COLLECTIONS],COLUMN,COLUMNS,COMM..." (ThriftServerWithSparkContextSuite.scala:217)
[info]   Analysis:
[info]   "...LLATE,COLLATION,COLL[ATIONS,COLLECTION],COLUMN,COLUMNS,COMM..." -> "...LLATE,COLLATION,COLL[ECTION,COLLECTIONS],COLUMN,COLUMNS,COMM..."
Could you fix it, please?
Sorry, that was my mistake. I made a hasty correction earlier; this time it should be correct.
Let's wait for GA, thank you very much!
+1, LGTM. Merging to master. Thank you, @panbingkun and @mihailom-db @uros-db for review.
Thanks all @MaxGekk @mihailom-db @uros-db ❤️
Hi all, sorry for the late review as I've been struggling with the user-facing API. I know we have a lot of SHOW commands already, but there are known issues:
- It's not standard, but we don't have a choice as Spark doesn't have an information schema yet.
- The `SHOW ... LIKE ...` semantics are different from the `LIKE` operator. They are also different from the Hive behavior we copied from.
- No corresponding DataFrame API.
- We can't perform transformations on the output; something like `SELECT ... FROM (SHOW ...)` is not allowed.
Building an information schema is a big effort, but for this SHOW COLLATIONS feature, can we add a builtin TVF like `SELECT ... FROM all_collations()`? Also cc @srielau
BTW, we already have a TVF to get all the SQL keywords:
@ExpressionDescription(
  usage = """_FUNC_() - Get Spark SQL keywords""",
  examples = """
    Examples:
      > SELECT * FROM _FUNC_() LIMIT 2;
       ADD	false
       AFTER	false
  """,
  since = "3.5.0",
  group = "generator_funcs")
case class SQLKeywords() extends LeafExpression with Generator with CodegenFallback {
  override def elementSchema: StructType = new StructType()
    .add("keyword", StringType, nullable = false)
    .add("reserved", BooleanType, nullable = false)

  override def eval(input: InternalRow): IterableOnce[InternalRow] = {
    val reservedList = getReservedList()
    keywords.zip(reservedList).map { case (keyword, isReserved) =>
      InternalRow(UTF8String.fromString(keyword), isReserved)
    }
  }

  override def prettyName: String = "sql_keywords"
}
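A hypothetical `all_collations()` TVF could follow the same Generator pattern as SQLKeywords. As a self-contained illustration of the row-producing part only, the sketch below uses the JDK's `java.text.Collator` locale list as a stand-in for the ICU collator list (ICU is not in the standard library), and the helper name `allCollations` is invented:

```scala
import java.text.Collator

// Illustrative only: enumerate the locales the JDK Collator supports and
// shape them like the (name, language, country) part of a SHOW COLLATIONS
// row. A real Spark TVF would extend Generator (as SQLKeywords does above)
// and read the ICU collator list instead.
def allCollations(): Seq[(String, Option[String], Option[String])] = {
  Collator.getAvailableLocales.toSeq
    .filter(_.toLanguageTag != "und") // drop the root locale
    .map { locale =>
      val name = locale.toLanguageTag.replace('-', '_')
      val language = Option(locale.getDisplayLanguage).filter(_.nonEmpty)
      val country = Option(locale.getDisplayCountry).filter(_.nonEmpty)
      (name, language, country)
    }
    .sortBy(_._1)
}

allCollations().take(3).foreach(println)
```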
Okay, let's investigate carefully first.