[SPARK-49443][SQL][PYTHON] Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for Variant Objects
What changes were proposed in this pull request?
This PR prohibits casts from data types containing structs or maps to variant and introduces a new expression to_variant_object which allows casting from nested types to variants and retains the old functionality. This PR also changes the behavior of the schema_of_variant and schema_of_variant_agg expressions where they now print OBJECT instead of STRUCT (which is not technically correct).
Why are the changes needed?
Casts from structs to variant objects should not be legal, since variant objects are unordered bags of key-value pairs while structs are ordered sets of elements of fixed types. As a result, casts between structs and variant objects do not behave like casts between structs. Example (produced by Serge Rielau):
scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2') as struct<b int, c int>)").show()
+------------------------+
|named_struct(c, 1, b, 2)|
+------------------------+
|{1, 2}|
+------------------------+
Passing a struct through VARIANT loses the position:
scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2')::variant as struct<b int, c int>)").show()
+-----------------------------------------+
|CAST(named_struct(c, 1, b, 2) AS VARIANT)|
+-----------------------------------------+
|{2, 1}|
+-----------------------------------------+
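The difference between the two results above can be modeled in plain Scala. This is a minimal sketch, not Spark code: `CastModel`, `Struct`, `VariantObject`, and the three helper functions are hypothetical names introduced for illustration, values are kept as plain `Int`s (the string-to-int conversion in the SQL example is ignored), and returning `null` for a missing key is a choice of this model.

```scala
// Hypothetical model, not Spark internals: a struct is an ordered list of
// named fields, a variant object is an unordered bag of key-value pairs.
object CastModel {
  type Struct = Vector[(String, Any)]
  type VariantObject = Map[String, Any]

  // Struct-to-struct cast: fields are matched by position, and the target
  // field names simply replace the source field names.
  def castStructToStruct(source: Struct, targetNames: Vector[String]): Struct = {
    require(source.length == targetNames.length, "field counts must match")
    targetNames.zip(source.map(_._2))
  }

  // Struct to variant object: only the key-value pairs survive, not the order.
  def toVariantObject(source: Struct): VariantObject = source.toMap

  // Variant object to struct: positions are gone, so fields are looked up by name.
  def castObjectToStruct(obj: VariantObject, targetNames: Vector[String]): Struct =
    targetNames.map(name => name -> obj.getOrElse(name, null))
}

// Mirrors named_struct('c', 1, 'b', 2) cast to struct<b, c>.
val source: CastModel.Struct = Vector("c" -> 1, "b" -> 2)
val target = Vector("b", "c")

// Positional match: b takes the first value, c takes the second.
val positional = CastModel.castStructToStruct(source, target)

// By-name match after the round trip through an object: b and c keep their own values.
val viaVariant = CastModel.castObjectToStruct(CastModel.toVariantObject(source), target)

println(positional)
println(viaVariant)
```

In this model the positional cast yields the values in the order (1, 2), while the round trip through the object yields (2, 1), which is the same discrepancy as in the two spark-shell outputs.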
Casts from maps to variant objects should also not be legal, since the two are orthogonal data types. A map can hold a variable number of key-value pairs described by a single key type and a single value type in the schema, whereas the schema of an object (as produced by the schema_of_variant expressions) has a type for each value in the object. Objects can hold values of different types while maps cannot, and objects can only have string keys while maps can also have complex keys.
We should therefore prohibit the existing behavior of allowing explicit casts from structs and maps to variants as the variant spec currently only supports an object type which is remotely compatible with structs and maps. We introduce a new expression that converts schemas containing structs and maps to variants (where these types are converted to objects). We will call it to_variant_object.
Does this PR introduce any user-facing change?
Yes, it introduces the to_variant_object expression and changes the behavior of the schema_of_variant/schema_of_variant_agg expressions.
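A rough sketch of how the user-facing behavior might be exercised from spark-shell, assuming a build that includes this PR and the `spark` session that spark-shell provides. The column aliases `v` and `s` and the value names are illustrative only, and no exact output is shown because it depends on the build.

```scala
import scala.util.Try

// Structs and maps are converted through the new expression instead of a cast.
val converted = spark.sql(
  "SELECT to_variant_object(named_struct('c', 1, 'b', '2')) AS v")
converted.show(false)

// Per the description above, the schema string should now say OBJECT rather than STRUCT.
val schema = spark.sql(
  "SELECT schema_of_variant(to_variant_object(named_struct('c', 1, 'b', '2'))) AS s"
).head().getString(0)
println(schema)

// The direct cast from a struct to variant is expected to be rejected after this PR.
// Try is used so the sketch does not depend on the exact error class raised.
val directCast = Try(
  spark.sql("SELECT cast(named_struct('c', 1, 'b', '2') AS variant)").collect())
println(directCast.isFailure)
```

The same pattern should apply to map inputs, for example `to_variant_object(map('a', 2))`.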
How was this patch tested?
Several unit tests with codegen enabled/disabled.
Was this patch authored or co-authored using generative AI tooling?
Yes. Generated-by: GitHub Copilot, perplexity.ai
I haven't generated the golden files yet as I don't remember which commands to run. I'll figure it out based on test failures.
Note to reviewers: There is currently a bug when using UTF8_LCASE collation. I am looking into it.
scala> sql("""select to_variant_object(map("a" collate utf8_lcase, 2))""").collect()
org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase optimization failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace. SQLSTATE: XX000
This issue (the UTF8_LCASE collation bug) has been fixed.
@HyukjinKwon Can you go over the Python changes in this PR?
@cloud-fan Can you go over this PR again whenever you're available?
@cloud-fan There may be an issue with the Scala linter test. It was passing earlier and is failing after a very minor commit. It says there are lint failures in the sql/connect and connector/connect spaces which this PR is not even modifying. It recommends a command to fix these issues but that command is making several unrelated changes across the codebase.
The lint failure is unrelated. Thanks, merging to master!