[SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE
What changes were proposed in this pull request?
Changes serialization and deserialization of collated strings so that the collation information is stored in the metadata of the enclosing struct field, and then read back from there during parsing.
The serialized format will look something like this:
{
  "type": "struct",
  "fields": [
    {
      "name": "colName",
      "type": "string",
      "nullable": true,
      "metadata": {
        "__COLLATIONS": {
          "colName": "UNICODE"
        }
      }
    }
  ]
}
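As a sketch of the read path (with hypothetical helper names, not Spark's actual parser), a client can look up the `__COLLATIONS` map in the field metadata and reattach the collation to the plain string type:

```python
import json

# Hypothetical sketch of the read path: map each field name to its
# collation using the __COLLATIONS entry in the field's metadata.
# `collations_for_field` is illustrative, not part of Spark's API.

SCHEMA_JSON = """
{
  "type": "struct",
  "fields": [
    {
      "name": "colName",
      "type": "string",
      "nullable": true,
      "metadata": {
        "__COLLATIONS": {"colName": "UNICODE"}
      }
    }
  ]
}
"""

def collations_for_field(field: dict) -> dict:
    """Return the collation map stored in the field's metadata, if any."""
    return field.get("metadata", {}).get("__COLLATIONS", {})

schema = json.loads(SCHEMA_JSON)
for field in schema["fields"]:
    collations = collations_for_field(field)
    # A plain "string" type with an entry under the field's own name
    # is parsed back as a collated string; old clients that ignore
    # the metadata simply see a normal string column.
    if field["type"] == "string" and field["name"] in collations:
        print(f'{field["name"]}: string collate {collations[field["name"]]}')
```

Note that a client unaware of `__COLLATIONS` still parses a valid schema with an ordinary string column, which is the backward-compatibility point of this change.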
If we have a map we will add the suffixes .key and .value in the metadata:
{
  "type": "struct",
  "fields": [
    {
      "name": "mapField",
      "type": {
        "type": "map",
        "keyType": "string",
        "valueType": "string",
        "valueContainsNull": true
      },
      "nullable": true,
      "metadata": {
        "__COLLATIONS": {
          "mapField.key": "UNICODE",
          "mapField.value": "UNICODE"
        }
      }
    }
  ]
}
It is a similar story for arrays (we will add an .element suffix). There can be multiple suffixes when working with deeply nested data types (e.g. Map[String, Array[Array[String]]] - see tests for this example).
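The suffix convention above can be sketched as a small recursive walk (a hypothetical helper over dict-shaped types, not Spark's actual code) that collects a metadata path for every collated string nested inside a field:

```python
# Hypothetical sketch of how __COLLATIONS paths could be derived for a
# field with nested types. Types are modeled as plain dicts/strings,
# loosely mirroring the JSON schema format; this is not Spark's code.

def collation_paths(path: str, data_type, collations: dict) -> dict:
    """Recursively collect metadata paths for collated string types."""
    if isinstance(data_type, str):
        # Leaf type, e.g. "string collate UNICODE" vs a plain "string".
        if data_type.startswith("string collate "):
            collations[path] = data_type.removeprefix("string collate ")
    elif data_type["type"] == "map":
        collation_paths(path + ".key", data_type["keyType"], collations)
        collation_paths(path + ".value", data_type["valueType"], collations)
    elif data_type["type"] == "array":
        collation_paths(path + ".element", data_type["elementType"], collations)
    return collations

# The nested example from the text: Map[String, Array[Array[String]]]
# with UNICODE-collated strings at both leaves.
nested = {
    "type": "map",
    "keyType": "string collate UNICODE",
    "valueType": {
        "type": "array",
        "elementType": {
            "type": "array",
            "elementType": "string collate UNICODE",
            "containsNull": True,
        },
        "containsNull": True,
    },
    "valueContainsNull": True,
}

print(collation_paths("mapField", nested, {}))
# {'mapField.key': 'UNICODE', 'mapField.value.element.element': 'UNICODE'}
```

The paths produced this way are exactly the keys that end up under `__COLLATIONS` in the enclosing struct field's metadata.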
Why are the changes needed?
Putting collation info in field metadata is the only way to avoid breaking old clients reading new tables with collations. CharVarcharUtils does a similar thing, but this approach is much less hacky and friendlier for all third-party clients - which is especially important since Delta also uses Spark for schema ser/de.
It will also remove the need for the additional logic introduced in #46083 to strip collations before writing to HMS, as this way the tables will be fully HMS-compatible.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
With unit tests.
Was this patch authored or co-authored using generative AI tooling?
No
@cloud-fan please take a look when you have the time
@cloud-fan all checks passing, can we merge this?
thanks, merging to master!
@cloud-fan I looked into HMS code a bit, and it seems that we can't save StructField metadata there, so I guess we will still have to keep converting schema with collation to schema without when creating a table in hive even though collations are no longer a type?
I think so. String type with collation should be normal string type in the Hive table schema, so that other engines can still read it. We only keep the collation info in the Spark-specific table schema JSON string in table properties.