[SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE
What changes were proposed in this pull request?
Changes serialization and deserialization of collated strings so that the collation information is stored in the metadata of the enclosing struct field, and then read back from there during parsing.
The serialized format will look something like this:
{
  "type": "struct",
  "fields": [
    {
      "name": "colName",
      "type": "string",
      "nullable": true,
      "metadata": {
        "__COLLATIONS": {
          "colName": "UNICODE"
        }
      }
    }
  ]
}
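As a sketch of the read path (with hypothetical helper names, not Spark's actual parser), a client can look up the `__COLLATIONS` map in the field metadata and reattach the collation to the plain string type:

```python
import json

# Hypothetical sketch of the read path: map each field name to its
# collation using the __COLLATIONS entry in the field's metadata.
# `collations_for_field` is illustrative, not part of Spark's API.

SCHEMA_JSON = """
{
  "type": "struct",
  "fields": [
    {
      "name": "colName",
      "type": "string",
      "nullable": true,
      "metadata": {
        "__COLLATIONS": {"colName": "UNICODE"}
      }
    }
  ]
}
"""

def collations_for_field(field: dict) -> dict:
    """Return the collation map stored in the field's metadata, if any."""
    return field.get("metadata", {}).get("__COLLATIONS", {})

schema = json.loads(SCHEMA_JSON)
for field in schema["fields"]:
    collations = collations_for_field(field)
    # A plain "string" type with an entry under the field's own name
    # is parsed back as a collated string; old clients that ignore
    # the metadata simply see a normal string column.
    if field["type"] == "string" and field["name"] in collations:
        print(f'{field["name"]}: string collate {collations[field["name"]]}')
```

Note that a client unaware of `__COLLATIONS` still parses a valid schema with an ordinary string column, which is the backward-compatibility point of this change.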
If we have a map we will add the suffixes .key and .value in the metadata:
{
  "type": "struct",
  "fields": [
    {
      "name": "mapField",
      "type": {
        "type": "map",
        "keyType": "string",
        "valueType": "string",
        "valueContainsNull": true
      },
      "nullable": true,
      "metadata": {
        "__COLLATIONS": {
          "mapField.key": "UNICODE",
          "mapField.value": "UNICODE"
        }
      }
    }
  ]
}
It is a similar story for arrays (we will add an .element suffix). There can be multiple suffixes when working with deeply nested data types (e.g. Map[String, Array[Array[String]]] - see tests for this example).
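The suffix convention above can be sketched as a small recursive walk (a hypothetical helper over dict-shaped types, not Spark's actual code) that collects a metadata path for every collated string nested inside a field:

```python
# Hypothetical sketch of how __COLLATIONS paths could be derived for a
# field with nested types. Types are modeled as plain dicts/strings,
# loosely mirroring the JSON schema format; this is not Spark's code.

def collation_paths(path: str, data_type, collations: dict) -> dict:
    """Recursively collect metadata paths for collated string types."""
    if isinstance(data_type, str):
        # Leaf type, e.g. "string collate UNICODE" vs a plain "string".
        if data_type.startswith("string collate "):
            collations[path] = data_type.removeprefix("string collate ")
    elif data_type["type"] == "map":
        collation_paths(path + ".key", data_type["keyType"], collations)
        collation_paths(path + ".value", data_type["valueType"], collations)
    elif data_type["type"] == "array":
        collation_paths(path + ".element", data_type["elementType"], collations)
    return collations

# The nested example from the text: Map[String, Array[Array[String]]]
# with UNICODE-collated strings at both leaves.
nested = {
    "type": "map",
    "keyType": "string collate UNICODE",
    "valueType": {
        "type": "array",
        "elementType": {
            "type": "array",
            "elementType": "string collate UNICODE",
            "containsNull": True,
        },
        "containsNull": True,
    },
    "valueContainsNull": True,
}

print(collation_paths("mapField", nested, {}))
# {'mapField.key': 'UNICODE', 'mapField.value.element.element': 'UNICODE'}
```

The paths produced this way are exactly the keys that end up under `__COLLATIONS` in the enclosing struct field's metadata.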
Why are the changes needed?
Putting collation info in field metadata is the only way to avoid breaking old clients reading new tables with collations. CharVarcharUtils does a similar thing, but this approach is much less hacky and friendlier for all third-party clients - which is especially important since Delta also uses Spark for schema ser/de.
It will also remove the need for the additional logic introduced in #46083 to strip collations before writing to HMS, as this way the tables will be fully HMS-compatible.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
With unit tests.
Was this patch authored or co-authored using generative AI tooling?
No
@cloud-fan please take a look when you have the time
@cloud-fan all checks passing, can we merge this?
thanks, merging to master!
@cloud-fan I looked into HMS code a bit, and it seems that we can't save StructField metadata there, so I guess we will still have to keep converting schema with collation to schema without when creating a table in hive even though collations are no longer a type?
I think so. String type with collation should be normal string type in the Hive table schema, so that other engines can still read it. We only keep the collation info in the Spark-specific table schema JSON string in table properties.