[SPARK-54621][SQL] Merge Into Update Set * preserve nested fields if …coerceNestedTypes is enabled
What changes were proposed in this pull request?
The 'struct coercion' feature for MERGE INTO (which lets an assignment succeed when a struct with fewer fields is assigned to a struct with more fields) was turned off behind a flag in https://github.com/apache/spark/pull/53229 because its behavior was ambiguous, but it was not removed because the community wanted to try it.
We still want to keep it behind a flag, but we now choose which behavior to support when the flag is on. In particular, we want UPDATE SET * to expand to all nested struct fields, so that existing nested struct fields missing from the source are preserved rather than set to null.
Why are the changes needed?
@aokolnychyi tested the feature and thinks that, even though it is behind the experimental flag, we should for now take the stance that UPDATE SET * expands to all nested fields rather than only to top-level columns.
The rationale being:
- it's always safer not to overwrite user values with null
- Spark in general tries to treat nested fields like columns
- there's already a way for the user to overwrite the whole struct (and nullify fields missing from the source) by specifying the struct explicitly, i.e. UPDATE SET struct = source.struct
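To make the difference between the two behaviors concrete, here is a minimal sketch in plain Python. It models rows and structs as dicts and is not Spark code; the function names, the sample row, and the field names (`s`, `c1`, `c2`) are all hypothetical and chosen only for illustration.

```python
# A plain-Python model of the two candidate semantics; this is NOT Spark code.
# Structs are modeled as dicts; all names and values here are hypothetical.

def assign_whole_struct(target, source):
    """Model of `UPDATE SET struct = source.struct` with struct coercion:
    target fields that are missing from the source become None."""
    if isinstance(target, dict) and isinstance(source, dict):
        return {
            name: assign_whole_struct(value, source[name]) if name in source else None
            for name, value in target.items()
        }
    return source


def assign_per_field(target, source):
    """Model of expanding the assignment to nested fields:
    target fields that are missing from the source keep their value."""
    if isinstance(target, dict) and isinstance(source, dict):
        return {
            name: assign_per_field(value, source[name]) if name in source else value
            for name, value in target.items()
        }
    return source


def update_star_top_level(target_row, source_row):
    """UPDATE SET * read as one whole-column assignment per top-level column."""
    return {
        col: assign_whole_struct(value, source_row[col])
        for col, value in target_row.items()
    }


def update_star_nested(target_row, source_row):
    """UPDATE SET * read as one assignment per (nested) leaf field."""
    return assign_per_field(target_row, source_row)


# The source struct has fewer fields than the target struct.
target_row = {"id": 1, "s": {"c1": 10, "c2": "keep"}}
source_row = {"id": 1, "s": {"c1": 20}}

top_level_result = update_star_top_level(target_row, source_row)
nested_result = update_star_nested(target_row, source_row)
```

In this model the top-level reading replaces `s.c2` with None, while the nested reading leaves `s.c2` untouched, which is the behavior this PR chooses for UPDATE SET * when the flag is on.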
Does this PR introduce any user-facing change?
No, the whole feature is new and hidden behind an experimental flag.
How was this patch tested?
Existing tests (some expected outputs change so that the affected fields are no longer null)
Was this patch authored or co-authored using generative AI tooling?
No
Hi @dongjoon-hyun, sorry for the back and forth here. As in the description, @aokolnychyi explained that he prefers UPDATE SET * to refer to nested fields, for the reasons above. The whole feature (MERGE INTO struct coercion) is still under an experimental flag and off by default, but we want to take this stance for when the flag is on.
Btw, this is not new code; it's the same code from https://github.com/apache/spark/pull/53149 that was removed and is now being brought back.
No problem, but I believe this is applicable to the master branch only, @szehon-ho .
So, +1 for 4.2.0 for the proposal although I didn't take a look at the code yet.
Hi, @dongjoon-hyun , @aokolnychyi mentioned it would be good to get this into 4.1, because we are still releasing the 'struct coercion' feature, albeit behind a flag, so he wanted to start it off with the better choice. Code-wise it's the same as before the revert, although the whole thing is now behind a flag. From the comments on https://github.com/apache/spark/pull/53229, it seems the community is interested in this feature. I'll ping him to comment as well.
Sorry but I still believe this fits for Apache Spark 4.2.0 (after checking the code again). This is only for Apache Spark 4.2.0, @szehon-ho . We are ramping down instead of ramping up.
Does this change anything when MERGE_INTO_NESTED_TYPE_COERCION_ENABLED is false?
Yes, it should not; that flag should be the guard for the whole feature (nested type coercion).
This PR actually fixes an issue I discovered while testing MERGE with 4.1 RC in Iceberg. I believe the current logic in Spark 4.1 is a regression and leads to data loss (we replace existing nested fields with nulls, for instance). Spark 4.0 was a lot stricter than 4.1 and some of the 4.1 behavior is invalid.