coral icon indicating copy to clipboard operation
coral copied to clipboard

[Coral-Schema] handle single-element unions when the supplied column schema does not specify <uniontype>

Open ruolin59 opened this issue 2 months ago • 4 comments

What changes are proposed in this pull request, and why are they necessary?

When Avro schemas define single-element unions (e.g., ["string"] for primitives, [{"type":"record",...}] for array items/map values), the schema merge logic in MergeHiveSchemaWithAvro was incorrectly unwrapping them, changing non-nullable types to nullable types or losing the union structure entirely.

This issue occurs because some table systems (like OpenHouse) don't represent single-element unions as uniontype<...> in the Hive schema. When the Hive schema simply shows string or struct<...> instead of uniontype<string> or uniontype<struct<...>>, the merge logic treats them as regular types and unwraps the Avro union.

Changes made:

  1. HiveSchemaWithPartnerVisitor.java: Added logic at the top level of the visit() method to:

    • Detect when the Avro partner schema has a single-element union but the Hive TypeInfo doesn't declare a union (via shouldUnwrapPartner())
    • Temporarily unwrap the partner before the switch statement for proper type matching
    • Rewrap the result after processing to preserve the original Avro structure
  2. SchemaUtilities.java:

    • Added wrapInSingleElementUnion() utility method along with extractSingleElementUnion()
    • These methods provide a consistent API for single-element union handling
  3. Key behavior: The fix only applies when there's a mismatch between Hive and Avro:

    • If Hive declares uniontype<...>, normal union processing occurs (no unwrap/rewrap)
    • If Hive doesn't declare uniontype but Avro has a single-element union, we unwrap for matching then rewrap to preserve structure

This ensures that Avro schemas with single-element unions (commonly found in avro.schema.literal properties) maintain their exact structure after merging, preventing unintended nullability changes.

How was this patch tested?

Added two comprehensive unit tests in MergeHiveSchemaWithAvroTests:

  1. shouldHandleSingleElementUnionsInAllTypes:

    • Tests single-element unions in primitives (["string"], ["boolean"]), array items, and map values
    • Hive schema does NOT use uniontype (simulating OpenHouse-style schemas)
    • Verifies that all single-element unions are preserved in the output
  2. shouldHandleSingleElementUnionsWithHiveUnionType:

    • Tests backward compatibility when Hive explicitly declares uniontype<...>
    • Includes both regular multi-type unions and single-element unions
    • Ensures the fix doesn't interfere with normal union processing

Both tests verify that:

  • The input Avro schema structure is preserved (single-element unions remain wrapped)
  • Field names, nullability, docs, and default values are maintained
  • Complex nested structures (records within arrays/maps) work correctly

All existing tests continue to pass, confirming no regressions.

Also manually ran integration tests in spark, by generating a jar. Confirmed that the original symptoms are now fixed, ie. view on top of the table that had the single-element-union now show correct nullability for all fields

Also ran integration regression tests with coral-tools:

rexec ./coral-tools/scripts/rdevRegressionCheck.sh trino 2.2.54 https://github.com/ruolin59/coral.git fix-array-and-map-union-extraction 1-10

diff coral_v2.2.snapshot_failures.txt coral_v2.2.54_failures.txt yields no outputs; diff coral_v2.2.54_TrinoSQL.txt coral_v2.2.snapshot_TrinoSQL.txt yields no outputs; diff coral_v2.2.54_successes.txt coral_v2.2.snapshot_successes.txt yields no outputs

ruolin59 avatar Oct 30 '25 17:10 ruolin59

The description is not quite adding up since "unions" are not a thing in Coral. Could you rewrite the description using "records" which is the actual primitive? That would help understand the problem better.

wmoustafa avatar Oct 31 '25 04:10 wmoustafa

If you really need to use unions for the description, you may talk about unions in the Hive type system, then records in the Coral type system but you have to describe the whole flow and conversion chain with that segregation (but not mix unions with Coral type system description). If unions are not needed and we can state the problem just with records, then even better.

wmoustafa avatar Oct 31 '25 04:10 wmoustafa

@wmoustafa the "union" here isn't a union related to Hive, but union as related to avro literal schema definition. This is addressed in the description where it says

This causes a type mismatch when comparing against Hive schemas, which don't represent these structures as unions, leading to incorrect schema generation where fields that should be non-nullable (like "type": "string") become nullable (like "type": ["null", "string"])

the "union" here is more referring to the fact that it's de-facto either a union of [element, null], or "single-union" where it's just a union of [element]. Not sure if that make sense

ruolin59 avatar Oct 31 '25 23:10 ruolin59

From Coral-Schema’s perspective, it converges hive schema & avro schema literal. The input here is a "single-element union", but it isn’t represented as a "uniontype" in the passed hive schema, which is unexpected. This appears to be a caller-side bug. The caller should supply it explicitly as uniontype; we shouldn’t introduce special handling in Coral-Schema to infer user intent.

aastha25 avatar Nov 17 '25 17:11 aastha25