[Coral-Schema] handle single-element unions when the supplied column schema does not specify <uniontype>
What changes are proposed in this pull request, and why are they necessary?
When Avro schemas define single-element unions (e.g., ["string"] for primitives, [{"type":"record",...}] for array items/map values), the schema merge logic in MergeHiveSchemaWithAvro was incorrectly unwrapping them, changing non-nullable types to nullable types or losing the union structure entirely.
This issue occurs because some table systems (like OpenHouse) don't represent single-element unions as uniontype<...> in the Hive schema. When the Hive schema simply shows string or struct<...> instead of uniontype<string> or uniontype<struct<...>>, the merge logic treats them as regular types and unwraps the Avro union.
Changes made:
-
HiveSchemaWithPartnerVisitor.java: Added logic at the top level of thevisit()method to:- Detect when the Avro partner schema has a single-element union but the Hive
TypeInfodoesn't declare a union (viashouldUnwrapPartner()) - Temporarily unwrap the partner before the switch statement for proper type matching
- Rewrap the result after processing to preserve the original Avro structure
- Detect when the Avro partner schema has a single-element union but the Hive
-
SchemaUtilities.java:- Added
wrapInSingleElementUnion()utility method along withextractSingleElementUnion() - These methods provide a consistent API for single-element union handling
- Added
-
Key behavior: The fix only applies when there's a mismatch between Hive and Avro:
- If Hive declares
uniontype<...>, normal union processing occurs (no unwrap/rewrap) - If Hive doesn't declare
uniontypebut Avro has a single-element union, we unwrap for matching then rewrap to preserve structure
- If Hive declares
This ensures that Avro schemas with single-element unions (commonly found in avro.schema.literal properties) maintain their exact structure after merging, preventing unintended nullability changes.
How was this patch tested?
Added two comprehensive unit tests in MergeHiveSchemaWithAvroTests:
-
shouldHandleSingleElementUnionsInAllTypes:- Tests single-element unions in primitives (
["string"],["boolean"]), array items, and map values - Hive schema does NOT use
uniontype(simulating OpenHouse-style schemas) - Verifies that all single-element unions are preserved in the output
- Tests single-element unions in primitives (
-
shouldHandleSingleElementUnionsWithHiveUnionType:- Tests backward compatibility when Hive explicitly declares
uniontype<...> - Includes both regular multi-type unions and single-element unions
- Ensures the fix doesn't interfere with normal union processing
- Tests backward compatibility when Hive explicitly declares
Both tests verify that:
- The input Avro schema structure is preserved (single-element unions remain wrapped)
- Field names, nullability, docs, and default values are maintained
- Complex nested structures (records within arrays/maps) work correctly
All existing tests continue to pass, confirming no regressions.
Also manually ran integration tests in spark, by generating a jar. Confirmed that the original symptoms are now fixed, ie. view on top of the table that had the single-element-union now show correct nullability for all fields
Also ran integration regression tests with coral-tools:
rexec ./coral-tools/scripts/rdevRegressionCheck.sh trino 2.2.54 https://github.com/ruolin59/coral.git fix-array-and-map-union-extraction 1-10
diff coral_v2.2.snapshot_failures.txt coral_v2.2.54_failures.txt yields no outputs;
diff coral_v2.2.54_TrinoSQL.txt coral_v2.2.snapshot_TrinoSQL.txt yields no outputs;
diff coral_v2.2.54_successes.txt coral_v2.2.snapshot_successes.txt yields no outputs
The description is not quite adding up since "unions" are not a thing in Coral. Could you rewrite the description using "records" which is the actual primitive? That would help understand the problem better.
If you really need to use unions for the description, you may talk about unions in the Hive type system, then records in the Coral type system but you have to describe the whole flow and conversion chain with that segregation (but not mix unions with Coral type system description). If unions are not needed and we can state the problem just with records, then even better.
@wmoustafa the "union" here isn't a union related to Hive, but union as related to avro literal schema definition. This is addressed in the description where it says
This causes a type mismatch when comparing against Hive schemas, which don't represent these structures as unions, leading to incorrect schema generation where fields that should be non-nullable (like "type": "string") become nullable (like "type": ["null", "string"])
the "union" here is more referring to the fact that it's de-facto either a union of [element, null], or "single-union" where it's just a union of [element]. Not sure if that make sense
From Coral-Schema’s perspective, it converges hive schema & avro schema literal. The input here is a "single-element union", but it isn’t represented as a "uniontype" in the passed hive schema, which is unexpected. This appears to be a caller-side bug. The caller should supply it explicitly as uniontype; we shouldn’t introduce special handling in Coral-Schema to infer user intent.