embucket-labs
[DISCUSSION] Align unnamed expressions naming with Snowflake
The original issue is #1118
✅ Goal
We want the projection column names to look like SYSTEM$TYPEOF(2 / 3) and not the auto-generated internal form like arrow_typeof(Int64(2) / Int64(3)). This improves usability and aligns with how Snowflake shows column names for expressions in SELECT.
🔧 Options Overview
1. Patch Expr::schema_name formatting (SchemaDisplay)
- Description: Update the formatting logic used for displaying column names in Expr::schema_name or display_name.
- Pros:
- Full control over column naming.
- Consistent and automatic across all planners.
- Cons:
- A lot of changes in upstream DataFusion code.
- Will require many test updates.
- High maintenance in a fork.
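To make the goal concrete, here is a minimal sketch of the two naming styles side by side. It does not use DataFusion: the `Expr` enum, `internal_name`, `snowflake_name`, and `typeof_two_thirds` are hypothetical stand-ins, and the `arrow_typeof` to `SYSTEM$TYPEOF` mapping is assumed from the example in the goal.

```rust
// Hypothetical stand-in for DataFusion's `Expr`; only what the example needs.
enum Expr {
    Int64(i64),
    Div(Box<Expr>, Box<Expr>),
    Call { name: String, args: Vec<Expr> },
}

// Verbose naming, modeled on the internal form quoted in the goal.
fn internal_name(e: &Expr) -> String {
    match e {
        Expr::Int64(v) => format!("Int64({})", v),
        Expr::Div(l, r) => format!("{} / {}", internal_name(l), internal_name(r)),
        Expr::Call { name, args } => {
            let args: Vec<String> = args.iter().map(internal_name).collect();
            format!("{}({})", name, args.join(", "))
        }
    }
}

// Snowflake-style naming: bare literals plus an (assumed) mapping from the
// internal function name to the SQL-facing one.
fn snowflake_name(e: &Expr) -> String {
    match e {
        Expr::Int64(v) => v.to_string(),
        Expr::Div(l, r) => format!("{} / {}", snowflake_name(l), snowflake_name(r)),
        Expr::Call { name, args } => {
            let shown = match name.as_str() {
                "arrow_typeof" => "SYSTEM$TYPEOF",
                other => other,
            };
            let args: Vec<String> = args.iter().map(snowflake_name).collect();
            format!("{}({})", shown, args.join(", "))
        }
    }
}

// Builds the expression from the goal: arrow_typeof(2 / 3).
fn typeof_two_thirds() -> Expr {
    Expr::Call {
        name: "arrow_typeof".to_string(),
        args: vec![Expr::Div(
            Box::new(Expr::Int64(2)),
            Box::new(Expr::Int64(3)),
        )],
    }
}
```

In a real patch the second formatter would have to replace the first everywhere a schema name is produced, which is where the test churn and fork maintenance cost come from.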
2. Override schema_name() in specific ScalarUDFImpl implementations
- Description: Customize the column naming logic per function (e.g. arrow_typeof) by overriding ScalarUDFImpl::schema_name().
- Pros:
- Low-risk change, localized to specific UDFs.
- Avoids forking or modifying upstream Expr.
- Compatible with existing DataFusion extension points.
- Cons:
- Only works for UDFs we control.
- Still shows verbose args like Int64(2) unless display logic is also adjusted.
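The shape of Option 2 can be sketched without DataFusion as a trait with a default naming method that one implementation overrides. Everything here (`Arg`, `ScalarUdf`, `ArrowTypeof`, `SystemTypeof`) is a hypothetical stand-in for illustration; the real `ScalarUDFImpl::schema_name()` works on DataFusion expressions and its exact signature may differ.

```rust
// Hypothetical, simplified stand-in for an argument expression.
enum Arg {
    Int64(i64),
    Div(Box<Arg>, Box<Arg>),
}

impl Arg {
    // Verbose rendering, modeled on the internal form (e.g. `Int64(2)`).
    fn verbose(&self) -> String {
        match self {
            Arg::Int64(v) => format!("Int64({})", v),
            Arg::Div(l, r) => format!("{} / {}", l.verbose(), r.verbose()),
        }
    }

    // Compact rendering with bare literals (e.g. `2`).
    fn compact(&self) -> String {
        match self {
            Arg::Int64(v) => v.to_string(),
            Arg::Div(l, r) => format!("{} / {}", l.compact(), r.compact()),
        }
    }
}

// Simplified stand-in for the UDF extension point: the default name is the
// function name applied to the verbose argument names.
trait ScalarUdf {
    fn name(&self) -> &str;

    fn schema_name(&self, args: &[Arg]) -> String {
        let args: Vec<String> = args.iter().map(Arg::verbose).collect();
        format!("{}({})", self.name(), args.join(", "))
    }
}

// Keeps the default naming.
struct ArrowTypeof;

impl ScalarUdf for ArrowTypeof {
    fn name(&self) -> &str {
        "arrow_typeof"
    }
}

// Same function, but overrides the column name it reports.
struct SystemTypeof;

impl ScalarUdf for SystemTypeof {
    fn name(&self) -> &str {
        "arrow_typeof"
    }

    fn schema_name(&self, args: &[Arg]) -> String {
        // The override has to render the arguments itself; reusing the
        // verbose form would give `SYSTEM$TYPEOF(Int64(2) / Int64(3))`.
        let args: Vec<String> = args.iter().map(Arg::compact).collect();
        format!("SYSTEM$TYPEOF({})", args.join(", "))
    }
}
```

The comment inside the override is the second con above in practice: changing the function name alone is not enough, so each overriding UDF also needs its own (or a shared) compact argument formatter.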
3. Inject Alias(...) into the AST early (e.g. during parsing)
- Description: Modify the AST early (during parsing or initial planning) to insert Alias for certain expressions like ScalarFunction.
- Pros:
- Simple in principle.
- Works before schema is built.
- Cons:
- Dangerous: hard to determine when aliasing is legal (e.g., WHERE, GROUP BY).
- Easy to introduce semantic bugs.
- Breaks user expectations (e.g. raw expression vs named column).
4. Add an OptimizerRule or AnalyzerRule to inject aliases
- Description: Add an optimizer or analyzer rule that traverses projections and wraps supported expressions in Alias(...) with cleaner names.
- Pros:
- More robust than AST modification.
- Scoped to Projection nodes only.
- Fits naturally into DataFusion optimizer.
- Cons:
- Does not update the schema unless Projection::try_new_with_schema is also used.
- Must mirror logic in schema_name to match aliases.
- Might cause confusion if other rules mutate the projection later.
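A rough model of such a rule, written against a toy plan rather than DataFusion's `LogicalPlan`: `Expr`, `Plan`, `clean_name`, and `alias_projections` are all hypothetical names, and function arguments are kept as pre-rendered strings to keep the sketch short. It only touches expressions in Projection nodes and leaves Filter predicates alone.

```rust
// Toy expression and plan types; stand-ins, not DataFusion's.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    // Arguments are pre-rendered strings to keep the example small.
    Call { name: String, args: Vec<String> },
    Alias(Box<Expr>, String),
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Projection { exprs: Vec<Expr>, input: Box<Plan> },
    Filter { predicate: Expr, input: Box<Plan> },
    Scan(String),
}

// Returns a cleaner name for expressions the rule supports, `None` otherwise.
// Existing aliases are not supported on purpose, so user aliases win.
fn clean_name(e: &Expr) -> Option<String> {
    match e {
        Expr::Call { name, args } if name == "arrow_typeof" => {
            Some(format!("SYSTEM$TYPEOF({})", args.join(", ")))
        }
        _ => None,
    }
}

// Walks the plan and wraps supported projection expressions in `Alias`.
fn alias_projections(plan: Plan) -> Plan {
    match plan {
        Plan::Projection { exprs, input } => Plan::Projection {
            exprs: exprs
                .into_iter()
                .map(|e| match clean_name(&e) {
                    Some(n) => Expr::Alias(Box::new(e), n),
                    None => e,
                })
                .collect(),
            input: Box::new(alias_projections(*input)),
        },
        // Predicates are left as-is; only the input is rewritten.
        Plan::Filter { predicate, input } => Plan::Filter {
            predicate,
            input: Box::new(alias_projections(*input)),
        },
        other => other,
    }
}
```

Because already-aliased expressions are skipped, running the rule twice gives the same plan. The toy model has no schema at all, so it says nothing about the schema-update concern listed in the cons.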
5. Post-process DataFrame schema after execution
- Description: After building the final plan or executing it, replace the column names in the schema for final display.
- Pros:
- Localized and safe: only affects presentation.
- Easy to implement — just rename schema fields before display.
- Avoids deep planner changes.
- Cons:
- Cosmetic only — logical plan and schema internals still use verbose names.
- May mismatch if users rely on schema introspection in code.
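As a sketch of the post-processing idea, the snippet below renames fields in a toy schema right before display. `Field`, `strip_int64`, `display_name`, and `rename_fields` are hypothetical, and the rewrite is a deliberately naive string substitution that only knows about the `arrow_typeof(` prefix and `Int64(n)` literals from the example in the goal.

```rust
// Hypothetical stand-in for a result schema field.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

// Replaces every `Int64(x)` wrapper with the bare `x`.
fn strip_int64(s: &str) -> String {
    let mut out = String::new();
    let mut rest = s;
    while let Some(start) = rest.find("Int64(") {
        out.push_str(&rest[..start]);
        let after = &rest[start + "Int64(".len()..];
        match after.find(')') {
            Some(end) => {
                out.push_str(&after[..end]);
                rest = &after[end + 1..];
            }
            None => {
                // Unbalanced input: keep the remainder untouched.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

// Naive internal-name to display-name rewrite (assumed mapping).
fn display_name(internal: &str) -> String {
    strip_int64(&internal.replace("arrow_typeof(", "SYSTEM$TYPEOF("))
}

// Returns a copy of the schema with renamed fields; types are untouched.
fn rename_fields(schema: &[Field], rename: impl Fn(&str) -> String) -> Vec<Field> {
    schema
        .iter()
        .map(|f| Field {
            name: rename(&f.name),
            data_type: f.data_type.clone(),
        })
        .collect()
}
```

Calling `rename_fields(&schema, display_name)` leaves the original schema as it was, which matches the "cosmetic only" con: anything that inspects the plan's own schema still sees the verbose names.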
✅ Recommendation
For Snowflake-like behavior and minimal upstream impact:
- Short-term: Option 2 (override schema_name() in UDFs) is the safest and cleanest, especially for custom functions like arrow_typeof.
- Long-term: Combine Option 4 (AnalyzerRule for alias insertion) with consistent naming logic shared with schema_name(), to enforce consistent column names in logical plans.
- Option 5 can be used as a fallback to clean up display output post-query without impacting execution.
- Option 1 is only worth pursuing if upstream maintainers agree to accept such invasive changes.