[DISCUSSION] Align unnamed expressions naming with Snowflake

Open osipovartem opened this issue 6 months ago • 0 comments

The original issue is #1118

✅ Goal

We want projection column names to look like SYSTEM$TYPEOF(2 / 3) rather than the auto-generated internal form, such as arrow_typeof(Int64(2) / Int64(3)). This improves usability and matches how Snowflake names expression columns in SELECT results.

🔧 Options Overview

1. Patch Expr::schema_name formatting (SchemaDisplay)

Description: Update the formatting logic used for displaying column names in Expr::schema_name or display_name.
Pros:
- Full control over column naming.
- Consistent and automatic across all planners.
Cons:
- A lot of changes in upstream DataFusion code.
- Will require many test updates.
- High maintenance in a fork.
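
To make the difference between the two naming styles concrete, here is a minimal sketch using a toy expression tree. The `Expr` enum, `verbose_name`, `snowflake_name`, and `display_fn_name` below are hypothetical stand-ins, not DataFusion's actual `Expr` or `SchemaDisplay` code; the mapping from `arrow_typeof` to `SYSTEM$TYPEOF` and the uppercasing of other function names are assumptions taken from the example in the goal.

```rust
// Toy stand-in for an expression tree. This is NOT DataFusion's `Expr`.
enum Expr {
    Int64(i64),
    Div(Box<Expr>, Box<Expr>),
    Call { name: String, args: Vec<Expr> },
}

// Verbose form, modeled on the internal name quoted in the goal:
// literals are wrapped in their type name.
fn verbose_name(e: &Expr) -> String {
    match e {
        Expr::Int64(v) => format!("Int64({})", v),
        Expr::Div(l, r) => format!("{} / {}", verbose_name(l), verbose_name(r)),
        Expr::Call { name, args } => {
            let args: Vec<String> = args.iter().map(verbose_name).collect();
            format!("{}({})", name, args.join(", "))
        }
    }
}

// Assumed mapping from internal function names to user-facing ones.
fn display_fn_name(internal: &str) -> String {
    match internal {
        "arrow_typeof" => "SYSTEM$TYPEOF".to_string(),
        other => other.to_uppercase(),
    }
}

// Snowflake-like form: bare literals and user-facing function names.
fn snowflake_name(e: &Expr) -> String {
    match e {
        Expr::Int64(v) => v.to_string(),
        Expr::Div(l, r) => format!("{} / {}", snowflake_name(l), snowflake_name(r)),
        Expr::Call { name, args } => {
            let args: Vec<String> = args.iter().map(snowflake_name).collect();
            format!("{}({})", display_fn_name(name), args.join(", "))
        }
    }
}

// The expression from the goal: arrow_typeof(2 / 3).
fn sample() -> Expr {
    Expr::Call {
        name: "arrow_typeof".to_string(),
        args: vec![Expr::Div(Box::new(Expr::Int64(2)), Box::new(Expr::Int64(3)))],
    }
}
```

The point of the sketch is that every expression variant needs its own formatting branch, which is why this option touches so much upstream code.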

2. Override schema_name() in specific ScalarUDFImpl implementations

Description: Customize the column naming logic per function (e.g. arrow_typeof) by overriding ScalarUDFImpl::schema_name().
Pros:
- Low-risk change, localized to specific UDFs.
- Avoids forking or modifying upstream Expr.
- Compatible with existing DataFusion extension points.
Cons:
- Only works for UDFs we control.
- Still shows verbose args like Int64(2) unless display logic is also adjusted.
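
The shape of this option can be sketched with a simplified trait. `ScalarFn`, `ArrowTypeof`, and `SystemTypeof` are hypothetical; the real `ScalarUDFImpl::schema_name` takes expression arguments and returns a `Result`, so check the signature in the DataFusion version in use. Here the arguments are assumed to arrive already rendered as strings, which also illustrates the second con: the override changes the function name but not the verbose arguments.

```rust
// Simplified stand-in for a scalar UDF trait. NOT DataFusion's `ScalarUDFImpl`.
trait ScalarFn {
    fn name(&self) -> &str;

    // Default naming: internal function name plus the rendered arguments.
    fn schema_name(&self, args: &[String]) -> String {
        format!("{}({})", self.name(), args.join(", "))
    }
}

// Uses the default naming.
struct ArrowTypeof;

impl ScalarFn for ArrowTypeof {
    fn name(&self) -> &str {
        "arrow_typeof"
    }
}

// Same function, but with the column name overridden.
struct SystemTypeof;

impl ScalarFn for SystemTypeof {
    fn name(&self) -> &str {
        "arrow_typeof"
    }

    // Only the function name changes; the arguments are still whatever
    // the caller rendered, verbose or not.
    fn schema_name(&self, args: &[String]) -> String {
        format!("SYSTEM$TYPEOF({})", args.join(", "))
    }
}
```

With the argument rendered as `Int64(2) / Int64(3)`, the default yields `arrow_typeof(Int64(2) / Int64(3))` and the override yields `SYSTEM$TYPEOF(Int64(2) / Int64(3))`.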

3. Inject Alias(...) into the AST early (e.g. during parsing)

Description: Modify the AST early, during parsing or at the start of planning, to wrap certain expressions (e.g. ScalarFunction) in Alias.
Pros:
- Simple in principle.
- Works before schema is built.
Cons:
- Dangerous: hard to determine when aliasing is legal (e.g., WHERE, GROUP BY).
- Easy to introduce semantic bugs.
- Breaks user expectations (e.g. raw expression vs named column).

4. Add an OptimizerRule or AnalyzerRule to inject aliases

Description: Add an optimizer or analyzer rule that traverses projections and wraps supported expressions in Alias(...) with cleaner names.
Pros:
- More robust than AST modification.
- Scoped to Projection nodes only.
- Fits naturally into DataFusion optimizer.
Cons:
- Does not update the schema unless Projection::try_new_with_schema is also used.
- Must mirror logic in schema_name to match aliases.
- Might cause confusion if other rules mutate the projection later.
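
The core of such a rule might look like the sketch below. It uses a toy `Expr` enum rather than DataFusion's plan types, and `clean_name` and `alias_projection` are hypothetical helpers; a real rule would implement the relevant DataFusion rule trait and rebuild the Projection node, which is omitted here. The uppercase function naming is an assumption for illustration.

```rust
// Toy expression type. NOT DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Call { name: String, args: Vec<Expr> },
    Alias(Box<Expr>, String),
}

// Assumed "clean" naming: bare literals, uppercase function names.
fn clean_name(e: &Expr) -> String {
    match e {
        Expr::Column(c) => c.clone(),
        Expr::Literal(v) => v.to_string(),
        Expr::Call { name, args } => {
            let args: Vec<String> = args.iter().map(clean_name).collect();
            format!("{}({})", name.to_uppercase(), args.join(", "))
        }
        Expr::Alias(_, alias) => alias.clone(),
    }
}

// Wrap top-level function calls in a projection list with an alias.
// Columns and existing aliases are left untouched, so user-supplied
// aliases win and running the rule twice changes nothing.
fn alias_projection(exprs: Vec<Expr>) -> Vec<Expr> {
    exprs
        .into_iter()
        .map(|e| {
            if let Expr::Call { .. } = e {
                let name = clean_name(&e);
                Expr::Alias(Box::new(e), name)
            } else {
                e
            }
        })
        .collect()
}
```

Only top-level projection expressions are wrapped, which keeps the aliasing out of WHERE and GROUP BY. Skipping expressions that are already aliased keeps the rewrite idempotent, which matters if the rule can run more than once.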

5. Post-process DataFrame schema after execution

Description: After building the final plan or executing it, replace the column names in the schema for final display.
Pros:
- Localized and safe: only affects presentation.
- Easy to implement — just rename schema fields before display.
- Avoids deep planner changes.
Cons:
- Cosmetic only — logical plan and schema internals still use verbose names.
- May mismatch if users rely on schema introspection in code.
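
As a rough illustration of the post-processing step, the sketch below renames fields by lookup just before display. `display_names` and the rename table are hypothetical; how the mapping from internal to display names would be produced is left open here, and real code would operate on schema fields rather than plain strings.

```rust
use std::collections::HashMap;

// Return display names for the given schema field names. Fields without
// an entry in `renames` keep their original name.
fn display_names(fields: &[String], renames: &HashMap<String, String>) -> Vec<String> {
    fields
        .iter()
        .map(|f| renames.get(f).cloned().unwrap_or_else(|| f.clone()))
        .collect()
}
```

Because only the returned names change, anything that inspects the plan or the original schema still sees the verbose names, which is the mismatch noted in the cons.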

✅ Recommendation

For Snowflake-like behavior and minimal upstream impact:

Short-term: Option 2 (override schema_name() in UDFs) is the safest and cleanest, especially for custom functions like arrow_typeof.
Long-term: Combine Option 4 (AnalyzerRule for alias insertion) with consistent naming logic shared with schema_name(), to enforce consistent column names in logical plans.
Option 5 can be used as a fallback to clean up display output post-query without impacting execution.
Option 1 is only worth pursuing if upstream maintainers agree to accept such invasive changes.

Jun 18 '25 16:06 osipovartem