
[DISCUSSION] Align unnamed expressions naming with Snowflake

Open osipovartem opened this issue 6 months ago • 0 comments

The original issue is #1118

Goal

We want projection column names to read like SYSTEM$TYPEOF(2 / 3) rather than the auto-generated internal form arrow_typeof(Int64(2) / Int64(3)). This improves usability and matches how Snowflake names expression columns in SELECT.

🔧 Options Overview

1. Patch Expr::schema_name formatting (SchemaDisplay)

  • Description: Update the formatting logic used for displaying column names in Expr::schema_name or display_name.
  • Pros:
    • Full control over column naming.
    • Consistent and automatic across all planners.
  • Cons:
    • A lot of changes in upstream DataFusion code.
    • Will require many test updates.
    • High maintenance in a fork.

2. Override schema_name() in specific ScalarUDFImpl implementations

  • Description: Customize the column naming logic per function (e.g. arrow_typeof) by overriding ScalarUDFImpl::schema_name().
  • Pros:
    • Low-risk change, localized to specific UDFs.
    • Avoids forking or modifying upstream Expr.
    • Compatible with existing DataFusion extension points.
  • Cons:
    • Only works for UDFs we control.
    • Still shows verbose args like Int64(2) unless display logic is also adjusted.
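The trade-off in Option 2 can be illustrated with a small std-only sketch. The `Expr` enum and all function names below are hypothetical stand-ins, not DataFusion types; in DataFusion the hook would be `ScalarUDFImpl::schema_name()` as described above. The sketch only shows why renaming the function alone still leaves `Int64(2)` in the column name, and what changes once argument display is adjusted too.

```rust
// Toy expression tree. Illustrative only, NOT DataFusion's `Expr`.
enum Expr {
    Int64(i64),
    Binary(Box<Expr>, char, Box<Expr>),
    Call(String, Vec<Expr>),
}

// Hypothetical mapping from an internal function name to its display name.
fn display_fn_name(name: &str) -> String {
    match name {
        "arrow_typeof" => "SYSTEM$TYPEOF".to_string(),
        other => other.to_string(),
    }
}

// Default-style name: literals keep their type wrapper.
fn internal_name(e: &Expr) -> String {
    match e {
        Expr::Int64(v) => format!("Int64({})", v),
        Expr::Binary(l, op, r) => {
            format!("{} {} {}", internal_name(l), op, internal_name(r))
        }
        Expr::Call(name, args) => {
            let args: Vec<String> = args.iter().map(internal_name).collect();
            format!("{}({})", name, args.join(", "))
        }
    }
}

// Function-level override only: the function is renamed,
// but arguments are still rendered with the internal naming.
fn udf_override_name(e: &Expr) -> String {
    match e {
        Expr::Call(name, args) => {
            let args: Vec<String> = args.iter().map(internal_name).collect();
            format!("{}({})", display_fn_name(name), args.join(", "))
        }
        other => internal_name(other),
    }
}

// Override plus adjusted argument display: literals are shown bare.
fn full_display_name(e: &Expr) -> String {
    match e {
        Expr::Int64(v) => v.to_string(),
        Expr::Binary(l, op, r) => {
            format!("{} {} {}", full_display_name(l), op, full_display_name(r))
        }
        Expr::Call(name, args) => {
            let args: Vec<String> = args.iter().map(full_display_name).collect();
            format!("{}({})", display_fn_name(name), args.join(", "))
        }
    }
}

// The expression from the issue: arrow_typeof(2 / 3).
fn sample() -> Expr {
    Expr::Call(
        "arrow_typeof".to_string(),
        vec![Expr::Binary(
            Box::new(Expr::Int64(2)),
            '/',
            Box::new(Expr::Int64(3)),
        )],
    )
}
```

For `sample()`, `udf_override_name` yields `SYSTEM$TYPEOF(Int64(2) / Int64(3))`, while `full_display_name` yields the target form `SYSTEM$TYPEOF(2 / 3)`.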

3. Inject Alias(...) into the AST early (e.g. during parsing)

  • Description: Modify the AST during planning to insert Alias for certain expressions like ScalarFunction.
  • Pros:
    • Simple in principle.
    • Works before schema is built.
  • Cons:
    • Dangerous: hard to determine when aliasing is legal (e.g., WHERE, GROUP BY).
    • Easy to introduce semantic bugs.
    • Breaks user expectations (e.g. raw expression vs named column).

4. Add an OptimizerRule or AnalyzerRule to inject aliases

  • Description: Add an optimization rule to traverse projections and wrap supported expressions in Alias(...) with cleaner names.
  • Pros:
    • More robust than AST modification.
    • Scoped to Projection nodes only.
    • Fits naturally into DataFusion optimizer.
  • Cons:
    • Does not update the schema unless Projection::try_new_with_schema is also used.
    • Must mirror logic in schema_name to match aliases.
    • Might cause confusion if other rules mutate the projection later.
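A rough, std-only model of the alias-injection idea in Option 4 is sketched below. Everything here (`Expr`, `clean_name`, `inject_aliases`) is a hypothetical stand-in rather than DataFusion's `AnalyzerRule` API. It assumes the rule should leave bare columns and user-supplied aliases alone, and that running it twice must not stack aliases.

```rust
// Toy projection expressions. Illustrative only, NOT DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    // Function call with integer literal arguments, kept simple on purpose.
    Call(String, Vec<i64>),
    Alias(Box<Expr>, String),
}

// Name the output column would get without any rewriting.
fn output_name(e: &Expr) -> String {
    match e {
        Expr::Column(n) => n.clone(),
        Expr::Alias(_, n) => n.clone(),
        Expr::Call(name, args) => {
            let args: Vec<String> =
                args.iter().map(|v| format!("Int64({})", v)).collect();
            format!("{}({})", name, args.join(", "))
        }
    }
}

// Cleaner name for expressions the rule supports; `None` means "leave as is".
// The arrow_typeof -> SYSTEM$TYPEOF mapping is an assumption for this sketch.
fn clean_name(e: &Expr) -> Option<String> {
    match e {
        Expr::Call(name, args) => {
            let shown = if name.as_str() == "arrow_typeof" {
                "SYSTEM$TYPEOF"
            } else {
                name.as_str()
            };
            let args: Vec<String> = args.iter().map(|v| v.to_string()).collect();
            Some(format!("{}({})", shown, args.join(", ")))
        }
        // Columns and existing aliases already have their final names.
        _ => None,
    }
}

// Wrap supported top-level projection expressions in an alias.
fn inject_aliases(projection: Vec<Expr>) -> Vec<Expr> {
    projection
        .into_iter()
        .map(|e| {
            let name = clean_name(&e);
            match name {
                Some(n) => Expr::Alias(Box::new(e), n),
                None => e,
            }
        })
        .collect()
}
```

Because already-aliased expressions return `None` from `clean_name`, a second pass is a no-op, which matters if the rule can run more than once or after other rules have touched the projection.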

5. Post-process DataFrame schema after execution

  • Description: After building the final plan or executing it, replace the column names in the schema for final display.
  • Pros:
    • Localized and safe: only affects presentation.
    • Easy to implement — just rename schema fields before display.
    • Avoids deep planner changes.
  • Cons:
    • Cosmetic only — logical plan and schema internals still use verbose names.
    • May mismatch if users rely on schema introspection in code.
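Option 5 amounts to a positional rename of the result schema's fields just before display. The sketch below uses a hypothetical `Field` struct and `rename_for_display` helper rather than Arrow's schema types, and assumes the display names are computed elsewhere (for example, from the original SQL text).

```rust
// Toy schema field. Illustrative only, NOT Arrow's `Field`.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

// Return a copy of `fields` with names replaced positionally.
// Data types are untouched; the input schema is left as it was.
fn rename_for_display(
    fields: &[Field],
    display_names: &[&str],
) -> Result<Vec<Field>, String> {
    if fields.len() != display_names.len() {
        return Err(format!(
            "expected {} display names, got {}",
            fields.len(),
            display_names.len()
        ));
    }
    Ok(fields
        .iter()
        .zip(display_names.iter())
        .map(|(f, n)| Field {
            name: n.to_string(),
            data_type: f.data_type.clone(),
        })
        .collect())
}
```

The original fields keep their verbose names, which is the mismatch noted in the cons: code that inspects the plan's schema sees one name while the displayed result shows another.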

✅ Recommendation

For Snowflake-like behavior and minimal upstream impact:

  • Short-term: Option 2 (override schema_name() in UDFs) is the safest and cleanest, especially for custom functions like arrow_typeof.
  • Long-term: Combine Option 4 (an AnalyzerRule for alias insertion) with naming logic shared with schema_name(), so that column names stay consistent across logical plans.
  • Option 5 can be used as a fallback to clean up display output post-query without impacting execution.
  • Option 1 is only worth pursuing if upstream maintainers agree to accept such invasive changes.

osipovartem · Jun 18 '25 16:06