datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

move the type coercion out of the optimizer and refactor the optimizer

Open liukun4515 opened this issue 2 years ago • 5 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When I do this pr https://github.com/apache/arrow-datafusion/pull/3396/files# I am stuck by many issue about the type in the optimizer framework.

In other SQL or database system, the data type coercion should be done before any optimization.

But we do it in the physical phase, after the @andygrove work and add the TypeCoercion rule in the optimizer. It can be done in the logical phase.

After the https://github.com/apache/arrow-datafusion/pull/3396/files# merge, we need to do the refactor.

If the data type is not right, all the operation and optimization will meet some wired issue.

cc @andygrove @alamb

Describe the solution you'd like A clear and concise description of what you want to happen.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

liukun4515 avatar Sep 22 '22 02:09 liukun4515

I meet the issues, and all of them will be fixed for this issue

  • [x] https://github.com/apache/arrow-datafusion/issues/3583
  • [x] https://github.com/apache/arrow-datafusion/issues/3565
  • [x] https://github.com/apache/arrow-datafusion/issues/3568
  • [x] https://github.com/apache/arrow-datafusion/issues/3557
  • [x] https://github.com/apache/arrow-datafusion/issues/3556
  • [x] https://github.com/apache/arrow-datafusion/issues/3587
  • [x] https://github.com/apache/arrow-datafusion/issues/3622 change the rule
  • [x] https://github.com/apache/arrow-datafusion/issues/3635
  • [x] ScalarFunction
    • [x] https://github.com/apache/arrow-datafusion/issues/3731
    • [x] temporary fix for the contract function: https://github.com/apache/arrow-datafusion/pull/3721
  • [x] ScalarUDF https://github.com/apache/arrow-datafusion/issues/3734
  • [x] WindowFunction
  • [x] AggregateUDF/AggregateFunction https://github.com/apache/arrow-datafusion/issues/3752 https://github.com/apache/arrow-datafusion/issues/3623 AggregateFunction
  • [x] https://github.com/apache/arrow-datafusion/issues/3582
  • [x] https://github.com/apache/arrow-datafusion/issues/3673
  • [x] Consolidate type coercion: https://github.com/apache/arrow-datafusion/pull/3728 @alamb

liukun4515 avatar Sep 22 '22 02:09 liukun4515

I agree that type coercion doesn't really belong in an "optimizer" as it actually changes the meaning of the exprs (on purpose) where the other optimzier passes are supposed to keep the meaning of the plan the same, but make it faster in some way.

I recommend changing the code to explicitly run the type coercion logic immediately prior to running any other optimization passes. I also think the type coercion logic should not depend on any other simplification (such as constant folding) -- it should depend only on the types.

Once type coercion is done, then we can run the expr simplifier pass to clean up / simplify the expressions

alamb avatar Sep 22 '22 11:09 alamb

I agree - the type coercion should be a mandatory pass that is done after planning.

Dandandan avatar Sep 22 '22 11:09 Dandandan

I agree that type coercion doesn't really belong in an "optimizer" as it actually changes the meaning of the exprs (on purpose) where the other optimzier passes are supposed to keep the meaning of the plan the same, but make it faster in some way.

I recommend changing the code to explicitly run the type coercion logic immediately prior to running any other optimization passes. I also think the type coercion logic should not depend on any other simplification (such as constant folding) -- it should depend only on the types.

Once type coercion is done, then we can run the expr simplifier pass to clean up / simplify the expressions

You got it.

liukun4515 avatar Sep 22 '22 12:09 liukun4515

Spark has an Analysis phase that runs before the optimizer. Maybe we can learn from their model. Within the analysis phase, they have a number of type coercion rules that they run:

  override def typeCoercionRules: List[Rule[LogicalPlan]] =
    UnpivotCoercion ::
    WidenSetOperationTypes ::
    new CombinedTypeCoercionRule(
      InConversion ::
      PromoteStrings ::
      DecimalPrecision ::
      BooleanEquality ::
      FunctionArgumentConversion ::
      ConcatCoercion ::
      MapZipWithCoercion ::
      EltCoercion ::
      CaseWhenCoercion ::
      IfCoercion ::
      StackCoercion ::
      Division ::
      IntegralDivision ::
      ImplicitTypeCasts ::
      DateTimeOperations ::
      WindowFrameCoercion ::
      StringLiteralCoercion :: Nil) :: Nil

andygrove avatar Sep 22 '22 17:09 andygrove

Spark has an Analysis phase that runs before the optimizer. Maybe we can learn from their model. Within the analysis phase, they have a number of type coercion rules that they run:

  override def typeCoercionRules: List[Rule[LogicalPlan]] =
    UnpivotCoercion ::
    WidenSetOperationTypes ::
    new CombinedTypeCoercionRule(
      InConversion ::
      PromoteStrings ::
      DecimalPrecision ::
      BooleanEquality ::
      FunctionArgumentConversion ::
      ConcatCoercion ::
      MapZipWithCoercion ::
      EltCoercion ::
      CaseWhenCoercion ::
      IfCoercion ::
      StackCoercion ::
      Division ::
      IntegralDivision ::
      ImplicitTypeCasts ::
      DateTimeOperations ::
      WindowFrameCoercion ::
      StringLiteralCoercion :: Nil) :: Nil

Yes, I think we can follow other database or sql system.

But It's long way to go and refactor.

My plan is to do some refactor, and make good base for the next big refactor like the spark or other system.

liukun4515 avatar Sep 23 '22 12:09 liukun4515

My plan is to do some refactor, and make good base for the next big refactor like the spark or other system.

Sounds like a great plan to me

alamb avatar Sep 23 '22 12:09 alamb

Here is a PR that contributes to this goal: https://github.com/apache/arrow-datafusion/pull/3728

alamb avatar Oct 05 '22 18:10 alamb

Here is a PR that contributes to this goal: #3728

After this pr merged I will do other Expr for type coercion

liukun4515 avatar Oct 06 '22 07:10 liukun4515

after supporting all expr for type coercion, we can move this rule out of the optimizer, and use it as a separate module

liukun4515 avatar Oct 09 '22 02:10 liukun4515