datafusion
datafusion copied to clipboard
move the type coercion out of the optimizer and refactor the optimizer
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When I do this pr https://github.com/apache/arrow-datafusion/pull/3396/files# I am stuck by many issue about the type in the optimizer framework.
In other SQL or database system, the data type coercion should be done before any optimization.
But we do it in the physical phase, after the @andygrove work and add the TypeCoercion
rule in the optimizer. It can be done in the logical phase.
After the https://github.com/apache/arrow-datafusion/pull/3396/files# merge, we need to do the refactor.
If the data type is not right, all the operation and optimization will meet some wired issue.
cc @andygrove @alamb
Describe the solution you'd like A clear and concise description of what you want to happen.
Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.
Additional context Add any other context or screenshots about the feature request here.
I meet the issues, and all of them will be fixed for this issue
- [x] https://github.com/apache/arrow-datafusion/issues/3583
- [x] https://github.com/apache/arrow-datafusion/issues/3565
- [x] https://github.com/apache/arrow-datafusion/issues/3568
- [x] https://github.com/apache/arrow-datafusion/issues/3557
- [x] https://github.com/apache/arrow-datafusion/issues/3556
- [x] https://github.com/apache/arrow-datafusion/issues/3587
- [x] https://github.com/apache/arrow-datafusion/issues/3622 change the rule
- [x] https://github.com/apache/arrow-datafusion/issues/3635
- [x] ScalarFunction
- [x] https://github.com/apache/arrow-datafusion/issues/3731
- [x] temporary fix for the contract function: https://github.com/apache/arrow-datafusion/pull/3721
- [x] ScalarUDF https://github.com/apache/arrow-datafusion/issues/3734
- [x] WindowFunction
- [x] AggregateUDF/AggregateFunction https://github.com/apache/arrow-datafusion/issues/3752 https://github.com/apache/arrow-datafusion/issues/3623 AggregateFunction
- [x] https://github.com/apache/arrow-datafusion/issues/3582
- [x] https://github.com/apache/arrow-datafusion/issues/3673
- [x] Consolidate type coercion: https://github.com/apache/arrow-datafusion/pull/3728 @alamb
I agree that type coercion doesn't really belong in an "optimizer" as it actually changes the meaning of the exprs (on purpose) where the other optimzier passes are supposed to keep the meaning of the plan the same, but make it faster in some way.
I recommend changing the code to explicitly run the type coercion logic immediately prior to running any other optimization passes. I also think the type coercion logic should not depend on any other simplification (such as constant folding) -- it should depend only on the types.
Once type coercion is done, then we can run the expr simplifier pass to clean up / simplify the expressions
I agree - the type coercion should be a mandatory pass that is done after planning.
I agree that type coercion doesn't really belong in an "optimizer" as it actually changes the meaning of the exprs (on purpose) where the other optimzier passes are supposed to keep the meaning of the plan the same, but make it faster in some way.
I recommend changing the code to explicitly run the type coercion logic immediately prior to running any other optimization passes. I also think the type coercion logic should not depend on any other simplification (such as constant folding) -- it should depend only on the types.
Once type coercion is done, then we can run the expr simplifier pass to clean up / simplify the expressions
You got it.
Spark has an Analysis phase that runs before the optimizer. Maybe we can learn from their model. Within the analysis phase, they have a number of type coercion rules that they run:
override def typeCoercionRules: List[Rule[LogicalPlan]] =
UnpivotCoercion ::
WidenSetOperationTypes ::
new CombinedTypeCoercionRule(
InConversion ::
PromoteStrings ::
DecimalPrecision ::
BooleanEquality ::
FunctionArgumentConversion ::
ConcatCoercion ::
MapZipWithCoercion ::
EltCoercion ::
CaseWhenCoercion ::
IfCoercion ::
StackCoercion ::
Division ::
IntegralDivision ::
ImplicitTypeCasts ::
DateTimeOperations ::
WindowFrameCoercion ::
StringLiteralCoercion :: Nil) :: Nil
Spark has an Analysis phase that runs before the optimizer. Maybe we can learn from their model. Within the analysis phase, they have a number of type coercion rules that they run:
override def typeCoercionRules: List[Rule[LogicalPlan]] = UnpivotCoercion :: WidenSetOperationTypes :: new CombinedTypeCoercionRule( InConversion :: PromoteStrings :: DecimalPrecision :: BooleanEquality :: FunctionArgumentConversion :: ConcatCoercion :: MapZipWithCoercion :: EltCoercion :: CaseWhenCoercion :: IfCoercion :: StackCoercion :: Division :: IntegralDivision :: ImplicitTypeCasts :: DateTimeOperations :: WindowFrameCoercion :: StringLiteralCoercion :: Nil) :: Nil
Yes, I think we can follow other database or sql system.
But It's long way to go and refactor.
My plan is to do some refactor, and make good base for the next big refactor like the spark or other system.
My plan is to do some refactor, and make good base for the next big refactor like the spark or other system.
Sounds like a great plan to me
Here is a PR that contributes to this goal: https://github.com/apache/arrow-datafusion/pull/3728
Here is a PR that contributes to this goal: #3728
After this pr merged I will do other Expr for type coercion
after supporting all expr for type coercion, we can move this rule out of the optimizer, and use it as a separate module