datafusion
datafusion copied to clipboard
simplify_expressions don't support different data type for binary
Describe the bug A clear and concise description of what the bug is.
in the test case select_distinct_from
Run the test case, we will meet the error for optimizer:
[2022-09-21T05:22:53Z WARN datafusion_optimizer::optimizer] Skipping optimizer rule 'simplify_expressions' due to unexpected error: Internal error: The type of Int64 IS DISTINCT FROM Int32 o
f binary physical should be same. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
[2022-09-21T05:22:53Z DEBUG datafusion_optimizer::optimizer] After apply pre_cast_lit_in_comparison rule:
In the optimizer we should these rules about type coercion first, and then do other optimization. And the simplify expression should not depend on physical expr
To Reproduce Steps to reproduce the behavior:
Expected behavior A clear and concise description of what you expected to happen.
Additional context Add any other context about the problem here.
In the logical phase, we implement the expr simplify with the helper of the physical expr.
cc @alamb
block this pr https://github.com/apache/arrow-datafusion/pull/3396
I find many test cases meet this issue:
thread 'simplify_expressions::tests::now_less_than_timestamp' panicked at 'failed to optimize plan: Internal("The type of Int64 + Int32 of binary physical should be same")', datafusion/optimizer/src/simplify_expressions.rs:1911:14
If I remove the type coercion from the physical expr.
Now I will reassign the optimizer rule, the TypeCoercion should be the first rule in the optimizer.
@liukun4515 I wonder if we could make type coercion part of "simplify_expressions" 🤔 that way it would always be called at the appropriate time 🤔
I wonder if we could make type coercion part of "simplify_expressions" 🤔 that way it would always be called at the appropriate time 🤔
I think it is not a good way do this.
simplify_expressions rule can be called many times, but the type_coercion just can do only once.
We should do the type_coercion before all operation and optimization.
I think it is most important: the right data type system is the base of the all things I submit a issue to talk about this thing https://github.com/apache/arrow-datafusion/issues/3582
@alamb
I think it is not a good way do this.
Upon consideration I agree with your assessment that combining type coercion and simplify_expressions is not a good idea. Thank you for filing https://github.com/apache/arrow-datafusion/issues/3582
I think it is not a good way do this.
Upon consideration I agree with your assessment that combining type coercion and simplify_expressions is not a good idea. Thank you for filing #3582
If community think the refactor or https://github.com/apache/arrow-datafusion/issues/3582 is on the right/good path, I can do them next.
I think it is not a good way do this.
Upon consideration I agree with your assessment that combining type coercion and simplify_expressions is not a good idea. Thank you for filing #3582
But in the pr https://github.com/apache/arrow-datafusion/pull/3396, I follow your comments temporarily. We should remove the type coercion from the physical phase first.