fix: preserve more dictionaries when coercing types
Which issue does this PR close?
Closes https://github.com/apache/datafusion/issues/10220
Rationale for this change
Loosen the restriction on when type coercion will preserve dictionary types to prevent slow casting of columns with dictionary type.
What changes are included in this PR?
Are these changes tested?
There are already tests that would fail on type coercion. Let me know if there additional tests you'd like to see.
Are there any user-facing changes?
Certain coercion operations will now preserve dictionary encoding where before they would downcast to the value type
There is a test failure in sqllogictests
External error: query failed: DataFusion error: Error during planning: No function matches the given name and argument types 'coalesce(Int64, Dictionary(Int32, Int8))'. You might need to add explicit type casts.
Candidate functions:
coalesce(CoercibleT, .., CoercibleT)
[SQL] select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)'));
at test_files/scalar.slt:1794
here is the test case: https://github.com/erratic-pattern/arrow-datafusion/blob/3a6c04ec8f989ddcf3b13a871a782c0bf56b9f8e/datafusion/sqllogictest/test_files/scalar.slt#L1795
There could be unintended consequences to this change for casting so I will look into it
Can we please add a test to https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt with the explain plan showing the correct filter being added?
The test failure is from comparison_coercion being used in function argument type coercion for the coalesce function and other Signature::VariadicEqual functions.
https://github.com/apache/datafusion/blob/7fe8eeb698802776744cbba6d71abd30cc850968/datafusion/expr/src/type_coercion/functions.rs#L186-L201
There are other places comparison_coercion is used so we may want to look for additional unintended side effects of this change.
array prepend/append: https://github.com/apache/datafusion/blob/7fe8eeb698802776744cbba6d71abd30cc850968/datafusion/expr/src/type_coercion/functions.rs#L138
list type coercions: https://github.com/apache/datafusion/blob/7fe8eeb698802776744cbba6d71abd30cc850968/datafusion/expr/src/type_coercion/other.rs#L25-L34
CASE WHEN ... :
https://github.com/apache/datafusion/blob/7fe8eeb698802776744cbba6d71abd30cc850968/datafusion/expr/src/type_coercion/other.rs#L39-L54
IN subquery:
https://github.com/apache/datafusion/blob/7fe8eeb698802776744cbba6d71abd30cc850968/datafusion/optimizer/src/analyzer/type_coercion.rs#L151-L172
BETWEEN:
https://github.com/apache/datafusion/blob/7fe8eeb698802776744cbba6d71abd30cc850968/datafusion/optimizer/src/analyzer/type_coercion.rs#L236-L270
It looks like the failing test case is no longer throwing an error after I added @jayzhan211 's suggestion:
DataFusion CLI v37.1.0
> select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)'));
+----------------------------------------------------------------------------+
| coalesce(Int64(34),arrow_cast(Int64(123),Utf8("Dictionary(Int32, Int8)"))) |
+----------------------------------------------------------------------------+
| 34 |
+----------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.071 seconds.
I think it's now complaining that the type is no longer an integer and is instead Dictionary<Int32, Int8>
External error: query columns mismatch:
[SQL] select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)'));
[Expected] [I]
[Actual ] [?]
at test_files/scalar.slt:1794
Marking as draft as I think this PR is still in progress and the CI is not yet passing
I think it's now complaining that the type is no longer an integer and is instead Dictionary<Int32, Int8>
That seems like you could just update the test. Perhaps via https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest#updating-tests-completion-mode
I may try to approach a solution from a different angle here, to avoid needing to refactor the type coercion logic too much without any clear understanding of what the impacts could be.
I think a more specific optimization might look something like
ValueType(column) <binary op> CAST(<constant or expression> as ValueType)
to:
column <binary op> CAST(<constant or expression> as Dictionary<IndexType, ValueType>)
I may try to approach a solution from a different angle here, to avoid needing to refactor the type coercion logic too much without any clear understanding of what the impacts could be.
FWIW I think the coercion is the right approach here. Let me take a look at this PR and see if I can find something
Or even more generally, you could ignore whether the operands are column references or constants and just have:
ValueType(<expr> :: Dictionary<IndexType, ValueType>) <binary op> ValueType(<expr> :: <any type>)
becomes:
(<expr> :: Dictionary<IndexType, ValueType>) <binary op> CAST(<expr> as Dictionary<IndexType, ValueType>)
I may try to approach a solution from a different angle here, to avoid needing to refactor the type coercion logic too much without any clear understanding of what the impacts could be.
FWIW I think the coercion is the right approach here. Let me take a look at this PR and see if I can find something
I think it is the right approach for this one particular case, but the coalesce example in the tests is a good example of where you probably do want to convert to a simple integer type rather than converting everything to dictionaries.
Maybe we need to decouple the non-comparison type coercions from using the comparison_coercion function and have their own coercion function(s). For example VariadicEqual functions can use a new operand_coercion function which does the same thing as the previous comparison_coercion function, but with preserve_dictionaries flag set to false.
this comment also seems to hint at a possible refactor: https://github.com/apache/datafusion/blob/f47e74d6d52000eed6f9472bc2cc43a16b8814a8/datafusion/expr/src/type_coercion/functions.rs#L190-L194
we could potentially rewrite coalesce coercion to use maybe_data_type which allows restricting coercion to a list of output types
https://github.com/apache/datafusion/blob/f47e74d6d52000eed6f9472bc2cc43a16b8814a8/datafusion/expr/src/type_coercion/functions.rs#L262-L271
This diff shows a list of types that coalesce can accept. There might be others but this would serve as a good starting point: https://github.com/apache/datafusion/pull/9459/files#diff-9b644c479dfb999609b40e8da6d3b2a40c7adadf88296eeefd2f803201e7ab6dL25-L43
I think it's now complaining that the type is no longer an integer and is instead Dictionary<Int32, Int8>
That seems like you could just update the test. Perhaps via https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest#updating-tests-completion-mode
I updated the tests, but I'm still unsure if it's okay to ignore this change in behavior. It seems like this could be a regression in other cases where we might be converting things to dictionaries for no reason.
It would be nice to have some issue tracking this, or maybe there is one already.
A possibly easy change to make in the short-term is to refactor away from using comparison_coercion for non-comparisons and instead use coercion logic specifically tailored to those expressions.
I just feels like we're playing Jenga with type coercion logic right now. The coercion logic should prioritize avoiding expensive casts, I think.
I updated the tests, but I'm still unsure if it's okay to ignore this change in behavior. It seems like this could be a regression in other cases where we might be converting things to dictionaries for no reason.
I think the reason to convert to a dictonary is that the other side of the comparsion is already a dictionary, which thus avoids convering both arguments (though I may be missing something)
Maybe @viirya has some thoughts given he worked on coercion as part of https://github.com/apache/datafusion/pull/9459 recently
It would be nice to have some issue tracking this, or maybe there is one already.
Perhaps https://github.com/apache/datafusion/issues/5928 ?
There appear to be a bunch of open tickets about coercion https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+coercion
A possibly easy change to make in the short-term is to refactor away from using
comparison_coercionfor non-comparisons and instead use coercion logic specifically tailored to those expressions.
This seems reasonable to me (aka have special rules for coerce). Other functions like BETWEEN I think can be thought of as syntactic sugar for comparisons (namely x > low AND x < high)
I just feels like we're playing Jenga with type coercion logic right now. The coercion logic should prioritize avoiding expensive casts, I think.
The key to ensuring we don't break existing code is to rely on the tests.
I agree the type coercion logic could be improved -- specifically I think it needs to have some encapsulation rather than being spread out through a bunch of free functions.
Is this something you are interested in helping out with? I think the first thing to do would be to try and explain how the current logic works (and in that process we will likely uncover ways to make it better). Then, would improve the code the structure to encapsulate the logic (into structs / enums perhaps -- like I did recently in https://github.com/apache/datafusion/pull/10216). Once the logic is more encapsulated, then I think we'll be able to reason about it and feel good about how it fits into the overall picture
I think the difference in casting a scalar integer to a scalar dictionary is neglible. The difference casting a column to a different type is likely subtantial (though casting string --> dictionary doesn't require any data copying)
For this PR I suggest:
- We add tests showing the desired behavior (I will push a commit or two)
- Update the coercion tests to show the resulting types
Then as a follow on, if you want to take on the refactoring of type coecion we can start doing that.
Maybe we need to decouple the non-comparison type coercions from using the comparison_coercion function and have their own coercion function(s). For example VariadicEqual functions can use a new operand_coercion function which does the same thing as the previous comparison_coercion function, but with preserve_dictionaries flag set to false.
I'm wondering why coalesce has the signature VariadicEqual, since it takes a first non-null value, ideally, coercion is not necessary. I think VariadicAny is more suitable.
In this PR, coalesce do the coercion internally, I forgot why we do coercion but not returning first non-null value with the type it has
Output in this PR:
query IT
select coalesce(arrow_cast(1, 'Int32'), arrow_cast(1, 'Int64')), arrow_typeof(coalesce(arrow_cast(1, 'Int32'), arrow_cast(1, 'Int64')))
----
1 Int64
Expect:
1 Int32
Change to VariadicAny, I got Int32.
I'm wondering why coalesce has the signature VariadicEqual, since it takes a first non-null value, ideally, coercion is not necessary. I think
VariadicAnyis more suitable.In this PR, coalesce do the coercion internally, I forgot why we do coercion but not returning first non-null value with the type it has
I can't remember either but I remember it was tricky -- I suggest we open a new ticket / discussion about it rather than trying to change the behavior in this PR
rebased onto main
I had a chat with @erratic-pattern the plan:
We'll put this PR into draft mode
- He is going to make a ticket + PR will improve unwrap cast comparison to handle dictionaries
- I will review https://github.com/apache/datafusion/pull/10268 from @jayzhan211 and hopefully get that sorted out
- Once that is complete, we'll revisit the approach in this PR and decide how to proceed.
I believe this is superceded by https://github.com/apache/datafusion/pull/10323