chore: Bump arrow-rs to 53.1.0 and datafusion
Which issue does this PR close?
Rationale for this change
Arrow-rs 53.1.0 includes performance improvements
What changes are included in this PR?
Bumping to arrow-rs 53.1.0 and datafusion to a revision
How are these changes tested?
existing tests
@alamb @jayzhan211 I am still investigating but looks like there is a regression in DataFusion related to https://github.com/apache/datafusion/pull/12753
assertion `left == right` failed: Arrays with inconsistent types passed to MutableArrayData
left: Struct([Field { name: "a", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])
right: Struct([Field { name: "a", data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])
at comet::errors::init::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/errors.rs:151)
at <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/boxed.rs:2036)
at std::panicking::rust_panic_with_hook(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:799)
at std::panicking::begin_panic_handler::{{closure}}(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:664)
at std::sys_common::backtrace::__rust_end_short_backtrace(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:171)
at rust_begin_unwind(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652)
at core::panicking::panic_fmt(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72)
at core::panicking::assert_failed_inner(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:404)
at core::panicking::assert_failed(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:364)
at arrow_data::transform::MutableArrayData::with_capacities(/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-53.1.0/src/transform/mod.rs:418)
at datafusion_functions_nested::make_array::array_array(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:223)
at datafusion_functions_nested::make_array::make_array_inner(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:153)
at core::ops::function::Fn::call(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:79)
at datafusion_functions_nested::utils::make_scalar_function::{{closure}}(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/utils.rs:83)
at <datafusion_functions_nested::make_array::MakeArray as datafusion_expr::udf::ScalarUDFImpl>::invoke(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:96)
at datafusion_expr::udf::ScalarUDF::invoke(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/expr/src/udf.rs:197)
at <datafusion_physical_expr::scalar_function::ScalarFunctionExpr as datafusion_physical_expr_common::physical_expr::PhysicalExpr>::evaluate(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/physical-expr/src/scalar_function.rs:145)
at datafusion_physical_plan::projection::ProjectionStream::batch_project::{{closure}}(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/physical-plan/src/projection.rs:294)
Especially https://github.com/apache/datafusion/pull/12753/files#diff-f1e354d4fe26237064d8194e10a6008efa4f88e2b68b8a8352086a5d011180b8R108 type_union_resolution returns None for array that contains both Utf8 and Dictionary(Int32, Utf8)
Previously, coerce_types was coercing such cases. Not sure if this is an intentional change...
Just realized that type_union_resolution_coercion not handling struct...
We need to support Struct, but I'm not sure whether we should discard dictionary.
I assume if you have dictionary type at the first place, you expect to do some optimization based on dictionary type, so if we have primitive and dictionary type, I think it makes sense to keep dictionary type and coerce the primitive type to dictionary type.
Previous code utilize comparison coercion, in this case we just care about comparison therefore we don't need dictionary type at all.
Confirmed it was about adding structure support. Opened https://github.com/apache/datafusion/issues/12843 Ideal to fix it before the next release, otherwise it is a regression. @jayzhan211 @alamb For this specific test case to pass, I just needed to add
.or_else(|| struct_coercion(lhs_type, rhs_type))
at the end of type_union_resolution_coercion. Dictionary is fine, it seems to be working.
@andygrove @comphead @huaxingao @mbutrovich @parthchandra @viirya
Thanks @comphead I think the next DF release will happen soon, so we can bump in the next PR or if they release it first, I can update this PR.
Thanks @comphead I think the next DF release will happen soon, so we can bump in the next PR or if they release it first, I can update this PR.
I think we can start the DF 43 release prep this week
Thanks, merged @jayzhan211 @comphead @andygrove @parthchandra @viirya