datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

chore: Bump arrow-rs to 53.1.0 and datafusion

Open kazuyukitanimura opened this issue 1 year ago • 4 comments

Which issue does this PR close?

Rationale for this change

Arrow-rs 53.1.0 includes performance improvements

What changes are included in this PR?

Bumping to arrow-rs 53.1.0 and datafusion to a revision

How are these changes tested?

existing tests

kazuyukitanimura avatar Oct 08 '24 00:10 kazuyukitanimura

@alamb @jayzhan211 I am still investigating but looks like there is a regression in DataFusion related to https://github.com/apache/datafusion/pull/12753

assertion `left == right` failed: Arrays with inconsistent types passed to MutableArrayData
  left: Struct([Field { name: "a", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])
 right: Struct([Field { name: "a", data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])
        at comet::errors::init::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/errors.rs:151)
        at <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/boxed.rs:2036)
        at std::panicking::rust_panic_with_hook(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:799)
        at std::panicking::begin_panic_handler::{{closure}}(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:664)
        at std::sys_common::backtrace::__rust_end_short_backtrace(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:171)
        at rust_begin_unwind(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652)
        at core::panicking::panic_fmt(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72)
        at core::panicking::assert_failed_inner(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:404)
        at core::panicking::assert_failed(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:364)
        at arrow_data::transform::MutableArrayData::with_capacities(/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-53.1.0/src/transform/mod.rs:418)
        at datafusion_functions_nested::make_array::array_array(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:223)
        at datafusion_functions_nested::make_array::make_array_inner(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:153)
        at core::ops::function::Fn::call(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:79)
        at datafusion_functions_nested::utils::make_scalar_function::{{closure}}(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/utils.rs:83)
        at <datafusion_functions_nested::make_array::MakeArray as datafusion_expr::udf::ScalarUDFImpl>::invoke(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:96)
        at datafusion_expr::udf::ScalarUDF::invoke(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/expr/src/udf.rs:197)
        at <datafusion_physical_expr::scalar_function::ScalarFunctionExpr as datafusion_physical_expr_common::physical_expr::PhysicalExpr>::evaluate(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/physical-expr/src/scalar_function.rs:145)
        at datafusion_physical_plan::projection::ProjectionStream::batch_project::{{closure}}(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/physical-plan/src/projection.rs:294)

Especially https://github.com/apache/datafusion/pull/12753/files#diff-f1e354d4fe26237064d8194e10a6008efa4f88e2b68b8a8352086a5d011180b8R108 type_union_resolution returns None for array that contains both Utf8 and Dictionary(Int32, Utf8)

Previously, coerce_types was coercing such cases. Not sure if this is an intentional change...

kazuyukitanimura avatar Oct 09 '24 22:10 kazuyukitanimura

Just realized that type_union_resolution_coercion not handling struct...

kazuyukitanimura avatar Oct 09 '24 23:10 kazuyukitanimura

We need to support Struct, but I'm not sure whether we should discard dictionary.

I assume if you have dictionary type at the first place, you expect to do some optimization based on dictionary type, so if we have primitive and dictionary type, I think it makes sense to keep dictionary type and coerce the primitive type to dictionary type.

Previous code utilize comparison coercion, in this case we just care about comparison therefore we don't need dictionary type at all.

jayzhan211 avatar Oct 09 '24 23:10 jayzhan211

Confirmed it was about adding structure support. Opened https://github.com/apache/datafusion/issues/12843 Ideal to fix it before the next release, otherwise it is a regression. @jayzhan211 @alamb For this specific test case to pass, I just needed to add

.or_else(|| struct_coercion(lhs_type, rhs_type))

at the end of type_union_resolution_coercion. Dictionary is fine, it seems to be working.

kazuyukitanimura avatar Oct 10 '24 06:10 kazuyukitanimura

@andygrove @comphead @huaxingao @mbutrovich @parthchandra @viirya

kazuyukitanimura avatar Oct 11 '24 17:10 kazuyukitanimura

Thanks @comphead I think the next DF release will happen soon, so we can bump in the next PR or if they release it first, I can update this PR.

kazuyukitanimura avatar Oct 11 '24 21:10 kazuyukitanimura

Thanks @comphead I think the next DF release will happen soon, so we can bump in the next PR or if they release it first, I can update this PR.

I think we can start the DF 43 release prep this week

andygrove avatar Oct 14 '24 14:10 andygrove

Thanks, merged @jayzhan211 @comphead @andygrove @parthchandra @viirya

kazuyukitanimura avatar Oct 14 '24 22:10 kazuyukitanimura