Panic in `datafusion_expr::window_state::WindowAggState::update`
Describe the bug
Upgrading Comet to use DataFusion 48.0.0-rc2 causes tests to fail with an `attempt to subtract with overflow` panic. This did not happen with rc1. I have not debugged this yet to find the root cause.
PR: https://github.com/apache/datafusion-comet/pull/1853
failing build: https://github.com/apache/datafusion-comet/actions/runs/15491877086/job/43619110943?pr=1853
The relevant part of the stack trace is:
2025-06-06T13:57:54.1903145Z at datafusion_expr::window_state::WindowAggState::update(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/expr/src/window_state.rs:95)
2025-06-06T13:57:54.1905310Z at datafusion_physical_expr::window::window_expr::AggregateWindowExpr::aggregate_evaluate_stateful(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-expr/src/window/window_expr.rs:260)
2025-06-06T13:57:54.1920612Z at <datafusion_physical_expr::window::aggregate::PlainAggregateWindowExpr as datafusion_physical_expr::window::window_expr::WindowExpr>::evaluate_stateful(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-expr/src/window/aggregate.rs:148)
2025-06-06T13:57:54.1924024Z at datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream::compute_aggregates(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:983)
2025-06-06T13:57:54.1927398Z at datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream::poll_next_inner(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:1033)
2025-06-06T13:57:54.1930653Z at <datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream as futures_core::stream::Stream>::poll_next(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:949)
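For readers unfamiliar with this panic message: Rust raises it when an integer subtraction underflows in a build with overflow checks enabled, most commonly with unsigned types such as `usize`. The sketch below is not the code at `window_state.rs:95`. The function `rows_missing` and its parameter names are hypothetical, chosen only to show how `checked_sub` turns the same condition into a recoverable error instead of a panic.

```rust
/// Hypothetical helper (not DataFusion code): how many rows of a batch
/// still lack a computed result.
///
/// Writing `num_rows - last_calculated_index` directly would panic with
/// "attempt to subtract with overflow" in builds with overflow checks
/// whenever `last_calculated_index > num_rows`. `checked_sub` returns
/// `None` in that case, which is mapped to an error here.
fn rows_missing(num_rows: usize, last_calculated_index: usize) -> Result<usize, String> {
    num_rows.checked_sub(last_calculated_index).ok_or_else(|| {
        format!(
            "calculated index {last_calculated_index} exceeds batch rows {num_rows}"
        )
    })
}
```

Calling `rows_missing(10, 4)` yields `Ok(6)`, while `rows_missing(10, 20)` yields an `Err` rather than aborting the task.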
There was one PR between rc1 and rc2 specifically related to evaluating window expressions, so I wonder if that is the issue. I will try and confirm.
https://github.com/apache/datafusion/pull/16234
Full stack trace:
2025-06-06T13:57:54.1864287Z - aggregate window function for all types *** FAILED *** (406 milliseconds)
2025-06-06T13:57:54.1871363Z org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2045.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2045.0 (TID 5401) (62bae2d9d85a executor driver): org.apache.comet.CometNativeException: attempt to subtract with overflow
2025-06-06T13:57:54.1873529Z at comet::errors::init::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/errors.rs:151)
2025-06-06T13:57:54.1883399Z at <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/alloc/src/boxed.rs:1980)
2025-06-06T13:57:54.1894489Z at std::panicking::rust_panic_with_hook(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/panicking.rs:841)
2025-06-06T13:57:54.1895884Z at std::panicking::begin_panic_handler::{{closure}}(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/panicking.rs:699)
2025-06-06T13:57:54.1897662Z at std::sys::backtrace::__rust_end_short_backtrace(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/sys/backtrace.rs:168)
2025-06-06T13:57:54.1899012Z at __rustc::rust_begin_unwind(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/panicking.rs:697)
2025-06-06T13:57:54.1900180Z at core::panicking::panic_fmt(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/core/src/panicking.rs:75)
2025-06-06T13:57:54.1901495Z at core::panicking::panic_const::panic_const_sub_overflow(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/core/src/panicking.rs:178)
2025-06-06T13:57:54.1903145Z at datafusion_expr::window_state::WindowAggState::update(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/expr/src/window_state.rs:95)
2025-06-06T13:57:54.1905310Z at datafusion_physical_expr::window::window_expr::AggregateWindowExpr::aggregate_evaluate_stateful(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-expr/src/window/window_expr.rs:260)
2025-06-06T13:57:54.1920612Z at <datafusion_physical_expr::window::aggregate::PlainAggregateWindowExpr as datafusion_physical_expr::window::window_expr::WindowExpr>::evaluate_stateful(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-expr/src/window/aggregate.rs:148)
2025-06-06T13:57:54.1924024Z at datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream::compute_aggregates(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:983)
2025-06-06T13:57:54.1927398Z at datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream::poll_next_inner(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:1033)
2025-06-06T13:57:54.1930653Z at <datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream as futures_core::stream::Stream>::poll_next(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:949)
2025-06-06T13:57:54.1933599Z at <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-core-0.3.31/src/stream.rs:130)
2025-06-06T13:57:54.1935713Z at futures_util::stream::stream::StreamExt::poll_next_unpin(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/mod.rs:1638)
2025-06-06T13:57:54.1938604Z at <datafusion_physical_plan::projection::ProjectionStream as futures_core::stream::Stream>::poll_next(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-plan/src/projection.rs:354)
2025-06-06T13:57:54.1940894Z at <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-core-0.3.31/src/stream.rs:130)
2025-06-06T13:57:54.1942871Z at futures_util::stream::stream::StreamExt::poll_next_unpin(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/mod.rs:1638)
2025-06-06T13:57:54.1945055Z at <futures_util::stream::stream::next::Next<St> as core::future::future::Future>::poll(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/next.rs:32)
2025-06-06T13:57:54.1947663Z at futures_util::future::future::FutureExt::poll_unpin(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/future/future/mod.rs:558)
2025-06-06T13:57:54.1949835Z at <futures_util::async_await::poll::PollOnce<F> as core::future::future::Future>::poll(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/async_await/poll.rs:37)
2025-06-06T13:57:54.1952041Z at comet::execution::jni_api::Java_org_apache_comet_Native_executePlan::{{closure}}::{{closure}}::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/execution/jni_api.rs:438)
2025-06-06T13:57:54.1954070Z at tokio::runtime::park::CachedParkThread::block_on::{{closure}}(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/park.rs:284)
2025-06-06T13:57:54.1955846Z at tokio::task::coop::with_budget(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/task/coop/mod.rs:167)
2025-06-06T13:57:54.1957636Z at tokio::task::coop::budget(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/task/coop/mod.rs:133)
2025-06-06T13:57:54.1959325Z at tokio::runtime::park::CachedParkThread::block_on(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/park.rs:284)
2025-06-06T13:57:54.1961375Z at tokio::runtime::context::blocking::BlockingRegionGuard::block_on(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/context/blocking.rs:66)
2025-06-06T13:57:54.1963697Z at tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/scheduler/multi_thread/mod.rs:87)
2025-06-06T13:57:54.1965887Z at tokio::runtime::context::runtime::enter_runtime(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/context/runtime.rs:65)
2025-06-06T13:57:54.1968188Z at tokio::runtime::scheduler::multi_thread::MultiThread::block_on(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/scheduler/multi_thread/mod.rs:86)
2025-06-06T13:57:54.1970246Z at tokio::runtime::runtime::Runtime::block_on_inner(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/runtime.rs:358)
2025-06-06T13:57:54.1972087Z at tokio::runtime::runtime::Runtime::block_on(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/runtime.rs:330)
2025-06-06T13:57:54.1974189Z at comet::execution::jni_api::Java_org_apache_comet_Native_executePlan::{{closure}}::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/execution/jni_api.rs:438)
2025-06-06T13:57:54.1975895Z at comet::execution::tracing::with_trace(/__w/datafusion-comet/datafusion-comet/native/core/src/execution/tracing.rs:117)
2025-06-06T13:57:54.1977694Z at comet::execution::jni_api::Java_org_apache_comet_Native_executePlan::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/execution/jni_api.rs:395)
2025-06-06T13:57:54.1979212Z at comet::errors::curry::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/errors.rs:485)
2025-06-06T13:57:54.1980462Z at std::panicking::try::do_call(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/panicking.rs:589)
2025-06-06T13:57:54.1981370Z at __rust_try(__internal__:0)
2025-06-06T13:57:54.1982193Z at std::panicking::try(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/panicking.rs:552)
2025-06-06T13:57:54.1983410Z at std::panic::catch_unwind(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/panic.rs:359)
2025-06-06T13:57:54.1984614Z at comet::errors::try_unwrap_or_throw(/__w/datafusion-comet/datafusion-comet/native/core/src/errors.rs:499)
2025-06-06T13:57:54.1985938Z at Java_org_apache_comet_Native_executePlan(/__w/datafusion-comet/datafusion-comet/native/core/src/execution/jni_api.rs:375)
2025-06-06T13:57:54.1987315Z at <unknown>(__internal__:0)
I also see a correctness issue in another test related to windowed aggregates:
2025-06-06T14:15:31.2495550Z [info] - postgreSQL/window_part1.sql *** FAILED *** (4 seconds, 628 milliseconds)
2025-06-06T14:15:31.2496326Z [info] postgreSQL/window_part1.sql
2025-06-06T14:15:31.2496774Z [info] Expected "...10
2025-06-06T14:15:31.2498107Z [info] 10
2025-06-06T14:15:31.2499258Z [info] 10
2025-06-06T14:15:31.2501018Z [info] 10
2025-06-06T14:15:31.2502765Z [info] 10
2025-06-06T14:15:31.2503832Z [info] 10
2025-06-06T14:15:31.2505296Z [info] 10[]", but got "...10
2025-06-06T14:15:31.2506923Z [info] 10
2025-06-06T14:15:31.2507701Z [info] 10
2025-06-06T14:15:31.2509163Z [info] 10
2025-06-06T14:15:31.2510556Z [info] 10
2025-06-06T14:15:31.2515385Z [info] 10
2025-06-06T14:15:31.2515856Z [info] 10[
2025-06-06T14:15:31.2516281Z [info] 20
2025-06-06T14:15:31.2517702Z [info] 20
2025-06-06T14:15:31.2520319Z [info] 20
2025-06-06T14:15:31.2521616Z [info] 20
2025-06-06T14:15:31.2523696Z [info] 20
2025-06-06T14:15:31.2525531Z [info] 20
2025-06-06T14:15:31.2527524Z [info] 20
2025-06-06T14:15:31.2529429Z [info] 20
2025-06-06T14:15:31.2536478Z [info] 20
2025-06-06T14:15:31.2536985Z [info] 20]" Result did not match for query #2
2025-06-06T14:15:31.2538025Z [info] SELECT COUNT(*) OVER () FROM tenk1 WHERE unique2 < 10 (SQLQueryTestSuite.scala:663)
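As a reference for what the failing query should return: with an empty `OVER ()` clause the window covers every row that survives the `WHERE` filter, so each output row carries the same total count. The sketch below models that in plain Rust. The function name and the sample input are assumptions made for illustration; it does not use DataFusion or the real `tenk1` data.

```rust
/// Hypothetical model of `SELECT COUNT(*) OVER () FROM t WHERE unique2 < 10`.
/// Takes the `unique2` column and returns one count per surviving row.
fn count_star_over_empty_window(unique2: &[i64]) -> Vec<usize> {
    // Apply the WHERE clause first: only rows with unique2 < 10 remain.
    let kept = unique2.iter().filter(|&&v| v < 10).count();
    // An empty OVER () means one window spanning all remaining rows,
    // so every output row reports the same total.
    vec![kept; kept]
}
```

For an input whose `unique2` values are `0..20`, this returns ten entries, each equal to 10. The log above shows additional rows with the value 20, which this model would never produce.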
Likely cause:
- https://github.com/apache/datafusion/pull/16234
Revert PR:
- https://github.com/apache/datafusion/pull/16307
I did confirm that reverting https://github.com/apache/datafusion/pull/16234 fixes the issue.
We reverted the change in DF 48:
- https://github.com/apache/datafusion/pull/16307
We can focus on fixing it for real for DataFusion 49.0.0
FYI @suibianwanwank would you be willing to take this issue?
I also added this ticket to the list of things we need to do on DataFusion 49 prior to release
- https://github.com/apache/datafusion/issues/16235
> FYI @suibianwanwank would you be willing to take this issue?
Sure, I'd be happy to take a look. Things have been a bit busy on my end, but I’ll review it over the weekend.
Thank you @suibianwanwank