`COPY ... PARTITIONED BY` with parquet causes "out of bounds" panic
Describe the bug
While investigating #10709, I tried using the DataFusion CLI to rewrite parquet files to a better size.
But I got a panic:
thread 'tokio-runtime-worker' panicked at /Users/samuel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-38.0.0/src/datasource/file_format/write/demux.rs:381:31:
index out of bounds: the len is 1 but the index is 1
To Reproduce
I can't share the file, but we have some parquet data with project_id and day columns (here both are interpreted as strings). I ran the following:
RUST_BACKTRACE=1 datafusion-cli
DataFusion CLI v38.0.0
> CREATE EXTERNAL TABLE records
PARTITIONED BY (project_id, day) STORED AS PARQUET
LOCATION 'path/to/records/';
0 row(s) fetched.
Elapsed 0.015 seconds.
> COPY records to 'path/to/records-big/' partitioned by (project_id, day) stored as parquet;
thread 'tokio-runtime-worker' panicked at /Users/samuel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-38.0.0/src/datasource/file_format/write/demux.rs:381:31:
index out of bounds: the len is 1 but the index is 1
stack backtrace:
0: _rust_begin_unwind
1: core::panicking::panic_fmt
2: core::panicking::panic_bounds_check
3: datafusion::datasource::file_format::write::demux::compute_take_arrays
4: datafusion::datasource::file_format::write::demux::start_demuxer_task::{{closure}}
5: tokio::runtime::task::core::Core<T,S>::poll
6: tokio::runtime::task::harness::Harness<T,S>::poll
7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
8: tokio::runtime::scheduler::multi_thread::worker::Context::run
9: tokio::runtime::context::scoped::Scoped<T>::set
10: tokio::runtime::context::runtime::enter_runtime
11: tokio::runtime::scheduler::multi_thread::worker::run
12: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
13: tokio::runtime::task::core::Core<T,S>::poll
14: tokio::runtime::task::harness::Harness<T,S>::poll
15: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
If I remove `partitioned by (project_id, day)`, it finishes fine.
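For readers unfamiliar with this class of panic, here is a minimal std-only sketch of how the message "the len is 1 but the index is 1" can arise. It is not DataFusion's code and does not claim to identify the root cause. The function names and the per-column `Vec<String>` layout are hypothetical, chosen only to show a per-row loop indexing a collection that is shorter than the row count.

```rust
use std::panic;

/// Hypothetical stand-in for a per-row lookup into per-column partition values.
/// Unchecked indexing panics if any column has fewer entries than `row + 1`.
fn partition_key_unchecked(columns: &[Vec<String>], row: usize) -> Vec<String> {
    columns.iter().map(|vals| vals[row].clone()).collect()
}

/// Same lookup, but returns `None` instead of panicking when a column is too short.
fn partition_key_checked(columns: &[Vec<String>], row: usize) -> Option<Vec<String>> {
    columns.iter().map(|vals| vals.get(row).cloned()).collect()
}

fn main() {
    // Assumption for illustration: one column holds a single value,
    // while the caller believes the batch has two rows.
    let columns = vec![vec!["project-a".to_string()]];

    // Row 0 exists in every column, so both variants succeed.
    assert_eq!(
        partition_key_checked(&columns, 0),
        Some(vec!["project-a".to_string()])
    );

    // Row 1 is past the end: the checked variant reports it as `None`.
    assert_eq!(partition_key_checked(&columns, 1), None);

    // The unchecked variant panics with an index-out-of-bounds error.
    let result = panic::catch_unwind(|| partition_key_unchecked(&columns, 1));
    assert!(result.is_err());
}
```

Whether the values computed for the partition columns in `compute_take_arrays` actually end up shorter than the batch's row count would need to be confirmed against `demux.rs`. The sketch only shows the shape of failure that the panic message describes.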
Expected behavior
No response
Additional context
I also tried with v37.0.0 and got the same panic.