`COPY ... PARTITIONED BY` with parquet causes "out of bounds" panic

Open samuelcolvin opened this issue 1 year ago • 3 comments

Describe the bug

While investigating #10709, I tried using the DataFusion CLI to rewrite parquet files to a better size.

But I got a panic:

thread 'tokio-runtime-worker' panicked at /Users/samuel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-38.0.0/src/datasource/file_format/write/demux.rs:381:31:
index out of bounds: the len is 1 but the index is 1

To Reproduce

I can't share the file, but we have some parquet data with `project_id` and `day` columns (here both are interpreted as strings). I ran the following:

RUST_BACKTRACE=1 datafusion-cli
DataFusion CLI v38.0.0
> CREATE EXTERNAL TABLE records
PARTITIONED BY (project_id, day) STORED AS PARQUET
LOCATION 'path/to/records/';
0 row(s) fetched. 
Elapsed 0.015 seconds.

> COPY records to 'path/to/records-big/' partitioned by (project_id, day) stored as parquet;
thread 'tokio-runtime-worker' panicked at /Users/samuel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-38.0.0/src/datasource/file_format/write/demux.rs:381:31:
index out of bounds: the len is 1 but the index is 1
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic_bounds_check
   3: datafusion::datasource::file_format::write::demux::compute_take_arrays
   4: datafusion::datasource::file_format::write::demux::start_demuxer_task::{{closure}}
   5: tokio::runtime::task::core::Core<T,S>::poll
   6: tokio::runtime::task::harness::Harness<T,S>::poll
   7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   8: tokio::runtime::scheduler::multi_thread::worker::Context::run
   9: tokio::runtime::context::scoped::Scoped<T>::set
  10: tokio::runtime::context::runtime::enter_runtime
  11: tokio::runtime::scheduler::multi_thread::worker::run
  12: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  13: tokio::runtime::task::core::Core<T,S>::poll
  14: tokio::runtime::task::harness::Harness<T,S>::poll
  15: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

If I remove `partitioned by (project_id, day)`, it finishes fine.
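For reference, the message in the panic is Rust's standard bounds-check failure: something indexed a one-element collection at position 1. The snippet below is only a minimal sketch of that class of failure and is not taken from DataFusion's `demux.rs`; the helper names (`unchecked_lookup`, `checked_lookup`, `panic_message`) are hypothetical and exist purely for illustration. It shows that plain indexing produces the same message, while `get` returns `None` instead of panicking.

```rust
use std::panic;

/// Hypothetical helper: plain indexing, which panics when `idx >= values.len()`.
fn unchecked_lookup(values: &[usize], idx: usize) -> usize {
    values[idx]
}

/// Hypothetical helper: the same lookup via `get`, which returns `None`
/// for an out-of-range index instead of panicking.
fn checked_lookup(values: &[usize], idx: usize) -> Option<usize> {
    values.get(idx).copied()
}

/// Runs the unchecked lookup and returns the panic message, if it panicked.
/// The default panic hook still prints the panic to stderr.
fn panic_message(values: Vec<usize>, idx: usize) -> Option<String> {
    let result = panic::catch_unwind(move || unchecked_lookup(&values, idx));
    match result {
        Ok(_) => None,
        Err(payload) => payload
            .downcast_ref::<String>()
            .cloned()
            .or_else(|| payload.downcast_ref::<&str>().map(|s| s.to_string())),
    }
}

fn main() {
    // One element, but index 1 is requested (assumed shape of the failure).
    let one_element = vec![7];
    println!("checked:   {:?}", checked_lookup(&one_element, 1));
    println!("unchecked: {:?}", panic_message(one_element, 1));
}
```

Whether the actual fix should be a checked lookup or an earlier validation of the partition columns is a separate question; this only illustrates where such a message comes from.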

Expected behavior

No response

Additional context

I also tried with v37.0.0 and got the same panic.

samuelcolvin avatar May 29 '24 11:05 samuelcolvin