datafusion
datafusion copied to clipboard
Panic when reading from partitioned datasets with columns that have `'` in them
Describe the bug
There is a bug when reading from partitioned tables that have commas in their names
Here is the test https://github.com/apache/arrow-datafusion/blob/b2a04519da97c2ff81789ef41dd652870794a73a/datafusion/sqllogictest/test_files/copy.slt#L109
To Reproduce
Run this script
-- create a table with quotes in the column names
create table test ("'test'" varchar, "'test2'" varchar, "'test3'" varchar);
insert into test VALUES ('a', 'x', 'aa'), ('b','y', 'bb'), ('c', 'z', 'cc');
copy test to '/tmp/escape_quote' (format csv, partition_by '''test2'',''test3''');
-- read back from the table
CREATE EXTERNAL TABLE validate_partitioned_escape_quote STORED AS CSV
LOCATION '/tmp/escape_quote/' PARTITIONED BY ("'test2'", "'test3'");
-- This panics
select * from validate_partitioned_escape_quote;
Here is an example:
❯ -- create a table with quotes in the column names
create table test ("'test'" varchar, "'test2'" varchar, "'test3'" varchar);
insert into test VALUES ('a', 'x', 'aa'), ('b','y', 'bb'), ('c', 'z', 'cc');
copy test to '/tmp/escape_quote' (format csv, partition_by '''test2'',''test3''');
0 rows in set. Query took 0.008 seconds.
+-------+
| count |
+-------+
| 3 |
+-------+
1 row in set. Query took 0.009 seconds.
+-------+
| count |
+-------+
| 3 |
+-------+
1 row in set. Query took 0.029 seconds.
❯ -- read back from the table
CREATE EXTERNAL TABLE validate_partitioned_escape_quote STORED AS CSV
LOCATION '/tmp/escape_quote/' PARTITIONED BY ("'test2'", "'test3'");
0 rows in set. Query took 0.004 seconds.
❯ -- This panics
select * from validate_partitioned_escape_quote;
thread 'thread 'tokio-runtime-workertokio-runtime-worker' panicked at ' panicked at /Users/andrewlamb/Software/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs/Users/andrewlamb/Software/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs::248:thread '54248:
:tokio-runtime-workerindex out of bounds: the len is 0 but the index is 054' panicked at
/Users/andrewlamb/Software/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs:248:
:index out of bounds: the len is 0 but the index is 054
:
index out of bounds: the len is 0 but the index is 0
stack backtrace:
0: rust_begin_unwind
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
1: core::panicking::panic_fmt
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
2: core::panicking::panic_bounds_check
at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:208:5
3: datafusion::datasource::physical_plan::file_scan_config::PartitionColumnProjector::project
4: <datafusion::datasource::physical_plan::file_stream::FileStream<F> as futures_core::stream::Stream>::poll_next
5: datafusion_physical_plan::stream::RecordBatchReceiverStreamBuilder::run_input::{{closure}}
6: tokio::runtime::task::core::Core<T,S>::poll
7: tokio::runtime::task::harness::Harness<T,S>::poll
8: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
9: tokio::runtime::scheduler::multi_thread::worker::Context::run
10: tokio::runtime::context::runtime::enter_runtime
11: tokio::runtime::scheduler::multi_thread::worker::run
12: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
13: tokio::runtime::task::core::Core<T,S>::poll
14: tokio::runtime::task::harness::Harness<T,S>::poll
15: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Expected behavior
Note the data is written correctly
andrewlamb@Andrews-MacBook-Pro:~/Software/influxdb_iox$ find /tmp/escape_quote
/tmp/escape_quote
/tmp/escape_quote/'test2'=x
/tmp/escape_quote/'test2'=x/'test3'=aa
/tmp/escape_quote/'test2'=x/'test3'=aa/3zMw255TXFQxId14.csv
/tmp/escape_quote/'test2'=y
/tmp/escape_quote/'test2'=y/'test3'=bb
/tmp/escape_quote/'test2'=y/'test3'=bb/3zMw255TXFQxId14.csv
/tmp/escape_quote/'test2'=z
/tmp/escape_quote/'test2'=z/'test3'=cc
/tmp/escape_quote/'test2'=z/'test3'=cc/3zMw255TXFQxId14.csv
andrewlamb@Andrews-MacBook-Pro:~/Software/influxdb_iox$ cat /tmp/escape_quote/\'test2\'\=x/\'test3\'\=aa/3zMw255TXFQxId14.csv
'test'
a
Additional context
@devinjdangelo found this in https://github.com/apache/arrow-datafusion/pull/9240
I tried searching documentation of various engines to see if ' is allowed in partition columns. I didn't find anything concrete.
However, I tried the equivalent example in DuckDB and it does work (see below). It would probably be best to tighten up our parsing of non standard column names. It may be easier to make this robust by extending sqlparser-rs upstream. The recent discussion on the mailing list is relevant https://lists.apache.org/thread/q80j49poyg99x2c01900312qz7ps9wgp
devinjd@devinjd$ ./duckdb
v0.10.0 20b1486d11
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D create table test ("'test'" varchar, "'test2'" varchar, "'test3'" varchar);
D insert into test VALUES ('a', 'x', 'aa'), ('b','y', 'bb'), ('c', 'z', 'cc');
D select * from test;
┌─────────┬─────────┬─────────┐
│ 'test' │ 'test2' │ 'test3' │
│ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┤
│ a │ x │ aa │
│ b │ y │ bb │
│ c │ z │ cc │
└─────────┴─────────┴─────────┘
D copy test to '/tmp/escape_quote' (format csv, partition_by ('''test2''','''test3'''));
D select * from read_csv('/tmp/escape_quote/*/*/*.csv', hive_partitioning=1, header=true);
┌─────────┬─────────┬─────────┐
│ 'test' │ 'test2' │ 'test3' │
│ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┤
│ a │ x │ aa │
│ b │ y │ bb │
│ c │ z │ cc │
└─────────┴─────────┴─────────┘
Another case:
DataFusion CLI v37.0.0
❯ create external table test123(a string, ```a=b``` string) stored as parquet location '/tmp/test123/' partitioned by (`a=b`);
0 row(s) fetched.
Elapsed 0.002 seconds.
❯ insert into test123 select 'a', 'b';
+-------+
| count |
+-------+
| 1 |
+-------+
1 row(s) fetched.
Elapsed 0.007 seconds.
❯ select * from test123;
thread 'main' panicked at /home/jeffrey/Code/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs:261:54:
index out of bounds: the len is 0 but the index is 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
datafusion-cli$
Files exist on disk:
datafusion-cli$ ll /tmp/test123/
total 0
drwxr-xr-x 2 jeffrey jeffrey 60 Apr 5 20:43 '%60a=b%60=b'/
drwxrwxrwt 119 root root 2.6K Apr 5 20:43 ../
drwxr-xr-x 3 jeffrey jeffrey 60 Apr 5 20:43 ./
datafusion-cli$ ll /tmp/test123/\%60a=b\%60=b/
total 4.0K
-rw-r--r-- 1 jeffrey jeffrey 282 Apr 5 20:43 33T7kTcVyecaVk07.parquet
drwxr-xr-x 3 jeffrey jeffrey 60 Apr 5 20:43 ../
drwxr-xr-x 2 jeffrey jeffrey 60 Apr 5 20:43 ./
datafusion-cli$
Note I had to get cheeky with the column and partition names, as just this didn't work:
DataFusion CLI v37.0.0
❯ create external table test123(a string, `a=b` string) stored as parquet location '/tmp/test123/' partitioned by (`a=b`);
Arrow error: Schema error: Unable to get field named "`a=b`". Valid fields: ["a", "a=b"]