datafusion Panic when reading from partitioned datasets with columns that have `'` in them

Describe the bug

There is a bug when reading from partitioned tables that have commas in their names

Here is the test https://github.com/apache/arrow-datafusion/blob/b2a04519da97c2ff81789ef41dd652870794a73a/datafusion/sqllogictest/test_files/copy.slt#L109

To Reproduce

Run this script

-- create a table with quotes in the column names
create table test ("'test'" varchar, "'test2'" varchar, "'test3'" varchar);
insert into test VALUES ('a', 'x', 'aa'), ('b','y', 'bb'), ('c', 'z', 'cc');
copy test to '/tmp/escape_quote' (format csv, partition_by '''test2'',''test3''');

-- read back from the table
CREATE EXTERNAL TABLE validate_partitioned_escape_quote STORED AS CSV
LOCATION '/tmp/escape_quote/' PARTITIONED BY ("'test2'", "'test3'");

-- This panics
select * from validate_partitioned_escape_quote;

Here is an example:

❯ -- create a table with quotes in the column names
create table test ("'test'" varchar, "'test2'" varchar, "'test3'" varchar);
insert into test VALUES ('a', 'x', 'aa'), ('b','y', 'bb'), ('c', 'z', 'cc');
copy test to '/tmp/escape_quote' (format csv, partition_by '''test2'',''test3''');

0 rows in set. Query took 0.008 seconds.

+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.009 seconds.

+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.029 seconds.

❯ -- read back from the table
CREATE EXTERNAL TABLE validate_partitioned_escape_quote STORED AS CSV
LOCATION '/tmp/escape_quote/' PARTITIONED BY ("'test2'", "'test3'");

0 rows in set. Query took 0.004 seconds.

❯ -- This panics
select * from validate_partitioned_escape_quote;

thread 'thread 'tokio-runtime-workertokio-runtime-worker' panicked at ' panicked at /Users/andrewlamb/Software/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs/Users/andrewlamb/Software/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs::248:thread '54248:
:tokio-runtime-workerindex out of bounds: the len is 0 but the index is 054' panicked at
/Users/andrewlamb/Software/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs:248:
:index out of bounds: the len is 0 but the index is 054
:
index out of bounds: the len is 0 but the index is 0
stack backtrace:
   0: rust_begin_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
   2: core::panicking::panic_bounds_check
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:208:5
   3: datafusion::datasource::physical_plan::file_scan_config::PartitionColumnProjector::project
   4: <datafusion::datasource::physical_plan::file_stream::FileStream<F> as futures_core::stream::Stream>::poll_next
   5: datafusion_physical_plan::stream::RecordBatchReceiverStreamBuilder::run_input::{{closure}}
   6: tokio::runtime::task::core::Core<T,S>::poll
   7: tokio::runtime::task::harness::Harness<T,S>::poll
   8: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   9: tokio::runtime::scheduler::multi_thread::worker::Context::run
  10: tokio::runtime::context::runtime::enter_runtime
  11: tokio::runtime::scheduler::multi_thread::worker::run
  12: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  13: tokio::runtime::task::core::Core<T,S>::poll
  14: tokio::runtime::task::harness::Harness<T,S>::poll
  15: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Expected behavior

Note the data is written correctly

andrewlamb@Andrews-MacBook-Pro:~/Software/influxdb_iox$ find /tmp/escape_quote
/tmp/escape_quote
/tmp/escape_quote/'test2'=x
/tmp/escape_quote/'test2'=x/'test3'=aa
/tmp/escape_quote/'test2'=x/'test3'=aa/3zMw255TXFQxId14.csv
/tmp/escape_quote/'test2'=y
/tmp/escape_quote/'test2'=y/'test3'=bb
/tmp/escape_quote/'test2'=y/'test3'=bb/3zMw255TXFQxId14.csv
/tmp/escape_quote/'test2'=z
/tmp/escape_quote/'test2'=z/'test3'=cc
/tmp/escape_quote/'test2'=z/'test3'=cc/3zMw255TXFQxId14.csv

andrewlamb@Andrews-MacBook-Pro:~/Software/influxdb_iox$ cat /tmp/escape_quote/\'test2\'\=x/\'test3\'\=aa/3zMw255TXFQxId14.csv
'test'
a

Additional context

@devinjdangelo found this in https://github.com/apache/arrow-datafusion/pull/9240

Feb 19 '24 07:02 alamb

I tried searching documentation of various engines to see if ' is allowed in partition columns. I didn't find anything concrete.

However, I tried the equivalent example in DuckDB and it does work (see below). It would probably be best to tighten up our parsing of non standard column names. It may be easier to make this robust by extending sqlparser-rs upstream. The recent discussion on the mailing list is relevant https://lists.apache.org/thread/q80j49poyg99x2c01900312qz7ps9wgp

devinjd@devinjd$ ./duckdb 
v0.10.0 20b1486d11
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D create table test ("'test'" varchar, "'test2'" varchar, "'test3'" varchar);
D insert into test VALUES ('a', 'x', 'aa'), ('b','y', 'bb'), ('c', 'z', 'cc');
D select * from test;
┌─────────┬─────────┬─────────┐
│ 'test'  │ 'test2' │ 'test3' │
│ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┤
│ a       │ x       │ aa      │
│ b       │ y       │ bb      │
│ c       │ z       │ cc      │
└─────────┴─────────┴─────────┘
D copy test to '/tmp/escape_quote' (format csv, partition_by ('''test2''','''test3'''));
D select * from read_csv('/tmp/escape_quote/*/*/*.csv', hive_partitioning=1, header=true);
┌─────────┬─────────┬─────────┐
│ 'test'  │ 'test2' │ 'test3' │
│ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┤
│ a       │ x       │ aa      │
│ b       │ y       │ bb      │
│ c       │ z       │ cc      │
└─────────┴─────────┴─────────┘

Feb 19 '24 14:02 devinjdangelo

Another case:

DataFusion CLI v37.0.0
❯ create external table test123(a string, ```a=b``` string) stored as parquet location '/tmp/test123/' partitioned by (`a=b`);
0 row(s) fetched.
Elapsed 0.002 seconds.

❯ insert into test123 select 'a', 'b';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.007 seconds.

❯ select * from test123;
thread 'main' panicked at /home/jeffrey/Code/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs:261:54:
index out of bounds: the len is 0 but the index is 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
datafusion-cli$

Files exist on disk:

datafusion-cli$ ll /tmp/test123/
total 0
drwxr-xr-x   2 jeffrey jeffrey   60 Apr  5 20:43 '%60a=b%60=b'/
drwxrwxrwt 119 root    root    2.6K Apr  5 20:43  ../
drwxr-xr-x   3 jeffrey jeffrey   60 Apr  5 20:43  ./
datafusion-cli$ ll /tmp/test123/\%60a=b\%60=b/
total 4.0K
-rw-r--r-- 1 jeffrey jeffrey 282 Apr  5 20:43 33T7kTcVyecaVk07.parquet
drwxr-xr-x 3 jeffrey jeffrey  60 Apr  5 20:43 ../
drwxr-xr-x 2 jeffrey jeffrey  60 Apr  5 20:43 ./
datafusion-cli$

Note I had to get cheeky with the column and partition names, as just this didn't work:

DataFusion CLI v37.0.0
❯ create external table test123(a string, `a=b` string) stored as parquet location '/tmp/test123/' partitioned by (`a=b`);
Arrow error: Schema error: Unable to get field named "`a=b`". Valid fields: ["a", "a=b"]

Apr 05 '24 09:04 Jefffrey

datafusion datafusion copied to clipboard

Panic when reading from partitioned datasets with columns that have `'` in them

Describe the bug

To Reproduce

Expected behavior

Additional context

datafusion
datafusion copied to clipboard