sqlparser-rs Add support for DataFusion specific statements

Context

arrow-datafusion currently implements its own parser, DFParser which wraps the parser in this crate in order to parse some DataFusion specific statements.

DataFusion issue: https://github.com/apache/arrow-datafusion/issues/4808

Aiming to upstream this functionality into this crate to remove the parsing code from DataFusion.

Statements

Currently two custom statements that DataFusion parses: COPY TO ... and CREATE EXTERNAL TABLE ...

There is EXPLAIN too, but this is only there to support doing EXPLAIN for those new extensions

COPY TO

Syntax:

COPY <table_name> | (<query>)
TO '<destination>'
[ ( key1 value1, key2 value2) ]

Examples:

COPY lineitem TO '/path/to/lineitem.parquet' (format parquet, partitions 16);
COPY (SELECT * FROM lineitem) TO '/path/to/lineitem.parquet';

This is extremely similar to the COPY statement from PostgreSQL that sqlparser-rs already supports, with a few key differences:

Doesn't support optional WITH keyword
Only supports COPY TO and not COPY FROM
Doesn't support column list when source is table
Only supports target as string literal, not PROGRAM or STDOUT
Options list doesn't constrain keys to a defined set

Points 2, 3 & 4 are non-issues since can inspect the Statement AST to check if want to support this:

https://github.com/sqlparser-rs/sqlparser-rs/blob/a430d1a5a7bb04bbefd0f2fea07bf25c7fbce8b2/src/ast/mod.rs#L1463-L1479

Point 1 is a minor issue as the Statement AST above doesn't specify if there was a WITH keyword found when parsing, but at the same time DataFusion could just accept this new syntax since this optional keyword has minimal impact.

Point 5 is the major issue, as would either need to modify the fields of the existing Copy enum or add a new one specific for DataFusion, since it is a requirement that the keys cannot be constrained (would be parsed as String).

CREATE EXTERNAL TABLE

Syntax:

CREATE [ UNBOUNDED ] EXTERNAL TABLE
[ IF NOT EXISTS ]
<TABLE_NAME>[ (<column_definition>) ]
STORED AS <file_type>
[ WITH HEADER ROW ]
[ DELIMITER <char> ]
[ COMPRESSION TYPE <GZIP | BZIP2 | XZ | ZSTD> ]
[ PARTITIONED BY (<column list>) ]
[ WITH ORDER (<ordered column list>)
[ OPTIONS (<key_value_list>) ]
LOCATION <literal>

<column_definition> := (<column_name> <data_type>, ...)

<column_list> := (<column_name>, ...)

<ordered_column_list> := (<column_name> <sort_clause>, ...)

<key_value_list> := (<literal> <literal, <literal> <literal>, ...)

Example:

CREATE UNBOUNDED EXTERNAL TABLE IF NOT EXISTS
kumachan (c1 int)
STORED AS CSV
WITH HEADER ROW
DELIMITER ','
COMPRESSION TYPE zstd
PARTITIONED BY (c1)
WITH ORDER (c1 asc)
OPTIONS (
  'k1' 'v1',
  'k2' 'v2'
)
LOCATION '/file'

This seems to vary heavily from the existing support for CreateTable:

https://github.com/sqlparser-rs/sqlparser-rs/blob/a430d1a5a7bb04bbefd0f2fea07bf25c7fbce8b2/src/ast/mod.rs#L1561-L1604

Varies in the keyword parsing, such as WITH HEADER ROW and WITH ORDER

So might need a new statement for this? Or could try to retrofit onto the existing CreateTable statement.

Dialect

Also will need a new DataFusion dialect to support the above customization (e.g. to be able to toggle between previous/default behaviour for COPY TO to parse options as predefined keys, or as generic).

Alternative

Instead of adding/modifying as described, could investigate ways to make it easier for downstream consumers to parse their own custom statements.

I see there is this dialect function:

https://github.com/sqlparser-rs/sqlparser-rs/blob/a430d1a5a7bb04bbefd0f2fea07bf25c7fbce8b2/src/dialect/mod.rs#L171-L175

But this doesn't allow for custom statements.

I'm unsure what this could look like, but worth a thought.

Jan 02 '24 01:01 Jefffrey

I plan to take this on, probably starting first with COPY TO

Jan 02 '24 01:01 Jefffrey

Instead of adding/modifying as described, could investigate ways to make it easier for downstream consumers to parse their own custom statements.

I think this would be the best outcome -- if possible. I am not quite sure if we should be adding DataFusion specific things into this parser 🤔

Making it easier to extend this parser with custom hooks however sounds like a great idea to me

Jan 02 '24 17:01 alamb

Instead of adding/modifying as described, could investigate ways to make it easier for downstream consumers to parse their own custom statements.

I think this would be the best outcome -- if possible. I am not quite sure if we should be adding DataFusion specific things into this parser 🤔

Making it easier to extend this parser with custom hooks however sounds like a great idea to me

Would that suggest that https://github.com/apache/arrow-datafusion/issues/4808 would instead be focused on reducing the code of DFParser in DataFusion and not eliminate it entirely?

Jan 02 '24 21:01 Jefffrey

Would that suggest that apache/arrow-datafusion#4808 would instead be focused on reducing the code of DFParser in DataFusion and not eliminate it entirely?

I agree reducing the code is a better goal -- I also left some additional comments here: https://github.com/apache/arrow-datafusion/issues/4808#issuecomment-1874631631

Jan 02 '24 22:01 alamb