Add support for DataFusion specific statements
Context
arrow-datafusion currently implements its own parser, DFParser which wraps the parser in this crate in order to parse some DataFusion specific statements.
DataFusion issue: https://github.com/apache/arrow-datafusion/issues/4808
Aiming to upstream this functionality into this crate to remove the parsing code from DataFusion.
Statements
Currently two custom statements that DataFusion parses: COPY TO ... and CREATE EXTERNAL TABLE ...
- There is
EXPLAINtoo, but this is only there to support doingEXPLAINfor those new extensions
COPY TO
Syntax:
COPY <table_name> | (<query>)
TO '<destination>'
[ ( key1 value1, key2 value2) ]
Examples:
COPY lineitem TO '/path/to/lineitem.parquet' (format parquet, partitions 16);
COPY (SELECT * FROM lineitem) TO '/path/to/lineitem.parquet';
This is extremely similar to the COPY statement from PostgreSQL that sqlparser-rs already supports, with a few key differences:
- Doesn't support optional
WITHkeyword - Only supports
COPY TOand notCOPY FROM - Doesn't support column list when source is table
- Only supports target as string literal, not PROGRAM or STDOUT
- Options list doesn't constrain keys to a defined set
Points 2, 3 & 4 are non-issues since can inspect the Statement AST to check if want to support this:
https://github.com/sqlparser-rs/sqlparser-rs/blob/a430d1a5a7bb04bbefd0f2fea07bf25c7fbce8b2/src/ast/mod.rs#L1463-L1479
Point 1 is a minor issue as the Statement AST above doesn't specify if there was a WITH keyword found when parsing, but at the same time DataFusion could just accept this new syntax since this optional keyword has minimal impact.
Point 5 is the major issue, as would either need to modify the fields of the existing Copy enum or add a new one specific for DataFusion, since it is a requirement that the keys cannot be constrained (would be parsed as String).
CREATE EXTERNAL TABLE
Syntax:
CREATE [ UNBOUNDED ] EXTERNAL TABLE
[ IF NOT EXISTS ]
<TABLE_NAME>[ (<column_definition>) ]
STORED AS <file_type>
[ WITH HEADER ROW ]
[ DELIMITER <char> ]
[ COMPRESSION TYPE <GZIP | BZIP2 | XZ | ZSTD> ]
[ PARTITIONED BY (<column list>) ]
[ WITH ORDER (<ordered column list>)
[ OPTIONS (<key_value_list>) ]
LOCATION <literal>
<column_definition> := (<column_name> <data_type>, ...)
<column_list> := (<column_name>, ...)
<ordered_column_list> := (<column_name> <sort_clause>, ...)
<key_value_list> := (<literal> <literal, <literal> <literal>, ...)
Example:
CREATE UNBOUNDED EXTERNAL TABLE IF NOT EXISTS
kumachan (c1 int)
STORED AS CSV
WITH HEADER ROW
DELIMITER ','
COMPRESSION TYPE zstd
PARTITIONED BY (c1)
WITH ORDER (c1 asc)
OPTIONS (
'k1' 'v1',
'k2' 'v2'
)
LOCATION '/file'
This seems to vary heavily from the existing support for CreateTable:
https://github.com/sqlparser-rs/sqlparser-rs/blob/a430d1a5a7bb04bbefd0f2fea07bf25c7fbce8b2/src/ast/mod.rs#L1561-L1604
- Varies in the keyword parsing, such as
WITH HEADER ROWandWITH ORDER
So might need a new statement for this? Or could try to retrofit onto the existing CreateTable statement.
Dialect
Also will need a new DataFusion dialect to support the above customization (e.g. to be able to toggle between previous/default behaviour for COPY TO to parse options as predefined keys, or as generic).
Alternative
Instead of adding/modifying as described, could investigate ways to make it easier for downstream consumers to parse their own custom statements.
I see there is this dialect function:
https://github.com/sqlparser-rs/sqlparser-rs/blob/a430d1a5a7bb04bbefd0f2fea07bf25c7fbce8b2/src/dialect/mod.rs#L171-L175
But this doesn't allow for custom statements.
I'm unsure what this could look like, but worth a thought.
I plan to take this on, probably starting first with COPY TO
Instead of adding/modifying as described, could investigate ways to make it easier for downstream consumers to parse their own custom statements.
I think this would be the best outcome -- if possible. I am not quite sure if we should be adding DataFusion specific things into this parser 🤔
Making it easier to extend this parser with custom hooks however sounds like a great idea to me
Instead of adding/modifying as described, could investigate ways to make it easier for downstream consumers to parse their own custom statements.
I think this would be the best outcome -- if possible. I am not quite sure if we should be adding DataFusion specific things into this parser 🤔
Making it easier to extend this parser with custom hooks however sounds like a great idea to me
Would that suggest that https://github.com/apache/arrow-datafusion/issues/4808 would instead be focused on reducing the code of DFParser in DataFusion and not eliminate it entirely?
Would that suggest that apache/arrow-datafusion#4808 would instead be focused on reducing the code of
DFParserin DataFusion and not eliminate it entirely?
I agree reducing the code is a better goal -- I also left some additional comments here: https://github.com/apache/arrow-datafusion/issues/4808#issuecomment-1874631631