DISCUSSION: remove `CREATE EXTERNAL TABLE` syntax: `DELIMITER`, `WITH HEADER ROW` and `COMPRESSION`
Is your feature request related to a problem or challenge?
CREATE EXTERNAL TABLE (see docs) has several syntax items that only make sense for certain formats, specifically
CREATE [UNBOUNDED] EXTERNAL TABLE
[ IF NOT EXISTS ]
<TABLE_NAME>[ (<column_definition>) ]
STORED AS <file_type>
[ WITH HEADER ROW ] <-- only makes sense for CSV
[ DELIMITER <char> ] <-- only makes sense for CSV, JSON
[ COMPRESSION TYPE <GZIP | BZIP2 | XZ | ZSTD> ] <-- only makes sense for CSV, JSON
[ PARTITIONED BY (<column list>) ]
[ WITH ORDER (<ordered column list>) ]
[ OPTIONS (<key_value_list>) ]
LOCATION <literal>
Now, thanks to @metesynnada @devinjdangelo and others, we have a way to support arbitrary format specific options such as has_header:
CREATE EXTERNAL TABLE aggregate_simple (
c1 FLOAT NOT NULL,
c2 DOUBLE NOT NULL,
c3 BOOLEAN NOT NULL
)
STORED AS CSV
LOCATION '../core/tests/data/aggregate_simple.csv'
OPTIONS('format.has_header' 'true')
However, the old inconsistent syntax is still supported, which causes additional code complexity as noted by @ozankabak in https://github.com/apache/datafusion/pull/10404#issuecomment-2099134059 and potential confusion with users (who don't know how to look for the options)
Describe the solution you'd like
- I think we should at least update the documentation so it sends people to the new preferred syntax (
OPTIONS(..)). - We should consider deprecating / removing the old
CREATE TABLEsyntaxDELIMITER,WITH HEADER ROWandCOMPRESSION
Describe alternatives you've considered
No response
Additional context
@metesynnada did some great work to get COPY to align to external table: https://github.com/apache/datafusion/pull/9604
@berkaysynnada has noticed issues related to inconsistency in the EXTERNAL TABLE syntax and config options https://github.com/apache/datafusion/issues/9945
Thank you for creating this ticket to facilitate the discussion.
For the general audience: The good news is that we found a way to support the old syntax simultaneously with the new options syntax -- the bad news is that it results in code complexity/maintenance burden.
Basically, the aim of this ticket is to gather information on usage so that we can decide whether it is worth to have both syntaxes supported for a while -- or should we just pull the band-aid and remove the old syntax.
I agree that it makes sense to "pull the band-aid" and remove the syntax from this code.
One place where I think this might get difficult is if an project that relies on DataFusion has some sort of user/external input that relies on this syntax, that the project itself doesn't have the ability to easily change. But the nice thing about DataFusion is its extensible, so I could imagine if someone has a strong requirement to keep the old syntax, we could write a custom parser that wraps DFParser (maybe lives in another repo, or in examples) that adds it back and just sets OPTIONS('format.has_header' 'true'). (See an example of a custom parser at https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_dialect.rs#L45)
But the nice thing about DataFusion is its extensible, so I could imagine if someone has a strong requirement to keep the old syntax, we could write a custom parser that wraps DFParser (maybe lives in another repo, or in examples) that adds it back and just sets OPTIONS('format.has_header' 'true'). (See an example of a custom parser at https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_dialect.rs#L45)
Agreed, sounds like a reasonable way forward in case removing the old syntax creates challenges for certain users that can't change their SQL.
IIRC, some of this original syntax came from a desire to suppor Hive SQL, but as @phillipleblanc said, if anyone needs this then they can add it back under a specific dialect.
If there is consensus here, given we just branched for the 38 release, maybe we can remove support on main now as part of https://github.com/apache/datafusion/pull/10404 and let that sit on main for a while 🤔
Sounds good. We will update the PR tomorrow 👍