datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

DISCUSSION: remove `CREATE EXTERNAL TABLE` syntax: `DELIMITER`, `WITH HEADER ROW` and `COMPRESSION`

Open alamb opened this issue 1 year ago • 6 comments

Is your feature request related to a problem or challenge?

CREATE EXTERNAL TABLE (see docs) has several syntax items that only make sense for certain formats, specifically

CREATE [UNBOUNDED] EXTERNAL TABLE
[ IF NOT EXISTS ]
<TABLE_NAME>[ (<column_definition>) ]
STORED AS <file_type>
[ WITH HEADER ROW ]                             <-- only makes sense for CSV
[ DELIMITER <char> ]                            <-- only makes sense for CSV, JSON
[ COMPRESSION TYPE <GZIP | BZIP2 | XZ | ZSTD> ] <-- only makes sense for CSV, JSON
[ PARTITIONED BY (<column list>) ]
[ WITH ORDER (<ordered column list>) ]
[ OPTIONS (<key_value_list>) ]
LOCATION <literal>

Now, thanks to @metesynnada @devinjdangelo and others, we have a way to support arbitrary format specific options such as has_header:

CREATE EXTERNAL TABLE aggregate_simple (
  c1 FLOAT NOT NULL,
  c2 DOUBLE NOT NULL,
  c3 BOOLEAN NOT NULL
)
STORED AS CSV
LOCATION '../core/tests/data/aggregate_simple.csv'
OPTIONS('format.has_header' 'true')

However, the old inconsistent syntax is still supported, which causes additional code complexity as noted by @ozankabak in https://github.com/apache/datafusion/pull/10404#issuecomment-2099134059 and potential confusion with users (who don't know how to look for the options)

Describe the solution you'd like

  1. I think we should at least update the documentation so it sends people to the new preferred syntax (OPTIONS(..)).
  2. We should consider deprecating / removing the old CREATE TABLE syntax DELIMITER, WITH HEADER ROW and COMPRESSION

Describe alternatives you've considered

No response

Additional context

@metesynnada did some great work to get COPY to align to external table: https://github.com/apache/datafusion/pull/9604

@berkaysynnada has noticed issues related to inconsistency in the EXTERNAL TABLE syntax and config options https://github.com/apache/datafusion/issues/9945

alamb avatar May 07 '24 19:05 alamb

Thank you for creating this ticket to facilitate the discussion.

For the general audience: The good news is that we found a way to support the old syntax simultaneously with the new options syntax -- the bad news is that it results in code complexity/maintenance burden.

Basically, the aim of this ticket is to gather information on usage so that we can decide whether it is worth to have both syntaxes supported for a while -- or should we just pull the band-aid and remove the old syntax.

ozankabak avatar May 07 '24 20:05 ozankabak

I agree that it makes sense to "pull the band-aid" and remove the syntax from this code.

One place where I think this might get difficult is if an project that relies on DataFusion has some sort of user/external input that relies on this syntax, that the project itself doesn't have the ability to easily change. But the nice thing about DataFusion is its extensible, so I could imagine if someone has a strong requirement to keep the old syntax, we could write a custom parser that wraps DFParser (maybe lives in another repo, or in examples) that adds it back and just sets OPTIONS('format.has_header' 'true'). (See an example of a custom parser at https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_dialect.rs#L45)

phillipleblanc avatar May 08 '24 01:05 phillipleblanc

But the nice thing about DataFusion is its extensible, so I could imagine if someone has a strong requirement to keep the old syntax, we could write a custom parser that wraps DFParser (maybe lives in another repo, or in examples) that adds it back and just sets OPTIONS('format.has_header' 'true'). (See an example of a custom parser at https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_dialect.rs#L45)

Agreed, sounds like a reasonable way forward in case removing the old syntax creates challenges for certain users that can't change their SQL.

ozankabak avatar May 08 '24 07:05 ozankabak

IIRC, some of this original syntax came from a desire to suppor Hive SQL, but as @phillipleblanc said, if anyone needs this then they can add it back under a specific dialect.

andygrove avatar May 08 '24 12:05 andygrove

If there is consensus here, given we just branched for the 38 release, maybe we can remove support on main now as part of https://github.com/apache/datafusion/pull/10404 and let that sit on main for a while 🤔

alamb avatar May 08 '24 17:05 alamb

Sounds good. We will update the PR tomorrow 👍

ozankabak avatar May 08 '24 17:05 ozankabak