datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Overwritten Format Configs by CreateExternalTable Options

Open berkaysynnada opened this issue 10 months ago • 5 comments

Describe the bug

https://github.com/apache/datafusion/blob/4bd7c137e0e205140e273a7c25824c94b457c660/datafusion/core/src/datasource/listing_table_factory.rs#L67-L73

These lines of ListingTableFactory.create() overwrites the config options by cmd (CreateExternalTable options). Configs coming from OPTIONS('format... are discarded silently.

To Reproduce

statement ok
CREATE EXTERNAL TABLE aggregate_simple (
  c1 FLOAT NOT NULL,
  c2 DOUBLE NOT NULL,
  c3 BOOLEAN NOT NULL
)
STORED AS CSV
LOCATION '../core/tests/data/aggregate_simple.csv'
WITH HEADER ROW

query RRB
SELECT * FROM aggregate_simple LIMIT 3;
----

works as expected, but

statement ok
CREATE EXTERNAL TABLE aggregate_simple (
  c1 FLOAT NOT NULL,
  c2 DOUBLE NOT NULL,
  c3 BOOLEAN NOT NULL
)
STORED AS CSV
LOCATION '../core/tests/data/aggregate_simple.csv'
OPTIONS('format.has_header' 'true')

query RRB
SELECT * FROM aggregate_simple LIMIT 3;
----

does not, since CREATE EXTERNAL TABLE does not have WITH HEADER ROW, which overwrites csv.has_header to false.

Expected behavior

The actual problem is that we can set the same settings by 2 different commands now, and one of them is silently chosen. Their default values are also different (CreateExternalTable's false, CsvOptions' true). We can set both of them as false initially. Then if one of them is true, then we expect a header. If WITH HEADER ROW exists and 'format.has_header' 'false', we can give an error.

Additional context

No response

berkaysynnada avatar Apr 04 '24 11:04 berkaysynnada

CreateExternalTable has some CSV specific fields:

/// Whether the CSV file contains a header
pub has_header: bool,
/// Delimiter for CSV
pub delimiter: char,

After the new way of setting the configs https://github.com/apache/arrow-datafusion/pull/9382, I believe we should eliminate them now. CreateExternalTable should have fields applying all type of formats. Format specific options are given with OPTIONS() part.

berkaysynnada avatar Apr 05 '24 10:04 berkaysynnada

Is there any way to retain the WITH HEADER ROW syntax for backwards compatibility? For example, in the parser we could distinguish between WITH HEADER ROW, WITH NO HEADER ROW and not specified.

alamb avatar Apr 06 '24 11:04 alamb

Is there any way to retain the WITH HEADER ROW syntax for backwards compatibility? For example, in the parser we could distinguish between WITH HEADER ROW, WITH NO HEADER ROW and not specified.

That might be a good idea. Converting has_header: bool to has_header: Option<bool> in both CsvOptions and CreateExternalTable will solve the problem, and sustain the backwards compatibility. We can sense the conflicting options and keep the default behavior.

berkaysynnada avatar Apr 15 '24 08:04 berkaysynnada

Given how long WITH HEADER ROW has been around, I think removing it now would be really disruptive on users

alamb avatar Apr 15 '24 10:04 alamb

Sorry if my comments here were misinterpreted to mean that DELIMITER and COMPRESSION were not important to maintain. I think all the existing syntax should be maintained until we have a larger conversation about removing them

I am happy to help facilitate such a conversation if that would help

alamb avatar May 07 '24 18:05 alamb