datafusion
datafusion copied to clipboard
Overwritten Format Configs by CreateExternalTable Options
Describe the bug
https://github.com/apache/datafusion/blob/4bd7c137e0e205140e273a7c25824c94b457c660/datafusion/core/src/datasource/listing_table_factory.rs#L67-L73
These lines of ListingTableFactory.create()
overwrites the config options by cmd (CreateExternalTable options). Configs coming from OPTIONS('format...
are discarded silently.
To Reproduce
statement ok
CREATE EXTERNAL TABLE aggregate_simple (
c1 FLOAT NOT NULL,
c2 DOUBLE NOT NULL,
c3 BOOLEAN NOT NULL
)
STORED AS CSV
LOCATION '../core/tests/data/aggregate_simple.csv'
WITH HEADER ROW
query RRB
SELECT * FROM aggregate_simple LIMIT 3;
----
works as expected, but
statement ok
CREATE EXTERNAL TABLE aggregate_simple (
c1 FLOAT NOT NULL,
c2 DOUBLE NOT NULL,
c3 BOOLEAN NOT NULL
)
STORED AS CSV
LOCATION '../core/tests/data/aggregate_simple.csv'
OPTIONS('format.has_header' 'true')
query RRB
SELECT * FROM aggregate_simple LIMIT 3;
----
does not, since CREATE EXTERNAL TABLE
does not have WITH HEADER ROW
, which overwrites csv.has_header
to false
.
Expected behavior
The actual problem is that we can set the same settings by 2 different commands now, and one of them is silently chosen. Their default values are also different (CreateExternalTable's false, CsvOptions' true). We can set both of them as false initially. Then if one of them is true, then we expect a header. If WITH HEADER ROW
exists and 'format.has_header' 'false'
, we can give an error.
Additional context
No response
CreateExternalTable
has some CSV specific fields:
/// Whether the CSV file contains a header
pub has_header: bool,
/// Delimiter for CSV
pub delimiter: char,
After the new way of setting the configs https://github.com/apache/arrow-datafusion/pull/9382, I believe we should eliminate them now. CreateExternalTable
should have fields applying all type of formats. Format specific options are given with OPTIONS()
part.
Is there any way to retain the WITH HEADER ROW
syntax for backwards compatibility? For example, in the parser we could distinguish between WITH HEADER ROW
, WITH NO HEADER ROW
and not specified.
Is there any way to retain the
WITH HEADER ROW
syntax for backwards compatibility? For example, in the parser we could distinguish betweenWITH HEADER ROW
,WITH NO HEADER ROW
and not specified.
That might be a good idea. Converting has_header: bool
to has_header: Option<bool>
in both CsvOptions
and CreateExternalTable
will solve the problem, and sustain the backwards compatibility. We can sense the conflicting options and keep the default behavior.
Given how long WITH HEADER ROW
has been around, I think removing it now would be really disruptive on users
Sorry if my comments here were misinterpreted to mean that DELIMITER
and COMPRESSION
were not important to maintain. I think all the existing syntax should be maintained until we have a larger conversation about removing them
I am happy to help facilitate such a conversation if that would help