datafusion
datafusion copied to clipboard
Simplify Format Options
Which issue does this PR close?
Closes #9945.
Rationale for this change
https://github.com/apache/datafusion/blob/4bd7c137e0e205140e273a7c25824c94b457c660/datafusion/core/src/datasource/listing_table_factory.rs#L67-L73
These lines of ListingTableFactory.create()
overwrites the config options by cmd
(CreateExternalTable options). Configs coming from OPTIONS('format...
are discarded silently.
statement ok
CREATE EXTERNAL TABLE aggregate_simple (
c1 FLOAT NOT NULL,
c2 DOUBLE NOT NULL,
c3 BOOLEAN NOT NULL
)
STORED AS CSV
LOCATION '../core/tests/data/aggregate_simple.csv'
WITH HEADER ROW
query RRB
SELECT * FROM aggregate_simple LIMIT 3;
----
works as expected, but
statement ok
CREATE EXTERNAL TABLE aggregate_simple (
c1 FLOAT NOT NULL,
c2 DOUBLE NOT NULL,
c3 BOOLEAN NOT NULL
)
STORED AS CSV
LOCATION '../core/tests/data/aggregate_simple.csv'
OPTIONS('format.has_header' 'true')
query RRB
SELECT * FROM aggregate_simple LIMIT 3;
----
does not, since CREATE EXTERNAL TABLE
does not have WITH HEADER ROW
, which overwrites csv.has_header
to false
.
To summarize:
CreateExternalTable
has 3 format specific fields:
-
has_header
-
delimiter
-
file_compression_type
After https://github.com/apache/datafusion/pull/9382,CreateExternalTable.options
does store these format specific configurations.
What changes are included in this PR?
CreateExternalTable.delimiter
and CreateExternalTable.file_compression_type
are removed, and they are needed to be given from options, such as
OPTIONS(
'format.delimiter' ',',
'format.compression' 'gzip')
~For the has_header
config, as @alamb https://github.com/apache/datafusion/issues/9945#issuecomment-2056552026 suggests, we can sustain its usage as WITH HEADER ROW
from CreateExternalTable's scope. I refactor the code such that~
(WITH HEADER ROW) + (format.has_header = true) => OK with header (WITH HEADER ROW) + (format.has_header = false) => Error (WITH HEADER ROW) + (format.has_header = not specified) => OK with header () + (format.has_header = true) => OK with header () + (format.has_header = false) => Ok without header () + (format.has_header = not_specified) => Ok without header
~However in the long term, I think removing also WITH HEADER ROW
option will be a better design and help having more clear UX.~
WITH HEADER ROW is also removed to OPTIONS tab.
Are these changes tested?
Yes.
Are there any user-facing changes?
Users access format specific configurations (delimiter and compression type) via OPTIONS()
command.
Thanks for taking a look. Do you suggest having a similar mechanism for DELIMITER
and COMPRESSION
like we did for WITH HEADER ROW
? We can do that, but it will increase code complexity and we will need maintain that code until we finally remove them to arrive at the consistent syntax we aim.
I fear the longer we support the current syntax, the harder it will be to fix it. We may end up being stuck with CSV-related options living at the top level.
Start a new discussion / ticket about removing / deprecating the WITH HEADER ROW / COMPRESSION, DELIMITER syntax support.
Makes sense, let's do this
Thanks for taking a look. Do you suggest having a similar mechanism for DELIMITER and COMPRESSION like we did for WITH HEADER ROW?
Yes, that is my suggestion
We can do that, but it will increase code complexity and we will need maintain that code until we finally remove them to arrive at the consistent syntax we aim.
I agree it will make the code more complex
I fear the longer we support the current syntax, the harder it will be to fix it. We may end up being stuck with CSV-related options living at the top level.
I agree -- I will file a follow on ticket discuss removing the options
I agree -- I will file a follow on ticket discuss removing the options
Filed https://github.com/apache/datafusion/issues/10414 to discus removing these options
This PR is now ready for review. By employing that method, users can choose to maintain the existing behavior if desired.
I plan to review this again later today. Thanks @berkaysynnada and @ozankabak
I also updated this PR's title to reflect the user visible changes and added the api-change
label
Thank you. We will address your review feedback and then merge afterwards. 🚀
Thanks again for this work @berkaysynnada and the guidance @ozankabak
Hey, this changed the behavior of writing and reading CSVs to be slightly inconsistent: now writing defaults to not writing a header, while reading defaults to having a header. Previously both defaulted to having a header, ie. the write side has flipped.
I noticed as our test started failing - a simplified version below, the following worked previously, after this commit it fails:
#[tokio::test]
async fn write_then_read_csv() -> Result<()> {
let test_path = test_file_path("csv/");
let ctx = SessionContext::new();
let df = ctx.sql("SELECT 1 AS col").await?;
assert_eq!(df.to_owned().count().await?, 1);
df.write_csv(&test_path, DataFrameWriteOptions::default(), None).await?;
let df = ctx.read_csv(&test_path, CsvReadOptions::default()).await?;
assert_eq!(df.count().await?, 1);
Ok(())
}
but can be fixed by adding Some(CsvOptions::default().with_has_header(true))
to the write_csv (restoring old behavior), or alternatively changing read_csv to have CsvReadOptions::default().has_header(false)
to make it match the write.
Doesn't really matter to me how it works, though I find the mismatch somewhat surprising - but mainly I just wanted to flag this in case this was not expected/intentional change.
Hmm, this shouldn't be the case. The logic is supposed to be the following for both reads and writes:
- Use the explicitly specified header flag, if it exists.
- Otherwise, consult the session config and use the default value there.
If that is not so, we should open and issue and fix. Thanks for letting us know!
Doesn't really matter to me how it works, though I find the mismatch somewhat surprising - but mainly I just wanted to flag this in case this was not expected/intentional change.
Thanks for reporting that. It's surprising that this issue hasn't emerged before. CatalogOptions
's has_header value was false as default from the beginning, which is the write options in that case. However, reader's is true:
https://github.com/apache/datafusion/blob/09dde27be39ad054f85dfb5c37b7468a3f68d652/datafusion/core/src/datasource/file_format/options.rs#L90
I believe we should also set it as false while creating it.