datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Simplify Format Options

Open berkaysynnada opened this issue 9 months ago • 4 comments

Which issue does this PR close?

Closes #9945.

Rationale for this change

https://github.com/apache/datafusion/blob/4bd7c137e0e205140e273a7c25824c94b457c660/datafusion/core/src/datasource/listing_table_factory.rs#L67-L73

These lines of ListingTableFactory.create() overwrites the config options by cmd (CreateExternalTable options). Configs coming from OPTIONS('format... are discarded silently.

statement ok
CREATE EXTERNAL TABLE aggregate_simple (
  c1 FLOAT NOT NULL,
  c2 DOUBLE NOT NULL,
  c3 BOOLEAN NOT NULL
)
STORED AS CSV
LOCATION '../core/tests/data/aggregate_simple.csv'
WITH HEADER ROW

query RRB
SELECT * FROM aggregate_simple LIMIT 3;
----

works as expected, but

statement ok
CREATE EXTERNAL TABLE aggregate_simple (
  c1 FLOAT NOT NULL,
  c2 DOUBLE NOT NULL,
  c3 BOOLEAN NOT NULL
)
STORED AS CSV
LOCATION '../core/tests/data/aggregate_simple.csv'
OPTIONS('format.has_header' 'true')

query RRB
SELECT * FROM aggregate_simple LIMIT 3;
----

does not, since CREATE EXTERNAL TABLE does not have WITH HEADER ROW, which overwrites csv.has_header to false.

To summarize: CreateExternalTable has 3 format specific fields:

  1. has_header
  2. delimiter
  3. file_compression_type After https://github.com/apache/datafusion/pull/9382, CreateExternalTable.options does store these format specific configurations.

What changes are included in this PR?

CreateExternalTable.delimiter and CreateExternalTable.file_compression_type are removed, and they are needed to be given from options, such as

OPTIONS(
    'format.delimiter' ',',
    'format.compression' 'gzip')

~For the has_header config, as @alamb https://github.com/apache/datafusion/issues/9945#issuecomment-2056552026 suggests, we can sustain its usage as WITH HEADER ROW from CreateExternalTable's scope. I refactor the code such that~

(WITH HEADER ROW) + (format.has_header = true) => OK with header (WITH HEADER ROW) + (format.has_header = false) => Error (WITH HEADER ROW) + (format.has_header = not specified) => OK with header () + (format.has_header = true) => OK with header () + (format.has_header = false) => Ok without header () + (format.has_header = not_specified) => Ok without header

~However in the long term, I think removing also WITH HEADER ROW option will be a better design and help having more clear UX.~

WITH HEADER ROW is also removed to OPTIONS tab.

Are these changes tested?

Yes.

Are there any user-facing changes?

Users access format specific configurations (delimiter and compression type) via OPTIONS() command.

berkaysynnada avatar May 07 '24 11:05 berkaysynnada

Thanks for taking a look. Do you suggest having a similar mechanism for DELIMITER and COMPRESSION like we did for WITH HEADER ROW? We can do that, but it will increase code complexity and we will need maintain that code until we finally remove them to arrive at the consistent syntax we aim.

I fear the longer we support the current syntax, the harder it will be to fix it. We may end up being stuck with CSV-related options living at the top level.

Start a new discussion / ticket about removing / deprecating the WITH HEADER ROW / COMPRESSION, DELIMITER syntax support.

Makes sense, let's do this

ozankabak avatar May 07 '24 19:05 ozankabak

Thanks for taking a look. Do you suggest having a similar mechanism for DELIMITER and COMPRESSION like we did for WITH HEADER ROW?

Yes, that is my suggestion

We can do that, but it will increase code complexity and we will need maintain that code until we finally remove them to arrive at the consistent syntax we aim.

I agree it will make the code more complex

I fear the longer we support the current syntax, the harder it will be to fix it. We may end up being stuck with CSV-related options living at the top level.

I agree -- I will file a follow on ticket discuss removing the options

alamb avatar May 07 '24 19:05 alamb

I agree -- I will file a follow on ticket discuss removing the options

Filed https://github.com/apache/datafusion/issues/10414 to discus removing these options

alamb avatar May 07 '24 19:05 alamb

This PR is now ready for review. By employing that method, users can choose to maintain the existing behavior if desired.

berkaysynnada avatar May 09 '24 15:05 berkaysynnada

I plan to review this again later today. Thanks @berkaysynnada and @ozankabak

alamb avatar May 10 '24 14:05 alamb

I also updated this PR's title to reflect the user visible changes and added the api-change label

alamb avatar May 10 '24 19:05 alamb

Thank you. We will address your review feedback and then merge afterwards. 🚀

ozankabak avatar May 11 '24 16:05 ozankabak

Thanks again for this work @berkaysynnada and the guidance @ozankabak

alamb avatar May 13 '24 10:05 alamb

Hey, this changed the behavior of writing and reading CSVs to be slightly inconsistent: now writing defaults to not writing a header, while reading defaults to having a header. Previously both defaulted to having a header, ie. the write side has flipped.

I noticed as our test started failing - a simplified version below, the following worked previously, after this commit it fails:

#[tokio::test]
async fn write_then_read_csv() -> Result<()> {
    let test_path = test_file_path("csv/");

    let ctx = SessionContext::new();

    let df = ctx.sql("SELECT 1 AS col").await?;
    assert_eq!(df.to_owned().count().await?, 1);
    df.write_csv(&test_path, DataFrameWriteOptions::default(), None).await?;

    let df = ctx.read_csv(&test_path, CsvReadOptions::default()).await?;
    assert_eq!(df.count().await?, 1);

    Ok(())
}

but can be fixed by adding Some(CsvOptions::default().with_has_header(true)) to the write_csv (restoring old behavior), or alternatively changing read_csv to have CsvReadOptions::default().has_header(false) to make it match the write.

Doesn't really matter to me how it works, though I find the mismatch somewhat surprising - but mainly I just wanted to flag this in case this was not expected/intentional change.

Blizzara avatar May 31 '24 13:05 Blizzara

Hmm, this shouldn't be the case. The logic is supposed to be the following for both reads and writes:

  • Use the explicitly specified header flag, if it exists.
  • Otherwise, consult the session config and use the default value there.

If that is not so, we should open and issue and fix. Thanks for letting us know!

ozankabak avatar May 31 '24 13:05 ozankabak

Doesn't really matter to me how it works, though I find the mismatch somewhat surprising - but mainly I just wanted to flag this in case this was not expected/intentional change.

Thanks for reporting that. It's surprising that this issue hasn't emerged before. CatalogOptions's has_header value was false as default from the beginning, which is the write options in that case. However, reader's is true: https://github.com/apache/datafusion/blob/09dde27be39ad054f85dfb5c37b7468a3f68d652/datafusion/core/src/datasource/file_format/options.rs#L90

I believe we should also set it as false while creating it.

berkaysynnada avatar May 31 '24 13:05 berkaysynnada