
Support writing hive style partitioned files in `DataFrame::write` command

Open · alamb opened this issue Feb 15 '24 · 3 comments

Is your feature request related to a problem or challenge?

@Omega359 asked on discord: https://discord.com/channels/885562378132000778/1166447479609376850/1207458257874984970

Q: Is there a way to write out a dataframe to parquet with hive-style partitioning without having to create a table provider? I am pretty sure that a ListingTableProvider or a custom table provider will work but that seems like a ton of config for this
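
For context, the workaround the question alludes to is registering a ListingTable configured with partition columns and then writing through it. A rough sketch of that setup follows; the table name, output path, and partition column are illustrative, and the calls reflect my reading of the current ListingOptions / SessionContext APIs:

use std::sync::Arc;

use datafusion::arrow::datatypes::DataType;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::prelude::*;

async fn write_via_listing_table(ctx: &SessionContext, df: DataFrame) -> datafusion::error::Result<()> {
    // describe the on-disk layout: parquet files, hive-partitioned on `col_a`
    let options = ListingOptions::new(Arc::new(ParquetFormat::default()))
        .with_table_partition_cols(vec![("col_a".to_string(), DataType::Utf8)]);

    // register a listing table pointing at the output location
    ctx.register_listing_table(
        "my_table",
        "/tmp/my_table", // illustrative output path
        options,
        None, // in practice the table schema usually needs to be supplied here
        None, // no SQL definition
    )
    .await?;

    // finally, write the DataFrame through the registered table provider
    df.write_table("my_table", DataFrameWriteOptions::new()).await?;
    Ok(())
}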

Describe the solution you'd like

I would like to be able to use DataFrame::write_parquet and the other APIs to write partitioned files

I suggest adding the table_partition_cols from ListingOptions as one of the options on https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrameWriteOptions.html

So the way to specify partition information would be as described in ListingOptions::with_table_partition_cols.

So that would look something like

let options = DataFrameWriteOptions::new()
  .with_table_partition_cols(vec![
      ("col_a".to_string(), DataType::Utf8),
  ]);

// write the data frame to parquet
// producing files like
// /tmp/my_table/col_a=foo/12345.parquet (data with 'foo' in column a)
// ..
// /tmp/my_table/col_a=zoo/12345.parquet (data with 'zoo' in column a)
df.write_parquet("/tmp/my_table", options, None).await?;

Describe alternatives you've considered

No response

Additional context

Possibly related to https://github.com/apache/arrow-datafusion/issues/8493

alamb · Feb 15 '24 11:02

DataFrame::write_parquet and related methods use the COPY logical/physical plans under the hood, so if we knock out #8493 this ticket should come almost for free.
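
For reference, #8493 covers the SQL side of the same capability. A rough illustration of what that looks like through SessionContext::sql follows; hedging on syntax, since the COPY clause shown is the form accepted by more recent DataFusion releases and may not match exactly what was available at the time:

use datafusion::prelude::*;

async fn copy_partitioned(ctx: &SessionContext) -> datafusion::error::Result<()> {
    // COPY with hive-style partitioning on `col_a`; table name and path are illustrative
    ctx.sql("COPY my_table TO '/tmp/my_table' STORED AS PARQUET PARTITIONED BY (col_a)")
        .await?
        .collect()
        .await?;
    Ok(())
}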

devinjdangelo · Feb 15 '24 13:02

I went ahead and implemented this and #8493 in #9240. Let me know if it looks good to you @alamb.

devinjdangelo · Feb 15 '24 15:02

@devinjdangelo implemented the code in https://github.com/apache/arrow-datafusion/pull/9240

In order to close this ticket we just need to add test coverage for writing partitioned parquet in DataFrame::write_parquet

My suggestion is:

  1. Move the existing tests at https://github.com/apache/arrow-datafusion/blob/4d389c2590370d85bfe3af77f5243d5b40f5a222/datafusion/core/src/datasource/physical_plan/parquet/mod.rs#L2070 into the dataframe tests at https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/mod.rs
  2. Add a new test in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/mod.rs following the same model, to verify the parquet files were written:

The new test could basically do the same thing as the tests added in https://github.com/apache/arrow-datafusion/pull/9240/files#diff-b7d6c89870d082cac4ecc6de05f2ec393559327472fc4a846986f02c812f661fR34

  1. Write to a partitioned table
  2. Read back from the table to ensure all data went there
  3. Read back from one of the partitions to ensure the data was actually partitioned (a rough sketch follows below)
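
A rough sketch of what that round-trip test could look like (hedged: with_partition_by is my reading of the option added in #9240, and the output path and row counts are made up for illustration):

use std::sync::Arc;

use datafusion::arrow::array::{Int32Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::prelude::*;

#[tokio::test]
async fn roundtrip_partitioned_parquet() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // a small batch with a low-cardinality partition column
    let schema = Arc::new(Schema::new(vec![
        Field::new("col_a", DataType::Utf8, false),
        Field::new("value", DataType::Int32, false),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(StringArray::from(vec!["foo", "foo", "zoo"])),
            Arc::new(Int32Array::from(vec![1, 2, 3])),
        ],
    )?;
    let df = ctx.read_batch(batch)?;

    // 1. write to a partitioned table
    let options = DataFrameWriteOptions::new().with_partition_by(vec!["col_a".to_string()]);
    df.write_parquet("/tmp/partitioned_roundtrip", options, None)
        .await?;

    // 2. read back from the table root to ensure all data went there
    let total = ctx
        .read_parquet("/tmp/partitioned_roundtrip", ParquetReadOptions::default())
        .await?
        .count()
        .await?;
    assert_eq!(total, 3);

    // 3. read back one partition to ensure the data was actually split by `col_a`
    let foo_rows = ctx
        .read_parquet("/tmp/partitioned_roundtrip/col_a=foo", ParquetReadOptions::default())
        .await?
        .count()
        .await?;
    assert_eq!(foo_rows, 2);

    Ok(())
}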

alamb · Feb 19 '24 07:02