
Support writing hive style partitioned files in `DataFrame::write` command

Open · alamb opened this issue Feb 15 '24 · 3 comments

Is your feature request related to a problem or challenge?

@Omega359 asked on discord: https://discord.com/channels/885562378132000778/1166447479609376850/1207458257874984970

Q: Is there a way to write out a dataframe to parquet with hive-style partitioning without having to create a table provider? I am pretty sure that a ListingTableProvider or a custom table provider will work but that seems like a ton of config for this
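
For context, the workaround the question alludes to is registering a ListingTable configured with partition columns and then writing through it. A rough sketch of that setup follows; the table name, output path, and partition column are illustrative, and the calls reflect my reading of the current ListingOptions / SessionContext APIs:

use std::sync::Arc;

use datafusion::arrow::datatypes::DataType;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::prelude::*;

async fn write_via_listing_table(ctx: &SessionContext, df: DataFrame) -> datafusion::error::Result<()> {
    // describe the on-disk layout: parquet files, hive-partitioned on `col_a`
    let options = ListingOptions::new(Arc::new(ParquetFormat::default()))
        .with_table_partition_cols(vec![("col_a".to_string(), DataType::Utf8)]);

    // register a listing table pointing at the output location
    ctx.register_listing_table(
        "my_table",
        "/tmp/my_table", // illustrative output path
        options,
        None, // in practice the table schema usually needs to be supplied here
        None, // no SQL definition
    )
    .await?;

    // finally, write the DataFrame through the registered table provider
    df.write_table("my_table", DataFrameWriteOptions::new()).await?;
    Ok(())
}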

Describe the solution you'd like

I would like to be able to use DataFrame::write_parquet and the other APIs to write partitioned files

I suggest adding the table_partition_cols from ListingOptions as one of the options on https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrameWriteOptions.html

So the way to specify partition information would be as described in ListingOptions::with_table_partition_cols.

So that would look something like

let options = DataFrameWriteOptions::new()
  .with_table_partition_cols(vec![
      ("col_a".to_string(), DataType::Utf8),
  ]);

// write the data frame to parquet
// producing files like
// /tmp/my_table/col_a=foo/12345.parquet (data with 'foo' in column a)
// ..
// /tmp/my_table/col_a=zoo/12345.parquet (data with 'zoo' in column a)
df.write_parquet("/tmp/my_table", options, None).await?;

Describe alternatives you've considered

No response

Additional context

Possibly related to https://github.com/apache/arrow-datafusion/issues/8493

alamb · Feb 15 '24 11:02

DataFrame::write_parquet and related methods use the COPY logical/physical plans under the hood, so if we knock out #8493 this ticket should come almost for free.
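
For reference, #8493 covers the SQL side of the same capability. A rough illustration of what that looks like through SessionContext::sql follows; hedging on syntax, since the COPY clause shown is the form accepted by more recent DataFusion releases and may not match exactly what was available at the time:

use datafusion::prelude::*;

async fn copy_partitioned(ctx: &SessionContext) -> datafusion::error::Result<()> {
    // COPY with hive-style partitioning on `col_a`; table name and path are illustrative
    ctx.sql("COPY my_table TO '/tmp/my_table' STORED AS PARQUET PARTITIONED BY (col_a)")
        .await?
        .collect()
        .await?;
    Ok(())
}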

devinjdangelo · Feb 15 '24 13:02

I went ahead and implemented this and #8493 in #9240. Let me know if it looks good to you @alamb.

devinjdangelo · Feb 15 '24 15:02

@devinjdangelo implemented the code in https://github.com/apache/arrow-datafusion/pull/9240

In order to close this ticket we just need to add test coverage for writing partitioned parquet in DataFrame::write_parquet

My suggestion is:

  1. Move the existing tests at https://github.com/apache/arrow-datafusion/blob/4d389c2590370d85bfe3af77f5243d5b40f5a222/datafusion/core/src/datasource/physical_plan/parquet/mod.rs#L2070 into the dataframe tests at https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/mod.rs
  2. Add a new test in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/mod.rs following the same model, to verify the parquet files were written:

The new test could basically do the same thing as the tests added in https://github.com/apache/arrow-datafusion/pull/9240/files#diff-b7d6c89870d082cac4ecc6de05f2ec393559327472fc4a846986f02c812f661fR34

  1. Write to a partitioned table
  2. Read back from the table to ensure all data went there
  3. Read back from one of the partitions to ensure the data was actually partitioned (a rough sketch follows below)
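
A rough sketch of what that round-trip test could look like (hedged: with_partition_by is my reading of the option added in #9240, and the output path and row counts are made up for illustration):

use std::sync::Arc;

use datafusion::arrow::array::{Int32Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::prelude::*;

#[tokio::test]
async fn roundtrip_partitioned_parquet() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // a small batch with a low-cardinality partition column
    let schema = Arc::new(Schema::new(vec![
        Field::new("col_a", DataType::Utf8, false),
        Field::new("value", DataType::Int32, false),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(StringArray::from(vec!["foo", "foo", "zoo"])),
            Arc::new(Int32Array::from(vec![1, 2, 3])),
        ],
    )?;
    let df = ctx.read_batch(batch)?;

    // 1. write to a partitioned table
    let options = DataFrameWriteOptions::new().with_partition_by(vec!["col_a".to_string()]);
    df.write_parquet("/tmp/partitioned_roundtrip", options, None)
        .await?;

    // 2. read back from the table root to ensure all data went there
    let total = ctx
        .read_parquet("/tmp/partitioned_roundtrip", ParquetReadOptions::default())
        .await?
        .count()
        .await?;
    assert_eq!(total, 3);

    // 3. read back one partition to ensure the data was actually split by `col_a`
    let foo_rows = ctx
        .read_parquet("/tmp/partitioned_roundtrip/col_a=foo", ParquetReadOptions::default())
        .await?
        .count()
        .await?;
    assert_eq!(foo_rows, 2);

    Ok(())
}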

alamb · Feb 19 '24 07:02