Support writing hive style partitioned files in `DataFrame::write` command
Is your feature request related to a problem or challenge?
@Omega359 asked on discord: https://discord.com/channels/885562378132000778/1166447479609376850/1207458257874984970
Q: Is there a way to write out a dataframe to parquet with hive-style partitioning without having to create a table provider? I am pretty sure that a ListingTableProvider or a custom table provider will work but that seems like a ton of config for this
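For context, the table-provider route mentioned in the question looks roughly like the sketch below. This is only an illustration under assumptions (the target path/schema can be inferred or provided, and names like `my_table` and `col_a` are made up), not a recommended recipe:

```rust
use std::sync::Arc;

use datafusion::arrow::datatypes::DataType;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::prelude::*;

// Sketch of the listing-table alternative: register a partitioned listing
// table and write the DataFrame into it. The table name, path, and partition
// column are illustrative, and the schema must be inferable (or passed explicitly).
async fn write_via_listing_table(
    ctx: &SessionContext,
    df: DataFrame,
) -> datafusion::error::Result<()> {
    let options = ListingOptions::new(Arc::new(ParquetFormat::default()))
        .with_table_partition_cols(vec![("col_a".to_string(), DataType::Utf8)]);
    ctx.register_listing_table("my_table", "/tmp/my_table", options, None, None)
        .await?;
    df.write_table("my_table", DataFrameWriteOptions::new()).await?;
    Ok(())
}
```

That extra setup is the "ton of config" the question is referring to.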
Describe the solution you'd like
I would like to be able to use DataFrame::write_parquet and the other write APIs to write partitioned files.
I suggest adding the table_partition_cols option from ListingOptions as one of the options on https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrameWriteOptions.html, so the way to specify partition information would be as described in ListingOptions::with_table_partition_cols.
So that would look something like:

    let options = DataFrameWriteOptions::new()
        .with_table_partition_cols(vec![
            ("col_a".to_string(), DataType::Utf8),
        ]);

    // write the data frame to parquet, producing files like
    // /tmp/my_table/col_a=foo/12345.parquet (data with 'foo' in col_a)
    // ..
    // /tmp/my_table/col_a=zoo/12345.parquet (data with 'zoo' in col_a)
    df.write_parquet("/tmp/my_table", options, None).await?;
Describe alternatives you've considered
No response
Additional context
Possibly related to https://github.com/apache/arrow-datafusion/issues/8493
DataFrame::write_parquet and related methods use the COPY logical/physical plans under the hood, so if we knock out #8493 this ticket should come almost for free.
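For reference, the SQL-level form that #8493 tracks would look roughly like the statement below. This is only a hedged sketch: the exact PARTITIONED BY / option syntax depends on how #8493 lands and on the DataFusion version, and `my_table`, the output path, and `col_a` are illustrative names:

```rust
use datafusion::prelude::*;

// Run the COPY statement through SQL; DataFrame::write_parquet would plan to
// the same COPY logical/physical plan under the hood.
async fn copy_partitioned(ctx: &SessionContext) -> datafusion::error::Result<()> {
    ctx.sql(
        "COPY my_table TO '/tmp/my_table' \
         STORED AS PARQUET PARTITIONED BY (col_a)",
    )
    .await?
    .collect()
    .await?; // executing the plan performs the write and reports the row count
    Ok(())
}
```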
I went ahead and implemented this and #8493 in #9240. Let me know if it looks good to you @alamb.
@devinjdangelo implemented the code in https://github.com/apache/arrow-datafusion/pull/9240
In order to close this ticket we just need to add test coverage for writing partitioned parquet in DataFrame::write_parquet
My suggestion is:
- Move the existing tests at https://github.com/apache/arrow-datafusion/blob/4d389c2590370d85bfe3af77f5243d5b40f5a222/datafusion/core/src/datasource/physical_plan/parquet/mod.rs#L2070 into the dataframe tests in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/mod.rs
- Add a new test in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/mod.rs following the same model, to verify the parquet files were written. The new test could basically do the same thing as the tests added in https://github.com/apache/arrow-datafusion/pull/9240/files#diff-b7d6c89870d082cac4ecc6de05f2ec393559327472fc4a846986f02c812f661fR34 (a sketch follows after this list):
  - Write to a partitioned table
  - Read back from the table to ensure all data went there
  - Read back from one of the partitions to ensure the data was actually partitioned
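As a concrete starting point, a test along those lines could look roughly like the sketch below. It is only an illustration: the test name is made up, and the with_table_partition_cols option on DataFrameWriteOptions is taken from the proposal above, so the final method name/signature may differ from what #9240 actually merged:

```rust
use datafusion::arrow::datatypes::DataType;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::prelude::*;

// Sketch of a round-trip test for partitioned DataFrame::write_parquet.
#[tokio::test]
async fn write_partitioned_parquet_roundtrip() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    let tmp_dir = tempfile::tempdir()?;
    let out_path = tmp_dir.path().join("my_table");
    let out_str = out_path.to_str().unwrap();

    // A tiny DataFrame with a string partition column `col_a`
    let df = ctx
        .sql(
            "SELECT column1 AS col_a, column2 AS col_b \
             FROM (VALUES ('foo', 1), ('foo', 2), ('zoo', 3))",
        )
        .await?;

    // Write it out partitioned by `col_a` (option name per the proposal above)
    let options = DataFrameWriteOptions::new()
        .with_table_partition_cols(vec![("col_a".to_string(), DataType::Utf8)]);
    df.write_parquet(out_str, options, None).await?;

    // 1. Read the whole output directory back: all rows should be present
    let total = ctx
        .read_parquet(out_str, ParquetReadOptions::default())
        .await?
        .count()
        .await?;
    assert_eq!(total, 3);

    // 2. Read a single partition directory: only the matching rows should be there
    let one_partition = ctx
        .read_parquet(
            format!("{out_str}/col_a=zoo/"),
            ParquetReadOptions::default(),
        )
        .await?
        .count()
        .await?;
    assert_eq!(one_partition, 1);

    Ok(())
}
```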