to_parquet() generating a large number of files
Hey guys, I'm currently trying to save a large DataFrame (shape: (191862718, 44)) to Parquet using Koalas' built-in to_parquet() function.
For some reason, my Parquet files are being saved with sizes from 1.8 MB to 2.8 MB, which results in more than 1,700 files. Whenever I try to read these files (also using Koalas) to build a DataFrame, I hit two bottlenecks:
- Reading a large number of files, which is not a best practice for Parquet
- Appending/concatenating these files into a single DataFrame that I'll be working with later
Although saving the Parquet files is quite fast (~18 min to save all files, versus more than 1 hour to save CSVs with the same number of files), reading and appending each file's contents into a single DataFrame takes far longer than simply using CSV.
How can I change the Parquet maximum file size (or another config) using Koalas?
You can use DataFrame.spark.coalesce or DataFrame.spark.repartition to control the number of output files.
df.spark.coalesce(num_files).to_parquet(...)
For reading, you don't need to append/concatenate the files yourself. Koalas will do it automatically if you point it at the parent folder.
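A minimal sketch of the full round trip based on the suggestion above; `num_files` and the output path are placeholders, and `df` is assumed to be the Koalas DataFrame from the question:

```python
import databricks.koalas as ks

# Reduce the number of partitions before writing so that fewer,
# larger Parquet files are produced (num_files and path are placeholders).
num_files = 64
df.spark.coalesce(num_files).to_parquet("/data/my_table", mode="overwrite")

# Reading back: point read_parquet at the parent folder and Koalas
# loads all part files into a single DataFrame automatically.
df_back = ks.read_parquet("/data/my_table")
```

Note that coalesce avoids a full shuffle but can only reduce the partition count, while repartition shuffles the data and can either increase or decrease it.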
@ueshin that worked just fine, thank you.
As a suggestion for future development (if I may, and if this makes sense): could you add an option to the to_parquet() function to configure the number of files, similar to the num_files option in the to_csv() function?
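For reference, a short sketch of the comparison; the to_csv call uses the existing num_files parameter, while the to_parquet form (kept as a comment) is only the proposed addition, not an existing parameter:

```python
# Existing API: to_csv already accepts num_files to control how many
# output files are written (path is a placeholder).
df.to_csv("/data/my_table_csv", num_files=64)

# The proposal would allow the same pattern for Parquet, e.g.
#   df.to_parquet("/data/my_table_parquet", num_files=64)
# which at the time of this thread still requires the
# coalesce/repartition workaround shown earlier.
```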
Glad to hear it worked fine.
Also thanks for the suggestion!
Do you want to submit a PR to support it?
I guess we should add it to to_orc as well.
cc @HyukjinKwon
Thanks.