
to_parquet() generating a large number of files

Open ThiagoCM opened this issue 4 years ago • 3 comments

Hey guys, I'm currently trying to save a big DataFrame (shape: (191862718, 44)) to Parquet using Koalas' built-in to_parquet() function.

For some reason, the Parquet files are being saved with sizes from 1.8 MB to 2.8 MB, which results in more than 1,700 files. Whenever I try to read these files (also using Koalas) to build a DataFrame, I hit two bottlenecks:

  1. Reading a large number of files, which is not a best practice with Parquet
  2. Appending/concatenating these files into a single DataFrame that I'll be working with later

Although saving the Parquet files is quite fast (~18 min to save all files versus more than 1 hour to save the same number of CSV files), reading and appending each file into a single DataFrame (roughly as in the sketch below) takes far longer than simply using CSV.
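For reference, a minimal sketch of the read-and-concat pattern I mean (paths and variable names are placeholders, not my actual code):

```python
import databricks.koalas as ks

# Placeholder: list of the ~1,700 individual Parquet part files
file_paths = ["/data/output/part-0001.parquet", "/data/output/part-0002.parquet"]  # ...

# Reading each file separately and concatenating is the slow part
dfs = [ks.read_parquet(path) for path in file_paths]
df = ks.concat(dfs)
```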

How can I change the maximum Parquet file size (or some other configuration) using Koalas?

ThiagoCM avatar Feb 12 '21 15:02 ThiagoCM

You can use DataFrame.spark.coalesce or DataFrame.spark.repartition to control the number of output files.

df.spark.coalesce(num_files).to_parquet(...)

For reading, you don't need to append/concatenate the files yourself. Koalas will do it automatically if you specify the parent folder.
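A minimal end-to-end sketch of this, assuming `databricks.koalas` is imported as `ks`, `df` is the Koalas DataFrame from above, and `/data/output` is a placeholder directory:

```python
import databricks.koalas as ks

num_files = 8  # placeholder: target number of output files

# Write: coalesce (or repartition) before to_parquet to control
# how many Parquet files are produced
df.spark.coalesce(num_files).to_parquet("/data/output")

# Read: point read_parquet at the parent folder; Koalas reads all part
# files into a single DataFrame, no manual concatenation needed
df2 = ks.read_parquet("/data/output")
```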

ueshin avatar Feb 12 '21 18:02 ueshin

@ueshin that worked just fine, thank you.

As a suggestion for future development (if I may, and if this makes sense): could you add an option to the to_parquet() function to configure the number of files, similar to the num_files option in the to_csv() function?
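For comparison, a hedged sketch of what I mean, using the existing num_files option on to_csv(); the to_parquet() version is hypothetical and not currently supported:

```python
# Existing: to_csv() already accepts num_files to control the output file count
df.to_csv("/data/output_csv", num_files=8)

# Proposed (hypothetical, not in the current API): the same option on to_parquet()
# df.to_parquet("/data/output_parquet", num_files=8)
```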

ThiagoCM avatar Feb 12 '21 20:02 ThiagoCM

Glad to hear it worked fine.

Also thanks for the suggestion! Do you want to submit a PR to support it? I guess we should add it to to_orc as well. cc @HyukjinKwon

Thanks.

ueshin avatar Feb 12 '21 21:02 ueshin