dask Output files with .csv file extension instead of .part when to_csv is used

The to_csv method outputs filenames with a .part extension by default. This post argues that to_csv should output CSV files with a .csv extension by default.

Let's create a DataFrame and write it out to a directory and see the default behavior.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(
    {"num1": [1, 2, 3, 4], "num2": [7, 8, 9, 10]},
)

ddf = dd.from_pandas(pdf, npartitions=2)

ddf.to_csv("./data/csv_simple")

Here are the files that are output:

csv_simple/
  0.part
  1.part

Downstream readers need to do something like dd.read_csv("./data/csv_simple/*.part") to read these CSV files into a DataFrame.

Here's the workaround to output the files with the .csv file extension: put a * in the path, which gets replaced by the partition number.

ddf.to_csv("./data/csv_simple2/whatever-*-hi.csv")

Here are the files that are output:

csv_simple2/
  whatever-0-hi.csv
  whatever-1-hi.csv

I don't believe there is a workaround to customize the file extension and avoid writing .part when you pass a plain directory path together with the name_function argument. Take a look at this example:

ddf.to_csv("./data/csv_simple3", name_function = lambda x: f"i-like-{x}.csv")

Here's what's output:

csv_simple3/
  i-like-0.csv.part
  i-like-1.csv.part
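
Until the writer itself changes, one stopgap is to rename the files after the write finishes. The following is a minimal sketch under stated assumptions: the output lives on a local filesystem, and `rename_part_files` is a hypothetical helper written for this illustration, not part of Dask or fsspec.

```python
from pathlib import Path


def rename_part_files(directory, extension=".csv"):
    """Rename every '*.part' file in `directory` so it ends with `extension`.

    Hypothetical helper for illustration; local filesystems only.
    Returns the list of new paths.
    """
    renamed = []
    for path in sorted(Path(directory).glob("*.part")):
        # Drop the trailing ".part" ("0.part" -> "0", "a.csv.part" -> "a.csv").
        stem = path.with_suffix("")
        # Only append the extension if the name does not already end with it.
        if stem.name.endswith(extension):
            target = stem
        else:
            target = stem.with_name(stem.name + extension)
        path.rename(target)
        renamed.append(target)
    return renamed
```

Calling `rename_part_files("./data/csv_simple3")` after the `to_csv` call above would turn `i-like-0.csv.part` into `i-like-0.csv`. This adds a second pass over the directory, so it is a patch rather than a fix for the default.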

I'd personally prefer for files to be written with a .csv extension by default. That'd be more intuitive for me.

I think it's also more consistent with the Parquet writers. For example, ddf.to_parquet("./data/parquet_simple") outputs files like part.0.parquet and part.1.parquet. Let me know what you think!

May 05 '22 23:05 MrPowers

Thanks for raising this @MrPowers. I agree using .csv seems more intuitive than .part. I'm not totally sure why .part was chosen originally (maybe @martindurant might know?). Also cc @rjzamora for thoughts

EDIT: If we end up making a change from .part -> .csv we'll need to think about impacts on existing user code / workflows

May 11 '22 15:05 jrbourbeau

I have no memory of it, no

May 11 '22 15:05 martindurant

I also find it curious that .part was originally chosen over .csv. It seems to me like we should move away from this, but I agree with @jrbourbeau that changing the default could cause pain for some users. Maybe as a first pass we could just add an explicit file_extension=".part" argument? Or if we do want to set the default to file_extension=".csv", we may want to add a warning on the read side if the user specifies ".part" in a glob pattern?

May 11 '22 15:05 rjzamora

Okay - @MrPowers and I spent some time looking into this today, and it seems that it is fsspec (and not Dask) that is currently deciding to add a ".part" file extension when the user passes to_csv a directory name. This is because to_csv currently uses fsspec's open_files function to expand a single-directory name into a list of open files. As far as I can tell, open_files does not provide a mechanism for the user to specify a different file extension.
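
To make the behavior concrete, here is a simplified model of that expansion, consistent with the outputs shown earlier in this thread. It is a sketch, not fsspec's actual code: `expand_paths` is a hypothetical name, and the default naming (plain `str(i)`) is an assumption that ignores any padding the real implementation may apply.

```python
import posixpath


def expand_paths(path, num, name_function=None):
    """Simplified, hypothetical model of the path expansion described above."""
    if path.count("*") > 1:
        raise ValueError("path may contain at most one '*'")
    if "*" not in path:
        # A bare directory name gets a default "*.part" template appended.
        path = posixpath.join(path, "*.part")
    if name_function is None:
        name_function = str  # assumption: partition index as plain text
    # The "*" is replaced by whatever name_function returns for each partition.
    return [path.replace("*", name_function(i)) for i in range(num)]


# Directory only: the ".part" suffix comes from the default template.
print(expand_paths("./data/csv_simple", 2))
# Directory plus name_function: the custom name still lands inside "*.part".
print(expand_paths("./data/csv_simple3", 2, lambda i: f"i-like-{i}.csv"))
# Explicit "*" plus name_function: no ".part" is appended in this model.
print(expand_paths("./data/csv_simple3/*.csv", 2, lambda i: f"i-like-{i}"))
```

If the real expansion behaves like this model, then passing a path that already contains a `*` together with name_function would avoid the .part suffix, but that combination has not been verified in this thread.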

@martindurant - Do you think it would be reasonable to add some kind of file-extension option to open_files (which would probably mean adding the same option to get_fs_token_paths, where the actual path expansion is performed)? Or do you think to_csv should stop using open_files, and simply define the expanded file names itself?

May 11 '22 17:05 rjzamora

here ? Yes, it's clearly something that could be optional and would make sense to surface. Seems like name_function was indeed meant for this - but doesn't fit the bill.

As stated above, though, the quick fix for dask is to add its own "*" into the path.
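
A rough sketch of what that quick fix could look like: a helper that adds a `*` template with the desired extension whenever the caller passes a bare directory. The name `with_extension_glob` and its `extension` parameter are hypothetical and do not exist in Dask or fsspec.

```python
import posixpath


def with_extension_glob(path, extension=".csv"):
    """Return `path` unchanged if it already has a '*', else append '*<extension>'.

    Hypothetical helper for illustration only.
    """
    if "*" in path:
        # The caller already chose a file name template; leave it alone.
        return path
    return posixpath.join(path, "*" + extension)


print(with_extension_glob("./data/csv_simple"))
print(with_extension_glob("./data/csv_simple2/whatever-*-hi.csv"))
```

With something like this applied inside to_csv before the path is handed off, `ddf.to_csv("./data/csv_simple")` would take the same route as the `*` workaround from the original post. Changing the default this way would still affect existing readers that glob for `*.part`.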

I am really surprised this hasn't arisen before.

May 11 '22 18:05 martindurant

Just updating this issue. There was an attempt at resolving this issue over in https://github.com/dask/dask/pull/9073. That PR was closed due to lack of developer bandwidth, but was a good start in case someone else wants to pick this issue up.

Aug 25 '22 17:08 jrbourbeau
