Output files with .csv file extension instead of .part when to_csv is used
The to_csv method outputs filenames with a .part extension by default. This post argues that to_csv should output CSV files with a .csv extension by default.
Let's create a DataFrame and write it out to a directory and see the default behavior.
import pandas as pd
import dask.dataframe as dd
pdf = pd.DataFrame(
{"num1": [1, 2, 3, 4], "num2": [7, 8, 9, 10]},
)
ddf = dd.from_pandas(pdf, npartitions=2)
ddf.to_csv("./data/csv_simple")
Here are the files that are output:
csv_simple/
0.part
1.part
Downstream readers need to do something like dd.read_csv("./data/csv_simple/*.part") to read these CSV files into a DataFrame.
Here's the workaround to output the files with the .csv file extension:
ddf.to_csv("./data/csv_simple2/whatever-*-hi.csv")
Here are the files that are output:
csv_simple2/
whatever-0-hi.csv
whatever-1-hi.csv
I don't believe there is a workaround to customize the file extension and avoid writing .part when you use the name_function argument. Take a look at this example:
ddf.to_csv("./data/csv_simple3", name_function = lambda x: f"i-like-{x}.csv")
Here's what's output:
csv_simple3/
i-like-0.csv.part
i-like-1.csv.part
I'd personally prefer for files to be written with a .csv extension by default. That'd be more intuitive for me.
I think it's also more consistent with the Parquet writers. For example, ddf.to_parquet("./data/parquet_simple") outputs files like part.0.parquet and part.1.parquet. Let me know what you think!
Thanks for raising this @MrPowers. I agree using .csv seems more intuitive than .part. I'm not totally sure why .part was chosen originally (maybe @martindurant might know?). Also cc @rjzamora for thoughts
EDIT: If we end up making a change from .part -> .csv we'll need to think about impacts on existing user code / workflows
I have no memory of it, no
I also find it curious that .part was originally chosen over .csv. It seems to me like we should move away from this, but I agree with @jrbourbeau that changing the default could cause pain for some users. Maybe as a first pass we could just add an explicit file_extension=".part" argument? Or if we do want to set the default to file_extension=".csv", we may want to add a warning on the read side if the user specifies ".part" in a glob pattern?
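To make the file_extension idea concrete, here is a rough sketch of the two pieces. Nothing in it is existing Dask API: `partition_filenames` and `warn_if_part_glob` are hypothetical helper names, and the warning text and `FutureWarning` category are only assumptions about what a read-side warning might look like.

```python
import warnings


def partition_filenames(npartitions, file_extension=".part"):
    """Hypothetical: build one output file name per partition.

    The default keeps today's ".part" names; passing ".csv" opts in
    to the naming proposed in this issue.
    """
    return [f"{i}{file_extension}" for i in range(npartitions)]


def warn_if_part_glob(pattern):
    """Hypothetical read-side check: warn when a glob targets .part files."""
    if pattern.endswith(".part"):
        warnings.warn(
            "Reading '*.part' files; newer writers may emit '.csv' instead.",
            FutureWarning,
            stacklevel=2,
        )
        return True
    return False


print(partition_filenames(2))          # current default naming
print(partition_filenames(2, ".csv"))  # proposed naming
```

Keeping ".part" as the default in a first pass would leave existing workflows untouched, and the default could be flipped later once the read-side warning has been in place for a while.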
Okay - @MrPowers and I spent some time looking into this today, and it seems that it is fsspec (and not Dask) that is currently deciding to add a ".part" file extension when the user passes to_csv a directory name. This is because to_csv currently uses fsspec's open_files function to expand a single-directory name into a list of open files. As far as I can tell, open_files does not provide a mechanism for the user to specify a different file extension.
@martindurant - Do you think it would be reasonable to add some kind of file-extension option to open_files (which would probably mean adding the same option to get_fs_token_paths, where the actual path expansion is performed)? Or do you think to_csv should stop using open_files, and simply define the expanded file names itself?
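For anyone following along, here is a simplified stand-in for the expansion described above. It is not fsspec's source: `expand_paths` is a hypothetical name, and the logic is only my reading of the behavior seen in this thread (a bare directory gets a "*.part" template appended, then "*" is replaced by the output of name_function).

```python
import posixpath


def expand_paths(path, num, name_function=None):
    """Hypothetical mimic of how a write path becomes per-partition names."""
    if path.count("*") > 1:
        raise ValueError("path may contain at most one '*'")
    if "*" not in path:
        # A bare directory name gets a '*.part' template appended.
        path = posixpath.join(path, "*.part")
    if name_function is None:
        # Assumption: plain integers; the real default may zero-pad.
        name_function = str
    # name_function only fills the '*', so it cannot replace the suffix.
    return [path.replace("*", name_function(i)) for i in range(num)]


print(expand_paths("data/csv_simple", 2))
print(expand_paths("data/csv_simple2/whatever-*-hi.csv", 2))
print(expand_paths("data/csv_simple3", 2, lambda i: f"i-like-{i}.csv"))
```

Under these assumptions the third call reproduces the i-like-0.csv.part names from the original report: the ".part" suffix is baked into the template before name_function is ever applied.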
Here? Yes, it's clearly something that could be optional and would make sense to surface. Seems like name_function was indeed meant for this - but doesn't fit the bill.
As stated above, though, the quick fix for dask is to add its own "*" into the path.
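A minimal sketch of that quick fix, assuming it would run on the Dask side before the path is handed to fsspec. The helper name `ensure_csv_template` is made up for illustration and is not part of either library.

```python
import posixpath


def ensure_csv_template(path, file_extension=".csv"):
    """Hypothetical pre-processing step for a to_csv output path.

    If the caller passed a bare directory, append our own '*' template so
    the later expansion never falls back to the '.part' suffix.
    """
    if "*" in path:
        # The caller already chose a naming template; leave it alone.
        return path
    return posixpath.join(path, "*" + file_extension)


print(ensure_csv_template("./data/csv_simple"))
print(ensure_csv_template("./data/csv_simple2/whatever-*-hi.csv"))
```

The catch is the one raised earlier in the thread: this changes the default file names, so existing readers that glob for "*.part" would need updating.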
I am really surprised this hasn't arisen before.
Just updating this issue. There was an attempt at resolving this issue over in https://github.com/dask/dask/pull/9073. That PR was closed due to lack of developer bandwidth, but was a good start in case someone else wants to pick this issue up.