save_to_disk() freezes when saving on s3 bucket with multiprocessing
Describe the bug
I'm trying to save a Dataset using the save_to_disk() function with:
num_proc > 1
dataset_path being an S3 bucket path, e.g. "s3://{bucket_name}/{dataset_folder}/"
The hf progress bar shows up but the saving does not seem to start.
When using one processor only (num_proc=1), everything works fine.
When saving the dataset on local disk (as opposed to s3 bucket) with num_proc > 1, everything works fine.
Thank you for your help! :)
Steps to reproduce the bug
I tried without any storage options:
from datasets import load_dataset

sandbox_ds = load_dataset("openai_humaneval")
sandbox_ds["test"].save_to_disk(
    "s3://bucket-name/test_multiprocessing_saving/",
    num_proc=4,
)
and with the specific s3fs storage options:
from datasets import load_dataset
from s3fs import S3FileSystem

def get_s3fs():
    return S3FileSystem()

sandbox_ds = load_dataset("openai_humaneval")
sandbox_ds["test"].save_to_disk(
    "s3://bucket-name/test_multiprocessing_saving/",
    num_proc=4,
    storage_options=get_s3fs().storage_options,  # also tried: storage_options=S3FileSystem().storage_options
)
I might be using the storage_options parameter incorrectly, but I didn't find anything online that made it work.
NB: Behavior is the same when trying to save the whole DatasetDict.
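Since saving to local disk with num_proc > 1 works, one possible workaround is to write the dataset locally in parallel and then copy the resulting folder to the bucket from a single process. The sketch below is untested against this exact setup and does not explain the freeze itself. The helper names save_locally_then_upload and upload_with_fsspec are hypothetical (not part of datasets), and it assumes fsspec and s3fs are installed and that the temporary directory has enough free space for the whole dataset.

```python
import tempfile
from pathlib import Path


def upload_with_fsspec(local_dir, remote_url, storage_options=None):
    """Copy the contents of local_dir to remote_url using fsspec (single process)."""
    import fsspec  # imported lazily so the helper below can be used with another uploader

    fs, remote_path = fsspec.core.url_to_fs(remote_url, **(storage_options or {}))
    # Trailing slash on the source: copy the directory's contents into remote_path
    # rather than nesting the directory itself under it.
    fs.put(local_dir.rstrip("/") + "/", remote_path, recursive=True)


def save_locally_then_upload(dataset, remote_url, num_proc=4, upload=upload_with_fsspec):
    """Hypothetical workaround: parallel save to a temp dir, then a serial upload."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_dir = str(Path(tmp_dir) / "dataset")
        # Multiprocessing is only used for the local write, which works as expected.
        dataset.save_to_disk(local_dir, num_proc=num_proc)
        # The upload runs in the parent process only.
        upload(local_dir, remote_url)


# Usage (assumes the same dataset and bucket path as in the snippets above):
# from datasets import load_dataset
# sandbox_ds = load_dataset("openai_humaneval")
# save_locally_then_upload(
#     sandbox_ds["test"],
#     "s3://bucket-name/test_multiprocessing_saving/",
#     num_proc=4,
# )
```

The upload step is passed in as a parameter so it can be swapped for another tool (for example a CLI sync) without changing the saving logic.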
Expected behavior
Progress bar fills in and saving is carried out.
Environment info
datasets==2.18.0
I got the same issue. Any updates so far for this issue?
Same here. Any updates?
+1, experiencing this as well