
save_to_disk() freezes when saving on s3 bucket with multiprocessing

Open ycattan opened this issue 1 year ago

Describe the bug

I'm trying to save a Dataset using the save_to_disk() function with:

  • num_proc > 1
  • dataset_path pointing to an S3 bucket, e.g. "s3://{bucket_name}/{dataset_folder}/"

The HF progress bar shows up, but the save never seems to start. With a single process (num_proc=1), everything works fine. Saving to local disk (as opposed to an S3 bucket) with num_proc > 1 also works fine.

Thank you for your help! :)

Steps to reproduce the bug

I tried without any storage options:

from datasets import load_dataset

sandbox_ds = load_dataset("openai_humaneval")
sandbox_ds["test"].save_to_disk(
    "s3://bucket-name/test_multiprocessing_saving/",
    num_proc=4,
)

and with the specific s3fs storage options:

from datasets import load_dataset
from s3fs import S3FileSystem

def get_s3fs():
    return S3FileSystem()

sandbox_ds = load_dataset("openai_humaneval")
sandbox_ds["test"].save_to_disk(
    "s3://bucket-name/test_multiprocessing_saving/",
    num_proc=4,
    storage_options=get_s3fs().storage_options, # also tried: storage_options=S3FileSystem().storage_options
)

I'm guessing I might be using the storage_options parameter incorrectly, but I couldn't find anything online that made it work.

NB: Behavior is the same when trying to save the whole DatasetDict.

Expected behavior

The progress bar fills in and the save completes.

Environment info

datasets==2.18.0

ycattan avatar May 30 '24 16:05 ycattan

I got the same issue. Any updates so far for this issue?

sfc-gh-ywei avatar Jul 22 '24 23:07 sfc-gh-ywei

Same here. Any updates?

solanovisitor avatar Nov 29 '24 21:11 solanovisitor

+1, experiencing this as well

grapefroot avatar Feb 06 '25 22:02 grapefroot