write to S3 is very slow
Environment
- OS: [Ubuntu 20.04]
- Hardware (GPU, or instance type): [H800]
To reproduce
I have a 2 GB jsonl.gz text file that I tokenized, storing the tokenized data as NumPy arrays. The writer is defined as below:
out = MDSWriter(
    columns={"input_ids": f"ndarray:int32:{args.seq_len}",
             "token_type_ids": f"ndarray:int8:{args.seq_len}",
             "attention_mask": f"ndarray:int8:{args.seq_len}",
             "special_tokens_mask": f"ndarray:int8:{args.seq_len}"},
    out=out_path,
    compression=None
)
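For reference, MDSWriter also accepts a (local, remote) pair for out, which stages shards on local disk and uploads them in the background. A minimal sketch of that variant follows; the /tmp/mds_cache staging directory is a made-up example, and out_path is assumed to be the S3 URI from the snippet above:

# Sketch only: same columns, but `out` given as a (local, remote) tuple so shards
# are written to local disk and uploaded to S3 in the background.
out = MDSWriter(
    columns={"input_ids": f"ndarray:int32:{args.seq_len}",
             "token_type_ids": f"ndarray:int8:{args.seq_len}",
             "attention_mask": f"ndarray:int8:{args.seq_len}",
             "special_tokens_mask": f"ndarray:int8:{args.seq_len}"},
    out=("/tmp/mds_cache", out_path),   # hypothetical local staging dir + remote S3 URI
    compression=None
)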
The tokenized data is pre-processed and loaded ahead of time, so no time is spent on tokenization here.
def parse_data_2_mds_format(tokenized_dataset):
    input_ids = np.array(tokenized_dataset['input_ids']).astype(np.int32)
    token_type_ids = np.array(tokenized_dataset['token_type_ids']).astype(np.int8)
    attention_mask = np.array(tokenized_dataset['attention_mask']).astype(np.int8)
    special_tokens_mask = np.array(tokenized_dataset['special_tokens_mask']).astype(np.int8)
    return {'input_ids': input_ids, 'token_type_ids': token_type_ids,
            'attention_mask': attention_mask, 'special_tokens_mask': special_tokens_mask}
with Pool(processes=args.mds_num_workers) as inner_pool:
    with tqdm(total=len(tokenized_datasets), desc="Writing Out MDS File") as pbar:
        for result in inner_pool.imap(parse_data_2_mds_format, tokenized_datasets):
            out.write(result)
            pbar.update()
out.finish()  # flush the last shard and write index.json
With the code above, writing to S3 takes 30 minutes with mds_num_workers set to 200; with 1 worker it takes an hour. That is very slow, and I have a huge amount of data to process. How can I accelerate this? Is it possible to write a block of data at once rather than sample by sample? Please give some suggestions for speeding it up.
Expected behavior
Additional context
Hey @charliedream1, have you tried the parallel dataset conversion approach as detailed in our docs below? https://docs.mosaicml.com/projects/streaming/en/stable/preparing_datasets/parallel_dataset_conversion.html
Please let us know if that works for you.
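For context, here is a minimal sketch of the parallel conversion flow that page describes, adapted to the column schema above; out_root, num_groups, seq_len, and the samples list are placeholders rather than values from this issue, and the exact recipe is in the linked docs:

# Sketch only: split the parsed samples into groups, write each group's shards into
# its own sub-directory with one process, then merge the per-group index files.
import os
from multiprocessing import Pool

from streaming import MDSWriter
from streaming.base.util import merge_index

out_root = '/tmp/mds_out'   # placeholder local root (or an s3:// prefix)
num_groups = 8              # placeholder: one writer process per group
seq_len = 2048              # placeholder for args.seq_len

columns = {'input_ids': f'ndarray:int32:{seq_len}',
           'token_type_ids': f'ndarray:int8:{seq_len}',
           'attention_mask': f'ndarray:int8:{seq_len}',
           'special_tokens_mask': f'ndarray:int8:{seq_len}'}

def write_group(task):
    # Each process owns one sub-directory, so writers never contend on a shard.
    group_id, group_samples = task
    sub_out = os.path.join(out_root, str(group_id))
    with MDSWriter(columns=columns, out=sub_out, compression=None) as writer:
        for sample in group_samples:
            writer.write(sample)

if __name__ == '__main__':
    samples = []  # placeholder: list of dicts as produced by parse_data_2_mds_format
    groups = [(i, samples[i::num_groups]) for i in range(num_groups)]
    with Pool(processes=num_groups) as pool:
        pool.map(write_group, groups)
    # Combine the sub-directory index.json files into one dataset-level index.
    merge_index(out_root, keep_local=True)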
I'm already using the parallel approach you just mentioned; please check my code and tell me whether I'm using it incorrectly or whether the speed is simply this slow.
By the way, copying 30 GB of files to S3 with a CLI command takes less than five minutes, yet with this library it takes half an hour to transmit the samples one by one, even in parallel. I have a huge amount of data; if the speed stays this slow I cannot use the library.
Please give some help.
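One workaround consistent with that observation, sketched here under the assumption that the AWS CLI is installed and configured (the bucket path is a placeholder), is to write the shards to local disk with MDSWriter and then push the finished directory to S3 in one bulk transfer:

# Sketch of a bulk-upload workaround: write shards locally, then sync the directory.
import subprocess

local_out = '/tmp/mds_out'             # local directory the shards were written to
s3_out = 's3://my-bucket/my-dataset'   # placeholder S3 destination

# ... write the dataset into local_out with MDSWriter as above ...

# The AWS CLI parallelizes multipart uploads internally, so one sync moves everything.
subprocess.run(['aws', 's3', 'sync', local_out, s3_out], check=True)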
I also tried writing the data to local disk, but the speed is the same. However, saving to disk with the datasets library takes only a few seconds.