write to S3 is very slow
Environment
- OS: [Ubuntu 20.04]
- Hardware (GPU, or instance type): [H800]
To reproduce
I have a 2 GB jsonl.gz text file that I tokenized, storing the tokenized data as NumPy arrays. The writer is defined as below:
out = MDSWriter(
    columns={"input_ids": f"ndarray:int32:{args.seq_len}",
             "token_type_ids": f"ndarray:int8:{args.seq_len}",
             "attention_mask": f"ndarray:int8:{args.seq_len}",
             "special_tokens_mask": f"ndarray:int8:{args.seq_len}"},
    out=out_path,
    compression=None
)
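For reference, MDSWriter also accepts a (local, remote) pair for out, which stages shards on local disk and uploads them in the background. A minimal sketch of that variant follows; the /tmp/mds_cache staging directory is a made-up example, and out_path is assumed to be the S3 URI from the snippet above:

# Sketch only: same columns, but `out` given as a (local, remote) tuple so shards
# are written to local disk and uploaded to S3 in the background.
out = MDSWriter(
    columns={"input_ids": f"ndarray:int32:{args.seq_len}",
             "token_type_ids": f"ndarray:int8:{args.seq_len}",
             "attention_mask": f"ndarray:int8:{args.seq_len}",
             "special_tokens_mask": f"ndarray:int8:{args.seq_len}"},
    out=("/tmp/mds_cache", out_path),   # hypothetical local staging dir + remote S3 URI
    compression=None
)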
The tokenized data is pre-processed and loaded ahead of time, so no time is spent on tokenization here.
def parse_data_2_mds_format(tokenized_dataset):
    input_ids = np.array(tokenized_dataset['input_ids']).astype(np.int32)
    token_type_ids = np.array(tokenized_dataset['token_type_ids']).astype(np.int8)
    attention_mask = np.array(tokenized_dataset['attention_mask']).astype(np.int8)
    special_tokens_mask = np.array(tokenized_dataset['special_tokens_mask']).astype(np.int8)
    return {'input_ids': input_ids, 'token_type_ids': token_type_ids,
            'attention_mask': attention_mask, 'special_tokens_mask': special_tokens_mask}
with Pool(processes=args.mds_num_workers) as inner_pool:
    with tqdm(total=len(tokenized_datasets), desc="Writing Out MDS File") as pbar:
        for result in inner_pool.imap(parse_data_2_mds_format, tokenized_datasets):
            out.write(result)
            pbar.update()
out.finish()  # flush the last shard and write index.json
With the code above, writing to S3 takes 30 minutes with mds_num_workers set to 200; with 1 worker it takes an hour. That is very slow, and I have a huge amount of data to process. How can I accelerate this? Is it possible to write a block of data at once rather than sample by sample? Please give some suggestions for speeding it up.
Expected behavior
Additional context
Hey @charliedream1, have you tried the parallel dataset conversion approach as detailed in our docs below? https://docs.mosaicml.com/projects/streaming/en/stable/preparing_datasets/parallel_dataset_conversion.html
Please let us know if that works for you.
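For context, here is a minimal sketch of the parallel conversion flow that page describes, adapted to the column schema above; out_root, num_groups, seq_len, and the samples list are placeholders rather than values from this issue, and the exact recipe is in the linked docs:

# Sketch only: split the parsed samples into groups, write each group's shards into
# its own sub-directory with one process, then merge the per-group index files.
import os
from multiprocessing import Pool

from streaming import MDSWriter
from streaming.base.util import merge_index

out_root = '/tmp/mds_out'   # placeholder local root (or an s3:// prefix)
num_groups = 8              # placeholder: one writer process per group
seq_len = 2048              # placeholder for args.seq_len

columns = {'input_ids': f'ndarray:int32:{seq_len}',
           'token_type_ids': f'ndarray:int8:{seq_len}',
           'attention_mask': f'ndarray:int8:{seq_len}',
           'special_tokens_mask': f'ndarray:int8:{seq_len}'}

def write_group(task):
    # Each process owns one sub-directory, so writers never contend on a shard.
    group_id, group_samples = task
    sub_out = os.path.join(out_root, str(group_id))
    with MDSWriter(columns=columns, out=sub_out, compression=None) as writer:
        for sample in group_samples:
            writer.write(sample)

if __name__ == '__main__':
    samples = []  # placeholder: list of dicts as produced by parse_data_2_mds_format
    groups = [(i, samples[i::num_groups]) for i in range(num_groups)]
    with Pool(processes=num_groups) as pool:
        pool.map(write_group, groups)
    # Combine the sub-directory index.json files into one dataset-level index.
    merge_index(out_root, keep_local=True)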
I'm already using the parallel approach you just mentioned; please check my code and tell me whether I'm using it incorrectly or whether the speed is simply this slow.
By the way, copying 30 GB of files to S3 with a CLI command takes less than five minutes, yet with this library it takes half an hour to transmit the samples one by one, even in parallel. I have a huge amount of data; if the speed stays this slow I cannot use the library.
Please give some help.
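One workaround consistent with that observation, sketched here under the assumption that the AWS CLI is installed and configured (the bucket path is a placeholder), is to write the shards to local disk with MDSWriter and then push the finished directory to S3 in one bulk transfer:

# Sketch of a bulk-upload workaround: write shards locally, then sync the directory.
import subprocess

local_out = '/tmp/mds_out'             # local directory the shards were written to
s3_out = 's3://my-bucket/my-dataset'   # placeholder S3 destination

# ... write the dataset into local_out with MDSWriter as above ...

# The AWS CLI parallelizes multipart uploads internally, so one sync moves everything.
subprocess.run(['aws', 's3', 'sync', local_out, s3_out], check=True)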
I also tried writing the data to local disk, but the speed is the same. However, saving to disk with the datasets library takes only a few seconds.