DeepSpeedExamples
Question: Why not pad to the same sequence length within the batch during the SFT training phase?
Question: In the SFT training phase in DeepSpeed-Chat (dschat), I found that the function create_dataset_split in data_utils.py pads every sample to max_seq_len. Why not dynamically pad to the maximum sample length within each batch during training instead? That could significantly speed up training (see the sketch after the excerpt below).
if chosen_sentence is not None:
    chosen_sentence += end_of_conversation_token
    chosen_token = tokenizer(chosen_sentence,
                             max_length=max_seq_len,
                             padding="max_length",
                             truncation=True,
                             return_tensors="pt")
    chosen_token["input_ids"] = chosen_token["input_ids"].squeeze(0)
    chosen_token["attention_mask"] = chosen_token["attention_mask"].squeeze(0)
    chosen_dataset.append(chosen_token)
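
For comparison, dynamic per-batch padding is typically done with a custom collate_fn passed to the DataLoader, so each batch is padded only to its own longest sample. The following is a minimal sketch, not the DeepSpeed-Chat implementation: it assumes samples were tokenized without padding="max_length" and carry 1-D input_ids / attention_mask tensors, and the name dynamic_padding_collate is made up for illustration.

import torch

def dynamic_padding_collate(batch, pad_token_id):
    # batch: list of dicts with 1-D "input_ids" / "attention_mask" tensors
    max_len = max(item["input_ids"].size(0) for item in batch)
    input_ids, attention_mask = [], []
    for item in batch:
        pad_len = max_len - item["input_ids"].size(0)
        # right-pad input_ids with pad_token_id and attention_mask with 0
        input_ids.append(
            torch.nn.functional.pad(item["input_ids"], (0, pad_len), value=pad_token_id))
        attention_mask.append(
            torch.nn.functional.pad(item["attention_mask"], (0, pad_len), value=0))
    return {"input_ids": torch.stack(input_ids),
            "attention_mask": torch.stack(attention_mask)}

Usage would be something like DataLoader(train_dataset, batch_size=8, collate_fn=lambda b: dynamic_padding_collate(b, tokenizer.pad_token_id)). Note that per-batch padding only helps if batches are reasonably length-homogeneous (e.g. with length-grouped sampling); otherwise a single long sample still forces the whole batch close to max_seq_len.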