
Question: Why not pad to the same sequence length within the batch during the SFT training phase?

LKLKyy opened this issue on Jan 12, 2024 · 0 comments

Question: In the SFT training phase of dschat, I found that the function create_dataset_split in data_utils.py pads every sample to the maximum sequence length. Why not instead dynamically pad to the longest sample within each batch during training? That could significantly speed up training. The relevant code is:

```python
if chosen_sentence is not None:
    chosen_sentence += end_of_conversation_token
    chosen_token = tokenizer(chosen_sentence,
                             max_length=max_seq_len,
                             padding="max_length",
                             truncation=True,
                             return_tensors="pt")
    chosen_token["input_ids"] = chosen_token["input_ids"].squeeze(0)
    chosen_token["attention_mask"] = chosen_token["attention_mask"].squeeze(0)
    chosen_dataset.append(chosen_token)
```
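For reference, the dynamic padding the question suggests could be sketched with a custom collate function along these lines. This is only an illustrative sketch, not the actual DeepSpeed-Chat implementation: `collate_dynamic_padding` is a hypothetical helper, and it assumes each dataset item stores unpadded 1-D `input_ids` / `attention_mask` tensors (i.e., the tokenizer is called with truncation but without `padding="max_length"`).

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def collate_dynamic_padding(batch, pad_token_id):
    # Pad every sample only up to the longest sequence in this batch,
    # instead of a fixed global max_seq_len.
    max_len = max(item["input_ids"].size(0) for item in batch)
    input_ids, attention_mask = [], []
    for item in batch:
        ids, mask = item["input_ids"], item["attention_mask"]
        pad_len = max_len - ids.size(0)
        # Right-pad token ids with pad_token_id and the mask with zeros.
        input_ids.append(F.pad(ids, (0, pad_len), value=pad_token_id))
        attention_mask.append(F.pad(mask, (0, pad_len), value=0))
    return {"input_ids": torch.stack(input_ids),
            "attention_mask": torch.stack(attention_mask)}

# Hypothetical usage: pass the collate function to the DataLoader so each
# batch is padded independently.
# loader = DataLoader(chosen_dataset, batch_size=8, shuffle=True,
#                     collate_fn=lambda b: collate_dynamic_padding(b, tokenizer.pad_token_id))
```

The trade-off is that batch shapes then vary from step to step, which can interact with features that assume a fixed sequence length (e.g., pre-allocated buffers or static-shape compilation), so whether this actually speeds up training depends on the rest of the pipeline.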
