DeepSpeedExamples
Question: Why not pad to the same sequence length within the batch during the SFT training phase?
Question: In the SFT training phase in DeepSpeed-Chat (dschat), I found that the function create_dataset_split in data_utils.py pads every sample to max_seq_len. Why not dynamically pad to the maximum sample length within each batch during training instead? That could significantly speed up training (see the sketch after the excerpt below).
if chosen_sentence is not None:
    chosen_sentence += end_of_conversation_token
    chosen_token = tokenizer(chosen_sentence,
                             max_length=max_seq_len,
                             padding="max_length",
                             truncation=True,
                             return_tensors="pt")
    chosen_token["input_ids"] = chosen_token["input_ids"].squeeze(0)
    chosen_token["attention_mask"] = chosen_token["attention_mask"].squeeze(0)
    chosen_dataset.append(chosen_token)
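
For comparison, dynamic per-batch padding is typically done with a custom collate_fn passed to the DataLoader, so each batch is padded only to its own longest sample. The following is a minimal sketch, not the DeepSpeed-Chat implementation: it assumes samples were tokenized without padding="max_length" and carry 1-D input_ids / attention_mask tensors, and the name dynamic_padding_collate is made up for illustration.

import torch

def dynamic_padding_collate(batch, pad_token_id):
    # batch: list of dicts with 1-D "input_ids" / "attention_mask" tensors
    max_len = max(item["input_ids"].size(0) for item in batch)
    input_ids, attention_mask = [], []
    for item in batch:
        pad_len = max_len - item["input_ids"].size(0)
        # right-pad input_ids with pad_token_id and attention_mask with 0
        input_ids.append(
            torch.nn.functional.pad(item["input_ids"], (0, pad_len), value=pad_token_id))
        attention_mask.append(
            torch.nn.functional.pad(item["attention_mask"], (0, pad_len), value=0))
    return {"input_ids": torch.stack(input_ids),
            "attention_mask": torch.stack(attention_mask)}

Usage would be something like DataLoader(train_dataset, batch_size=8, collate_fn=lambda b: dynamic_padding_collate(b, tokenizer.pad_token_id)). Note that per-batch padding only helps if batches are reasonably length-homogeneous (e.g. with length-grouped sampling); otherwise a single long sample still forces the whole batch close to max_seq_len.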