[BUG]: NaN loss in continued pretraining
🐛 Describe the bug
I logged the loss while running the continued pretraining script, and it is NaN for some batches. It turns out these lines are the cause: https://github.com/hpcaitech/ColossalAI/blob/2dd01e3a1430f223b9ef8e61b73cf17f60fccb07/applications/Colossal-LLaMA-2/colossal_llama2/dataset/spliced_and_tokenized_dataset.py#L59C1-L62C55
```python
# sequence truncation.
if len(sequence_input_ids) > max_length:
    sequence_input_ids = sequence_input_ids[:max_length]
    sequence_labels = sequence_labels[:max_length]
```
If the context alone is longer than max_length, the truncation cuts off every supervised token, so sequence_labels ends up containing only -100. The loss for such a batch is then averaged over zero valid tokens, which produces the NaN.
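A minimal sketch (my own, not code from the repo) of the failure mode, assuming the standard PyTorch cross-entropy with ignore_index=-100 that causal-LM losses typically use: when every label position is -100, the mean reduction divides by zero valid tokens and returns NaN.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # same sentinel the dataset code uses for masked labels

logits = torch.randn(1, 8, 32000)                  # (batch, seq_len, vocab_size)
labels = torch.full((1, 8), IGNORE_INDEX)          # every label truncated away -> all -100

# Mean-reduced cross-entropy over zero non-ignored tokens is 0 / 0 -> NaN.
loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),
    labels.view(-1),
    ignore_index=IGNORE_INDEX,
)
print(loss)  # tensor(nan)

# One possible guard (my suggestion, not the project's fix): detect samples whose
# labels are entirely IGNORE_INDEX and skip or re-truncate them before the loss.
has_valid_label = (labels != IGNORE_INDEX).any(dim=-1)
print(has_valid_label)  # tensor([False]) -> this sample should not reach the loss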
Environment
No response