[BUG]: NaN loss in continued pretraining
🐛 Describe the bug
I logged the loss while running the continued pretraining script, and it is NaN for some batches. It turns out these lines are the cause: https://github.com/hpcaitech/ColossalAI/blob/2dd01e3a1430f223b9ef8e61b73cf17f60fccb07/applications/Colossal-LLaMA-2/colossal_llama2/dataset/spliced_and_tokenized_dataset.py#L59C1-L62C55
```python
# sequence truncation.
if len(sequence_input_ids) > max_length:
    sequence_input_ids = sequence_input_ids[:max_length]
    sequence_labels = sequence_labels[:max_length]
```
If the context alone is longer than max_length, the truncation cuts off every supervised token, so sequence_labels ends up containing only -100. The loss for such a batch is then averaged over zero valid tokens, which produces the NaN.
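A minimal sketch (my own, not code from the repo) of the failure mode, assuming the standard PyTorch cross-entropy with ignore_index=-100 that causal-LM losses typically use: when every label position is -100, the mean reduction divides by zero valid tokens and returns NaN.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # same sentinel the dataset code uses for masked labels

logits = torch.randn(1, 8, 32000)                  # (batch, seq_len, vocab_size)
labels = torch.full((1, 8), IGNORE_INDEX)          # every label truncated away -> all -100

# Mean-reduced cross-entropy over zero non-ignored tokens is 0 / 0 -> NaN.
loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),
    labels.view(-1),
    ignore_index=IGNORE_INDEX,
)
print(loss)  # tensor(nan)

# One possible guard (my suggestion, not the project's fix): detect samples whose
# labels are entirely IGNORE_INDEX and skip or re-truncate them before the loss.
has_valid_label = (labels != IGNORE_INDEX).any(dim=-1)
print(has_valid_label)  # tensor([False]) -> this sample should not reach the loss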
Environment
No response