Loss becomes NaN after some training steps (sat SFT)
System Info / 系統信息
CUDA 11.8 / torch 2.4
Information / 问题信息
- [X] The official example scripts / 官方的示例脚本
- [ ] My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
I only changed the dataset path. The NaN log:
[2024-09-14 16:50:54,425] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2024-09-14 16:51:01,952] [INFO] [RANK 0] iteration 120/ 1000 | elapsed time per iteration (ms): 11114.2 | learning rate 9.060E-04 | total loss 2.123193E-01 | loss 2.123193E-01 | loss scale 32768.0 |speed 5.40 samples/(min*GPU)
[2024-09-14 16:51:01,952] [INFO] [RANK 0] time (ms) | forward: 3775.64 | backward: 2752.39 | allreduce: 0.00 | optimizer: 189.44 | data loader: 0.64
[2024-09-14 16:52:48,422] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
[2024-09-14 16:53:15,806] [INFO] [RANK 0] iteration 140/ 1000 | elapsed time per iteration (ms): 6692.7 | learning rate 8.870E-04 | total loss 7.151500E-01 | loss 7.151500E-01 | loss scale 16384.0 |speed 8.96 samples/(min*GPU)
[2024-09-14 16:53:15,807] [INFO] [RANK 0] time (ms) | forward: 3770.69 | backward: 2731.73 | allreduce: 0.00 | optimizer: 187.49 | data loader: 0.57
[2024-09-14 16:54:21,791] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=18, lr=[0.000878, 0.000878], mom=[[0.9, 0.95], [0.9, 0.95]]
[2024-09-14 16:55:11,984] [INFO] [RANK 0] Skipping backward and optimizer step for nan or inf in forwarding metrics/loss!
[2024-09-14 16:55:25,091] [INFO] [RANK 0] iteration 160/ 1000 | elapsed time per iteration (ms): 6464.2 | learning rate 8.680E-04 | total loss NAN | loss 9.045437E-01 | loss scale 16384.0 |speed 9.28 samples/(min*GPU)
NaN or Inf found in input tensor.
[2024-09-14 16:55:25,093] [INFO] [RANK 0] time (ms) | forward: 3713.26 | backward: 2566.17 | allreduce: 0.00 | optimizer: 181.86 | data loader: 0.57
[2024-09-14 16:57:01,495] [INFO] [RANK 0] Skipping backward and optimizer step for nan or inf in forwarding metrics/loss!
[2024-09-14 16:57:34,516] [INFO] [RANK 0] iteration 180/ 1000 | elapsed time per iteration (ms): 6471.2 | learning rate 8.490E-04 | total loss NAN | loss 7.618577E-01 | loss scale 16384.0 |speed 9.27 samples/(min*GPU)
NaN or Inf found in input tensor.
[2024-09-14 16:57:34,517] [INFO] [RANK 0] time (ms) | forward: 3724.61 | backward: 2560.48 | allreduce: 0.00 | optimizer: 183.58 | data loader: 0.60
[2024-09-14 16:59:47,000] [INFO] [RANK 0] iteration 200/ 1000 | elapsed time per iteration (ms): 6624.2 | learning rate 8.290E-04 | total loss 6.272916E-01 | loss 6.272915E-01 | loss scale 16384.0 |speed 9.06 samples/(min*GPU)
[2024-09-14 16:59:47,002] [INFO] [RANK 0] time (ms) | forward: 3730.16 | backward: 2697.21 | allreduce: 0.00 | optimizer: 194.12 | data loader: 0.63
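For context on the two kinds of messages above: the "OVERFLOW! ... reducing to ..." lines are DeepSpeed's dynamic fp16 loss scaler halving the loss scale after detecting overflowing gradients, while "Skipping backward and optimizer step for nan or inf in forwarding metrics/loss!" comes from a finiteness check on the loss before the update. A minimal sketch of that kind of guard in a generic PyTorch loop (illustrative only, not the sat trainer's actual code):

```python
import torch

def train_step(model, batch, optimizer):
    loss = model(**batch)  # assume the model returns a scalar loss

    # Skip the update when the forward pass already produced NaN/Inf,
    # analogous to the "Skipping backward and optimizer step ..." log line.
    if not torch.isfinite(loss):
        optimizer.zero_grad(set_to_none=True)
        return None

    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()
```

When the scaler keeps halving (65536 → 32768 → 16384) and NaN losses still show up, the cause is usually not a transient fp16 overflow but something persistent, such as a too-high learning rate or problematic samples in the newly swapped-in dataset.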
Expected behavior / 期待表现
Training should proceed without the loss becoming NaN.
Hi guys, we may have the same problem. Can we connect on WeChat?
Maybe the learning rate is too large; the default lr is 0.001:
# configs/sft.yaml
# Between 1E-3 and 5E-4 For Lora and 1E-5 For SFT
@AlphaNext where do you see that the default is 0.001? In sft.yaml it's lr: 0.00001.
https://github.com/THUDM/CogVideo/blob/4a2af29867ed71ca9c739de633c80746a5915208/sat/configs/sft.yaml#L55-L61
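Note that the log above reports learning rate 9.060E-04, i.e. roughly 1E-3, which the comment in sft.yaml reserves for LoRA; full-parameter SFT is expected to use about 1E-5, and running SFT near 1E-3 may well explain the fp16 overflows and NaN losses. A quick, illustrative way to check which lr your config actually resolves to (the file path below is an assumption; point it at the sft.yaml you pass to training):

```python
import yaml  # pip install pyyaml

# Print every numeric key containing "lr" found anywhere in the config, so you can
# confirm whether the run really uses ~1E-5 rather than ~1E-3.
with open("sat/configs/sft.yaml") as f:
    cfg = yaml.safe_load(f)

def find_lr(node, prefix=""):
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if "lr" in key.lower() and isinstance(value, (int, float)):
                print(path, "=", value)
            find_lr(value, path)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            find_lr(value, f"{prefix}[{i}]")

find_lr(cfg)
```

If this prints something close to 1E-3 for a full SFT run, lowering it to about 1E-5, as the file's own comment suggests, is the first thing to try.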