[BUG]: The training code of reward model may be wrong

Open Luoyang144 opened this issue 1 year ago • 10 comments

πŸ› Describe the bug

I'm trying to train a reward model with the example script, but after ten epochs of training its eval result still shows dist=nan, acc=0.

Is there anything wrong with the training code?

Environment

Installed following the instructions here.

Luoyang144 avatar Apr 07 '23 02:04 Luoyang144

I found the bug; the loss function may be wrong. Here are the reward and loss during training:

chosen reward tensor([-0.0287], device='cuda:0', dtype=torch.float16,                                                                           | 0/100 [00:00<?, ?it/s]
       grad_fn=<SqueezeBackward1>)
reject reward tensor([-0.0312], device='cuda:0', dtype=torch.float16,
       grad_fn=<SqueezeBackward1>)
loss tensor(0.6924, device='cuda:0', dtype=torch.float16, grad_fn=<MeanBackward0>)
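
For reference, 0.6924 is close to ln 2 ≈ 0.693, which is what the usual pairwise log-sigmoid ranking loss gives when the chosen and rejected rewards are almost equal. A minimal sketch of that calculation (an assumption about what the example computes, not copied from the repo):

import math
import torch
import torch.nn.functional as F

# Rewards taken from the log above, in fp32 for clarity
chosen_reward = torch.tensor([-0.0287])
reject_reward = torch.tensor([-0.0312])

# Pairwise ranking loss: -log(sigmoid(chosen - rejected))
loss = -F.logsigmoid(chosen_reward - reject_reward).mean()
print(loss.item(), math.log(2))   # ~0.6919 vs 0.6931, matching the logged 0.6924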

After one backward step, the reward and loss become nan:

chosen reward tensor([nan], device='cuda:0', dtype=torch.float16, grad_fn=<SqueezeBackward1>)                            | 1/100 [00:01<03:15,  1.97s/it, dist=0, acc=0]
reject reward tensor([nan], device='cuda:0', dtype=torch.float16, grad_fn=<SqueezeBackward1>)
loss tensor(nan, device='cuda:0', dtype=torch.float16, grad_fn=<MeanBackward0>)
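
To narrow down where the nan first appears, a generic PyTorch debugging sketch (nothing repo-specific here) is to enable anomaly detection and check gradients right after backward:

import torch

# Make backward() raise at the op that produced nan/inf
torch.autograd.set_detect_anomaly(True)

def report_bad_grads(model):
    # Call after loss.backward() to find the first parameter with non-finite grads
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite grad in {name}")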

Any idea about how to solve this problem?

Luoyang144 avatar Apr 07 '23 05:04 Luoyang144

Same problem.

YiAthena avatar Apr 08 '23 15:04 YiAthena

Try other strategies, e.g. colossalai_zero2.
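
For reference, a sketch of how the strategy is typically switched in the coati examples; the class names and arguments below are assumptions about that version of the API, so check the repo before copying:

# Assumed coati API (names/arguments may differ in your checkout).
# ZeRO-2 keeps fp32 master weights for the optimizer step, which avoids the
# overflow you can hit when the whole model is cast to fp16.
from coati.trainer.strategies import ColossalAIStrategy, DDPStrategy

strategy = ColossalAIStrategy(stage=2)   # "colossalai_zero2"
# strategy = DDPStrategy()               # or plain DDP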

HuangLK avatar Apr 09 '23 03:04 HuangLK

Same problem.

MyHerbTea avatar Apr 11 '23 10:04 MyHerbTea

Try other strategies, e.g. colossalai_zero2.

@HuangLK I tried other strategies, but the problem still exists. Why do you think other strategies would solve this problem? Thanks~

LuciusMos avatar Apr 11 '23 11:04 LuciusMos

I found the problem (maybe): if you delete model = model.to(torch.float16) in the Python file, you get a normal loss value, but accuracy and distance only change slightly during training. I don't know whether this fully solves the problem.
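
If you still want fp16 speed, a common alternative is torch.cuda.amp mixed precision, which keeps fp32 master weights and scales the loss so gradients don't overflow or underflow. A minimal sketch, not the ColossalAI example itself; the model and data below are placeholders:

import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda"
model = nn.Linear(16, 1).to(device)              # placeholder reward model, kept in fp32
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scaler = torch.cuda.amp.GradScaler()

chosen = torch.randn(4, 16, device=device)       # placeholder chosen/rejected features
rejected = torch.randn(4, 16, device=device)

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    chosen_reward = model(chosen).squeeze(-1)
    reject_reward = model(rejected).squeeze(-1)
    loss = -F.logsigmoid(chosen_reward - reject_reward).mean()

scaler.scale(loss).backward()    # scaled to avoid fp16 gradient underflow
scaler.step(optimizer)           # unscales and updates the fp32 weights
scaler.update()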

Luoyang144 avatar Apr 11 '23 11:04 Luoyang144

I found the problem (maybe): if you delete model = model.to(torch.float16) in the Python file, you get a normal loss value, but accuracy and distance only change slightly during training. I don't know whether this fully solves the problem.

@Luoyang144 Thank you so much for the information. Deleting this line works for me as well. My [dist, acc] increased from [0.01, 0.60] to [0.45, 0.66]. I used the "ddp" strategy and trained on 8 GPUs.

LuciusMos avatar Apr 12 '23 03:04 LuciusMos

@LuciusMos Thanks for sharing!

Luoyang144 avatar Apr 14 '23 01:04 Luoyang144

Hi @Luoyang144 @LuciusMos @MyHerbTea @YiAthena After verification, this is not a bug in the code but was caused by an inappropriate sh command. We have fixed it. Thanks. https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/examples/train_rm.sh

binmakeswell avatar Apr 17 '23 07:04 binmakeswell

@LuciusMos Thanks for sharing!

Hello, did you finally solve the problem? I used the newest sh command, but the problem (dist=nan, acc=0) still exists.

TTYee avatar May 25 '23 13:05 TTYee