DeepSpeedExamples
My model performs badly... Is GPU memory too small?
Hi! I trained the model just as you directed, but the model's generation is very, very bad. It cannot even produce a complete sentence. And when I train step 3, its reward score is NaN. What is going wrong during training? Please help me. Thanks very much! Just like this:
|E2E latency=3.25s |Gather latency=0.00s (0.00%) |Generate time=2.35s (72.45%) |Training time=0.77s (23.71%) |Others=0.12 (3.84%)|CurSamplesPerSec=1.23 |AvgSamplesPerSec=1.00
[2023-04-15 05:51:04,058] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1860
[2023-04-15 05:51:04,058] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1
[2023-04-15 05:51:04,058] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2023-04-15 05:51:04,174] [INFO] [fused_optimizer.py:362:_update_scale] Grad overflow on iteration 1860
[2023-04-15 05:51:04,174] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1
[2023-04-15 05:51:04,174] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
epoch: 0|step: 3721|ppo_ep: 1|act_loss: nan|cri_loss: nan|unsuper_loss: 0.0
average reward score: nan
I noticed that:
[fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1
Is the GPU memory too small?
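For context on that log line: with fp16 training, DeepSpeed's dynamic loss scaler halves the scale on every gradient overflow, but never below `min_loss_scale` (default 1). "Reducing dynamic loss scale from 1 to 1" therefore means the scale is already pinned at its floor while overflows keep happening, so every step is skipped, which matches the NaN losses above. Note that too little GPU memory normally raises a CUDA out-of-memory error rather than producing NaNs, so this log points at fp16 numeric instability rather than memory size. A minimal sketch of the relevant fp16 block of a DeepSpeed config (the key names are DeepSpeed's documented fp16 options; the values shown are the defaults, for illustration, not a verified fix):

```python
# Sketch: the fp16 block of a DeepSpeed config, written as the Python
# dict you would pass to deepspeed.initialize(config=...).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 = dynamic loss scaling
        "initial_scale_power": 16,  # start the scale at 2**16
        "loss_scale_window": 1000,  # overflow-free steps before the scale is raised
        "hysteresis": 2,
        "min_loss_scale": 1,        # the floor; "from 1 to 1" means it is pinned here
    },
}
```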
The chat example:
------------------------------ Round 1 ------------------------------
Human: Hello!
Assistant: I’m sorry, I’m not sure
Enter input (type 'quit' to exit, 'clear' to clean memory): What is your name?
------------------------------ Round 2 ------------------------------
Human: Hello!
Assistant: I’m sorry, I’m not sure
Human: What is your name?
Assistant: I
Enter input (type 'quit' to exit, 'clear' to clean memory): Can you speak?
------------------------------ Round 3 ------------------------------
Human: Hello!
Assistant: I’m sorry, I’m not sure
Human: What is your name?
Assistant: I
Human: Can you speak?
Assistant: I
Enter input (type 'quit' to exit, 'clear' to clean memory): I think you are saying I? What happened to you?
------------------------------ Round 4 ------------------------------
Human: Hello!
Assistant: I’m sorry, I’m not sure
Human: What is your name?
Assistant: I
Human: Can you speak?
Assistant: I
Human: I think you are saying I? What happened to you?
Assistant: I
Enter input (type 'quit' to exit, 'clear' to clean memory):
My device:
single GPU: 1x RTX 3090 (24 GB)
batch_size: 4 for both training and eval
Environment:
python 3.8.0
deepspeed 0.9.0
huggingface-hub 0.5.1
pytorch 1.12.1 py3.8_cuda11.3_cudnn8.3.2_0
transformers 4.20.0
Thanks for answering!
Which model is this? The 1.3B?
Yes, the 1.3B actor model with the 350M reward model. I retrained it, but it still performs almost the same.
me too
Assistant: I’m sorry, but I’m not sure what you’re asking.
@Trace2333 @loonxi We just pushed an update (#346) to the scripts to train the models and also provide a bit more detail about the settings we are using and potential instability in the training. Please take a look here: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/README.md
Same here, please see my attached screenshot:
Yes, I encountered the same strange thing:
I used the command for training (I changed the reward model from 350M to 120M, since an error occurred when using 350M, maybe OOM, see #380; a generic memory-saving sketch follows after the CUDA info below):
Additional info about my machine: I have 2 GPUs (40 GB each), and here is the CUDA info:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
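On the possible OOM with the 350M reward model mentioned above: one generic DeepSpeed-level mitigation (a sketch using DeepSpeed's documented ZeRO config keys, not the actual command-line flags of the DeepSpeed-Chat scripts) is ZeRO stage 2 with optimizer-state offload to CPU:

```python
# Sketch: ZeRO stage 2 with CPU optimizer offload in a DeepSpeed config.
# This trades GPU memory for host RAM and PCIe traffic; whether it lets
# the 350M reward model fit on a 40 GB GPU is untested here.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {
        "stage": 2,                              # partition optimizer states and gradients
        "offload_optimizer": {"device": "cpu"},  # move optimizer states to host memory
    },
    "fp16": {"enabled": True},
}
```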
I reran the newly updated scripts (1.3B actor with 350M reward model), and the results are still surprisingly bad...:
with trained actor:
with trained actor_ema:
what's wrong... sad...
Thanks for your help! I'm retraining the model with the updated scripts. I hope the model will work this time.
@Trace2333 @vpegasus @bingjie3216 @loonxi @mrwyattii this issue might be useful to you: https://github.com/microsoft/DeepSpeedExamples/issues/442
Thanks for your kind reply, I'll try it later.
I also hit the problem of the reward being NaN. I'm confused how the reward can become NaN, since it is a scalar from another model's output.
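A single scalar reward can still come out as NaN if the reward model's forward pass overflows in fp16, which would match the pinned loss scale above. One way to localize it (a hypothetical debugging sketch; `reward_model` and its call signature stand in for whatever your step-3 code actually uses and are not names from the DeepSpeed-Chat repo) is to check the reward for finiteness right where it leaves the reward model, before it enters the PPO update:

```python
import torch

def checked_reward(reward_model, seq, attention_mask):
    """Hypothetical wrapper: fail fast if the reward model emits NaN/inf."""
    with torch.no_grad():
        reward = reward_model(seq, attention_mask)  # one scalar per sample
    if not torch.isfinite(reward).all():
        # NaN here means the reward model itself produced it (e.g. an fp16
        # overflow in its forward pass), not the downstream PPO math.
        raise ValueError(f"non-finite reward detected: {reward}")
    return reward
```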