
My model performs badly... Is GPU memory too small?

Open Trace2333 opened this issue 1 year ago • 12 comments

Hi! I trained the model just as you directed, but the model's generations are very bad. It cannot even produce a complete sentence. And when I train step 3, the reward score is nan. What is going wrong during training? Please help me, thanks very much! It looks like this:

|E2E latency=3.25s |Gather latency=0.00s (0.00%) |Generate time=2.35s (72.45%) |Training time=0.77s (23.71%) |Others=0.12 (3.84%)|CurSamplesPerSec=1.23 |AvgSamplesPerSec=1.00
[2023-04-15 05:51:04,058] [INFO] [fused_optimizer.py:362:_update_scale]
Grad overflow on iteration 1860
[2023-04-15 05:51:04,058] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1
[2023-04-15 05:51:04,058] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2023-04-15 05:51:04,174] [INFO] [fused_optimizer.py:362:_update_scale]
Grad overflow on iteration 1860
[2023-04-15 05:51:04,174] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1
[2023-04-15 05:51:04,174] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
epoch: 0|step: 3721|ppo_ep: 1|act_loss: nan|cri_loss: nan|unsuper_loss: 0.0
average reward score: nan

I noticed this line: [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1. Is the GPU memory too small?
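
For reference, "Reducing dynamic loss scale from 1 to 1" means the fp16 dynamic loss scaler has already hit its floor, so the repeated overflows point at numerical instability rather than at GPU memory (running out of memory would surface as a CUDA out-of-memory error, not as a loss-scale overflow). Below is a minimal sketch of the relevant fp16 knobs in a DeepSpeed config dict; the key names are real DeepSpeed config options, but the values are illustrative, not the DeepSpeed-Chat defaults:

# Sketch of the fp16 section of a DeepSpeed config dict. Key names are real
# DeepSpeed options; the values are illustrative, not DeepSpeed-Chat's defaults.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 = dynamic loss scaling
        "initial_scale_power": 16,  # start the scale at 2**16
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,        # the log above shows the scaler stuck at this floor
    },
    # On Ampere GPUs (a 3090 qualifies) one common workaround is to skip fp16
    # loss scaling entirely and train in bf16 instead:
    # "bf16": {"enabled": True},
}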

The chat example:

------------------------------ Round 1 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure
Enter input (type 'quit' to exit, 'clear' to clean memory): What is your name?
------------------------------ Round 2 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure

 Human: What is your name?
 Assistant:  I
Enter input (type 'quit' to exit, 'clear' to clean memory): Can you speak?
------------------------------ Round 3 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure

 Human: What is your name?
 Assistant:  I

 Human: Can you speak?
 Assistant:  I
Enter input (type 'quit' to exit, 'clear' to clean memory): I think you are saying I?What happened to you?
------------------------------ Round 4 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure

 Human: What is your name?
 Assistant:  I

 Human: Can you speak?
 Assistant:  I

 Human: I think you are saying I?What happened to you?
 Assistant:  I
Enter input (type 'quit' to exit, 'clear' to clean memory):
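
A degenerate reply like "Assistant:  I" is consistent with the actor's weights having picked up NaN/Inf values during the unstable step-3 updates. One quick way to rule that in or out is to scan the saved checkpoint; a minimal sketch, assuming the actor was exported as a Hugging Face checkpoint (the path is a placeholder):

import torch
from transformers import AutoModelForCausalLM

# Placeholder path for the step-3 actor output directory.
model = AutoModelForCausalLM.from_pretrained("./output/actor")

# List parameter tensors that contain NaN or Inf.
bad = [name for name, p in model.named_parameters() if not torch.isfinite(p).all()]
print(f"{len(bad)} parameter tensors contain NaN/Inf")
print(bad[:5])

If the scan comes back clean, the collapse is more likely in the trained policy itself than in corrupted parameters.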

My Device:

single-GPU 1x3090 24GB
batch_size 4 for training and eval.

Environment:

python                    3.8.0
deepspeed                 0.9.0
huggingface-hub           0.5.1
pytorch                   1.12.1          py3.8_cuda11.3_cudnn8.3.2_0
transformers              4.20.0

Thanks in advance for your help!

Trace2333 avatar Apr 15 '23 02:04 Trace2333

Which model is this? The 1.3B?

mrwyattii avatar Apr 17 '23 16:04 mrwyattii

Which model is this? The 1.3B?

Yes, 1.3B actor-model with 350M reward model. I retrained it, but it still performs almost the same.

Trace2333 avatar Apr 19 '23 01:04 Trace2333

me too

Assistant: I’m sorry, but I’m not sure what you’re asking.

loonxi avatar Apr 21 '23 09:04 loonxi

@Trace2333 @loonxi We just pushed an update (#346) to the scripts to train the models and also provide a bit more detail about the settings we are using and potential instability in the training. Please take a look here: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/README.md

mrwyattii avatar Apr 21 '23 23:04 mrwyattii

Same here, please see the attached screenshot: [screenshot]

bingjie3216 avatar Apr 22 '23 19:04 bingjie3216

Yes, I encountered the same strange thing: [screenshot: 2023-04-23 09:12]

I used this command for training (I changed the reward model from 350m to 120m, as an error occurred when using 350m, maybe OOM, see #380): [screenshot: 2023-04-23 09:12]

Additional info about my machine: I have 2 GPUs (40 GB each), and here is the CUDA info:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

vpegasus avatar Apr 23 '23 01:04 vpegasus

I re-ran the newly updated scripts (1.3b actor with 350m reward model), and the results are still surprisingly bad:

with the trained actor: [screenshot: 2023-04-24 11:47]

with the trained actor_ema: [screenshot: 2023-04-24 11:47]

what's wrong... sad...

vpegasus avatar Apr 24 '23 03:04 vpegasus

@Trace2333 @loonxi We just pushed an update (#346) to the scripts to train the models and also provide a bit more detail about the settings we are using and potential instability in the training. Please take a look here: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/README.md

Thanks for your help! I'm using the updated scripts to retrain the model now. Hopefully it will work this time.

Trace2333 avatar Apr 24 '23 04:04 Trace2333

@Trace2333 @vpegasus @bingjie3216 @loonxi @mrwyattii this issue might be useful to you: https://github.com/microsoft/DeepSpeedExamples/issues/442

DanqingZ avatar May 07 '23 23:05 DanqingZ

@Trace2333 @vpegasus @bingjie3216 @loonxi @mrwyattii this issue might be useful to you: #442

Thanks for your kind reply, I'll try it later.

vpegasus avatar May 19 '23 02:05 vpegasus

I also ran into the problem of the reward being nan. I'm confused about how the reward can become nan, since it is a scalar taken from another model's output.

ymr12 avatar Jun 05 '23 09:06 ymr12
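
For reference, since the reward is just a scalar read off another model's forward pass, a NaN there usually means an overflow or invalid operation somewhere upstream in the fp16 computation. A minimal sketch for localizing where the first NaN appears during a PPO step; the function and variable names are illustrative, not DeepSpeed-Chat's actual names:

import torch

def check_finite(name: str, t: torch.Tensor) -> bool:
    """Print a warning and return True if the tensor contains NaN or Inf."""
    if torch.isfinite(t).all():
        return False
    print(f"[nan-check] {name} contains NaN/Inf (shape={tuple(t.shape)}, dtype={t.dtype})")
    return True

# Illustrative call sites inside one PPO iteration (hypothetical variable names):
#   check_finite("reward_score", reward_score)  # reward-model output
#   check_finite("values", values)              # critic value estimates
#   check_finite("log_probs", log_probs)        # actor log-probs on the rollout
#
# If the reward itself is already NaN, the reward model's fp16 forward pass
# likely overflowed; if the reward is finite but act_loss/cri_loss turn NaN,
# the instability is in the PPO update after the loss scale collapsed to 1.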