
My model performs badly... Is GPU memory too small?

Open Trace2333 opened this issue 1 year ago • 8 comments

Hi! I trained the model following your instructions, but the generation quality is very bad: it cannot even produce a complete sentence. Also, when I train step 3, the reward score is nan. What went wrong during training? Please help. Thanks very much! The training log looks like this:

|E2E latency=3.25s |Gather latency=0.00s (0.00%) |Generate time=2.35s (72.45%) |Training time=0.77s (23.71%) |Others=0.12 (3.84%)|CurSamplesPerSec=1.23 |AvgSamplesPerSec=1.00
[2023-04-15 05:51:04,058] [INFO] [fused_optimizer.py:362:_update_scale]
Grad overflow on iteration 1860
[2023-04-15 05:51:04,058] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1
[2023-04-15 05:51:04,058] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2023-04-15 05:51:04,174] [INFO] [fused_optimizer.py:362:_update_scale]
Grad overflow on iteration 1860
[2023-04-15 05:51:04,174] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1
[2023-04-15 05:51:04,174] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
epoch: 0|step: 3721|ppo_ep: 1|act_loss: nan|cri_loss: nan|unsuper_loss: 0.0
average reward score: nan

I noticed this line in the log: [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1. Is the GPU memory too small?
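To make the question concrete, here is a minimal sketch of how a dynamic loss scaler usually behaves in fp16 training. It is an assumption-laden toy, not DeepSpeed's code: `ToyLossScaler`, `init_scale`, `scale_factor`, and `min_scale` are hypothetical names chosen for illustration. The idea is that each overflow (nan/inf gradient) shrinks the scale and skips the step, and once the scale sits at its floor it can no longer shrink, which would produce a "from 1 to 1" message on every step.

```python
import math


class ToyLossScaler:
    """Simplified dynamic loss scaler (illustrative only, not DeepSpeed's implementation)."""

    def __init__(self, init_scale=2.0 ** 16, scale_factor=2.0, min_scale=1.0):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.min_scale = min_scale
        self.skipped_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should be applied, False if it is skipped."""
        overflow = any(math.isnan(g) or math.isinf(g) for g in grads)
        if overflow:
            # Shrink the scale, but never below the floor.
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
            self.skipped_steps += 1
            return False
        return True


scaler = ToyLossScaler(init_scale=4.0)
history = []
for _ in range(5):
    # Gradients that are already nan overflow at every scale.
    applied = scaler.update([float("nan"), 0.1])
    history.append((applied, scaler.scale))

print(history)
print("skipped steps:", scaler.skipped_steps)
```

If this picture is right, a scale stuck at 1 with every step skipped would point to gradients that are nan/inf regardless of scaling (a numerical problem) rather than directly to the GPU running out of memory, which normally shows up as an explicit out-of-memory error.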

The chat example:

------------------------------ Round 1 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure
Enter input (type 'quit' to exit, 'clear' to clean memory): What is your name?
------------------------------ Round 2 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure

 Human: What is your name?
 Assistant:  I
Enter input (type 'quit' to exit, 'clear' to clean memory): Can you speak?
------------------------------ Round 3 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure

 Human: What is your name?
 Assistant:  I

 Human: Can you speak?
 Assistant:  I
Enter input (type 'quit' to exit, 'clear' to clean memory): I think you are saying I?What happened to you?
------------------------------ Round 4 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure

 Human: What is your name?
 Assistant:  I

 Human: Can you speak?
 Assistant:  I

 Human: I think you are saying I?What happened to you?
 Assistant:  I
Enter input (type 'quit' to exit, 'clear' to clean memory):

My Device:

single-GPU 1x3090 24GB
batch_size 4 for training and eval.

Environment:

python                    3.8.0
deepspeed                 0.9.0
huggingface-hub           0.5.1
pytorch                   1.12.1          py3.8_cuda11.3_cudnn8.3.2_0
transformers              4.20.0

Thanks for your help!

Trace2333 · Apr 15 '23 02:04