
Overflow in deepspeed-chat LoRA and BF16 mode

Open • THULiusj opened this issue on Oct 20, 2023 • 1 comment

  • Example: Deepspeed-chat
  • Model: Llama2-7b-hf
  • Mode: LoRA, lora_dim=128
  • Precision: FP16
  • Output log as below:
  • Question: Does this log mean training is proceeding correctly? The output differs from the logs of SFT-only and LoRA-only runs, which print the loss at each step. If it is not correct, how can I make LoRA mode run correctly?
```
Model Parameters: 6.927 B, Latency: 6.08s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512
Model Parameters: 6.927 B, Latency: 6.06s, TFLOPs: 1.73, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.73, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512
[2023-10-20 09:49:17,004] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=6, lr=[9.618683345445294e-06, 0.0004983773754116733], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-10-20 09:49:17,004] [INFO] [timer.py:260:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=5.187625237365746, CurrSamplesPerSec=5.008550092748559, MemAllocated=3.9GB, MaxMemAllocated=6.71GB
Model Parameters: 6.927 B, Latency: 6.39s, TFLOPs: 1.64, Samples/sec: 0.63, Time/seq 1.60s, Batch Size: 4, Sequence Length: 512
Model Parameters: 6.927 B, Latency: 6.06s, TFLOPs: 1.73, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512
Model Parameters: 6.927 B, Latency: 6.09s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512
Model Parameters: 6.927 B, Latency: 6.08s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512
```

THULiusj (Oct 20, 2023)
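A note on the skipped=6 field in the log above: in DeepSpeed's FP16 mode, dynamic loss scaling skips the optimizer step whenever a gradient overflow is detected and lowers the loss scale, and the skipped counter in the log_dist line counts those skipped steps. BF16 has the same exponent range as FP32, so DeepSpeed's BF16 path needs no loss scaling and should not skip steps for overflow; a nonzero skipped counter therefore suggests the run is executing under FP16 loss scaling, which matches the "Precision: FP16" setting listed above. Below is a minimal, hedged sketch of the two relevant DeepSpeed config sections; the key names follow the public DeepSpeed config schema, but the values are illustrative defaults, not the exact ones the DeepSpeed-Chat scripts generate.

```python
import json

# Sketch of the two relevant DeepSpeed config sections.
# Key names are from the public DeepSpeed config schema;
# values here are illustrative, not DeepSpeed-Chat's exact settings.

# FP16 with dynamic loss scaling: on gradient overflow the optimizer
# step is skipped and the loss scale is lowered. The "skipped=6"
# counter in the log above reports exactly these skipped steps.
fp16_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 => dynamic loss scaling
        "initial_scale_power": 16,  # starting scale = 2**16
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    }
}

# BF16 has the same exponent range as FP32, so no loss scaling is
# needed and no steps are skipped for overflow.
bf16_config = {
    "bf16": {
        "enabled": True
    }
}

print(json.dumps(fp16_config, indent=2))
print(json.dumps(bf16_config, indent=2))
```

A few skipped steps early in a run are normal while the dynamic loss scale settles; persistent skips are the usual symptom of FP16 overflow, and lowering initial_scale_power or switching to the bf16 section (on hardware that supports it) are the common workarounds.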

I have the same question.

cy565025164 (Nov 26, 2023)