
Overflow in deepspeed-chat LoRA and BF16 mode

Open · THULiusj opened this issue 9 months ago · 1 comment

  • Example: DeepSpeed-Chat
  • Model: Llama2-7b-hf
  • Mode: LoRA, lora_dim=128
  • Precision: FP16
  • Output log as below:
  • Question: Does this log mean training is running correctly? The log differs from the logs of the SFT and LoRA-only modes, which print a loss at each step. If it is not training correctly, how can I make LoRA mode run correctly?
Model Parameters: 6.927 B, Latency: 6.08s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512                                                                                                                                                            
Model Parameters: 6.927 B, Latency: 6.06s, TFLOPs: 1.73, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512                                                                                                                                                            
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512                                                                                                                                                            
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512                                                                                                                                                            
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512                                                                                                                                                            
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512                                                                                                                                                            
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512                                                                                                                                                            
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.73, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512                                                                                                                                                            
[2023-10-20 09:49:17,004] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=6, lr=[9.618683345445294e-06, 0.0004983773754116733], mom=[(0.9, 0.95), (0.9, 0.95)]                                                                                                                   
[2023-10-20 09:49:17,004] [INFO] [timer.py:260:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=5.187625237365746, CurrSamplesPerSec=5.008550092748559, MemAllocated=3.9GB, MaxMemAllocated=6.71GB                                                                      
Model Parameters: 6.927 B, Latency: 6.39s, TFLOPs: 1.64, Samples/sec: 0.63, Time/seq 1.60s, Batch Size: 4, Sequence Length: 512                                                                                                                                                            
Model Parameters: 6.927 B, Latency: 6.06s, TFLOPs: 1.73, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512                                                                                                                                                            
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512                                                                                                                                                            
Model Parameters: 6.927 B, Latency: 6.09s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512                                                                                                                                                            
Model Parameters: 6.927 B, Latency: 6.08s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512                                                                                                                                                            
Model Parameters: 6.927 B, Latency: 6.07s, TFLOPs: 1.72, Samples/sec: 0.66, Time/seq 1.52s, Batch Size: 4, Sequence Length: 512 

THULiusj · Oct 20 '23 09:10
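
A note on reading the log above: skipped=6 in the log_dist line is DeepSpeed's FP16 dynamic loss scaler reporting that 6 of the first 40 optimizer steps were skipped because of gradient overflow, and the "Model Parameters: ..." lines are throughput statistics (latency, TFLOPs, samples/sec), which never include the loss. As a minimal sketch of one common mitigation, assuming the training script builds its DeepSpeed config as a Python dict (the keys below are standard DeepSpeed config keys, but the values are illustrative, not deepspeed-chat's exact defaults): switching from fp16 to bf16 removes dynamic loss scaling entirely, since bf16 shares fp32's exponent range.

# Illustrative DeepSpeed config sketch; values are assumptions, not the
# exact deepspeed-chat defaults.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    # FP16 with dynamic loss scaling is the path that logs "skipped=N"
    # when gradients overflow:
    #   "fp16": {"enabled": True, "loss_scale": 0},
    # BF16 has the same exponent range as FP32, so it needs no loss
    # scaling and overflow-skipped steps do not occur:
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "gradient_clipping": 1.0,
}

Note that fp16 and bf16 must not both be enabled in the same config; the trade-off is bf16's smaller mantissa (less precision per value) against fp16's narrower dynamic range (the source of the overflow in the title).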