
Why can't GeminiPlugin (ZeRO-3 + offloading) train a 7B model?

Open SeekPoint opened this issue 8 months ago • 1 comments

I have a machine with 1 TB of CPU memory and four 2080 Ti cards (22 GB each).

I tried ZeRO-3 with offloading, like this:

    plugin = GeminiPlugin(
        precision=args.mixed_precision,
        initial_scale=2**16,
        shard_param_frac=1,
        offload_optim_frac=1,
        offload_param_frac=1,
        tp_size=4,
        max_norm=args.grad_clip,
    )

I used this with a 7B model, Llama2-Chinese-7b-Chat-ms, but it reports a GPU OOM error.

SeekPoint, Apr 21 '25 04:04

You might need to turn on gradient checkpointing.

eiPI1-0, Jul 23 '25 07:07
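
A rough back-of-the-envelope sketch of why gradient checkpointing is a plausible fix here. Offloading moves parameters, gradients, and optimizer states to CPU memory, but activations saved for the backward pass stay on the GPU. The parameter count (7e9), the sequence length (2048), the micro-batch size (1), and the Llama-2-7B-style shape (hidden size 4096, 32 heads, 32 layers) are assumptions for illustration, not values from the post. The per-layer activation formula is the commonly cited 16-bit estimate for a standard transformer layer without fused attention, so treat the results as order-of-magnitude only; tensor parallelism and fused kernels would change them.

```python
GIB = 1024 ** 3


def model_state_bytes(n_params):
    """Mixed-precision Adam model states, in bytes.

    fp16 params (2 B) + fp16 grads (2 B) + fp32 master weights,
    momentum, and variance (4 B each, 12 B total) per parameter.
    """
    return {
        "fp16_params": 2 * n_params,
        "fp16_grads": 2 * n_params,
        "fp32_optimizer": 12 * n_params,
    }


def activation_bytes_per_layer(seq_len, batch, hidden, heads):
    """Approximate 16-bit activation bytes saved by one transformer layer.

    Uses the estimate s*b*h*(34 + 5*a*s/h). This is an approximation
    for a standard layer, not an exact figure for Llama.
    """
    return seq_len * batch * hidden * (34 + 5 * heads * seq_len / hidden)


def full_activation_bytes(seq_len, batch, hidden, heads, layers):
    """Activations kept on the GPU when nothing is recomputed."""
    return layers * activation_bytes_per_layer(seq_len, batch, hidden, heads)


def checkpointed_activation_bytes(seq_len, batch, hidden, heads, layers):
    """Activations with per-layer checkpointing.

    Only each layer's fp16 input (2*s*b*h bytes) is stored; one layer's
    full activations are rebuilt at a time during the backward pass.
    """
    stored_inputs = 2 * seq_len * batch * hidden * layers
    one_layer = activation_bytes_per_layer(seq_len, batch, hidden, heads)
    return stored_inputs + one_layer


if __name__ == "__main__":
    # Assumed values, not taken from the original post.
    n_params = 7_000_000_000
    shape = dict(seq_len=2048, batch=1, hidden=4096, heads=32, layers=32)

    states = model_state_bytes(n_params)
    print(f"model states (offloadable): {sum(states.values()) / GIB:.1f} GiB")
    print(f"activations, no checkpointing: {full_activation_bytes(**shape) / GIB:.1f} GiB")
    print(f"activations, checkpointing: {checkpointed_activation_bytes(**shape) / GIB:.1f} GiB")
```

Under these assumptions the un-checkpointed activations alone are larger than a 22 GB card, which is consistent with an OOM even when all model states are offloaded. With a Hugging Face Transformers model, gradient checkpointing is typically enabled by calling `model.gradient_checkpointing_enable()` before the model is handed to the booster; whether that alone is enough for this setup is not confirmed in the thread.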