DeepSpeed ZeRO optimizer: error converting model checkpoints

MatejUlcar opened this issue 2 years ago · 2 comments

Describe the bug
Unable to convert checkpoints of a custom GPT-NeoX model (trained with ZeRO stage 3) using the zero_to_fp32.py script.

To Reproduce
Train a model with ZeRO stage 3, pp=0, mp=1 (I haven't attempted other combinations). Save a checkpoint. Run the zero_to_fp32.py script on the saved checkpoint.
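Concretely, the conversion step is just invoking the copy of zero_to_fp32.py that DeepSpeed writes into the checkpoint directory; a minimal sketch of that call, with placeholder paths (the entry-point name is taken from the tracebacks below):

# Equivalent to running: python zero_to_fp32.py <checkpoint_dir> <output_file>
# Both paths are placeholders for the actual checkpoint layout.
from zero_to_fp32 import convert_zero_chkpt_to_fp32_consolid_state_dict

convert_zero_chkpt_to_fp32_consolid_state_dict(
    "global_step80000",        # placeholder: folder containing the per-rank optimizer state files
    "pytorch_model_fp32.bin",  # consolidated fp32 state dict to write
)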

Environment (please complete the following information):

  • GPUs: training on 4x A100; conversion attempted on: same as training, 4x 2080 Ti, 1x 2080 Ti, CPU
  • Configs: 12-layer GPT, ZeRO stage 3, pp=0, mp=1

Attempted so far

Running the script as-is:

Detected checkpoint of type zero stage 3, world_size: 4
Traceback (most recent call last):
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 151, in <module>
    convert_zero_chkpt_to_fp32_consolid_state_dict(args.checkpoint_dir, args.output_file)
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 121, in convert_zero_chkpt_to_fp32_consolid_state_dict
    state_dict[name] = torch.cat(
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper___cat)

Limiting the environment to 1 GPU:

Traceback (most recent call last):
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 151, in <module>
    convert_zero_chkpt_to_fp32_consolid_state_dict(args.checkpoint_dir, args.output_file)
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 83, in convert_zero_chkpt_to_fp32_consolid_state_dict
    zero_stage, world_size, param_shapes, fp32_flat_groups = parse_optim_states(optim_files)
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 39, in parse_optim_states
    state_dicts.append(torch.load(f))
  File "/home/mulcar/.conda/envs/gptneox/lib/python3.9/site-packages/torch/serialization.py", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/mulcar/.conda/envs/gptneox/lib/python3.9/site-packages/torch/serialization.py", line 882, in _load
    result = unpickler.load()
  File "/home/mulcar/.conda/envs/gptneox/lib/python3.9/site-packages/torch/serialization.py", line 857, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "/home/mulcar/.conda/envs/gptneox/lib/python3.9/site-packages/torch/serialization.py", line 846, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File "/home/mulcar/.conda/envs/gptneox/lib/python3.9/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/home/mulcar/.conda/envs/gptneox/lib/python3.9/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/mulcar/.conda/envs/gptneox/lib/python3.9/site-packages/torch/serialization.py", line 142, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on CUDA device '
RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device.

Adding the map_location argument to torch.load, as the above error suggests, gets past loading but then fails with a size mismatch.
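Concretely, that means changing the torch.load call in parse_optim_states to something like the following (a sketch; mapping everything to the CPU is just one option for the target device):

# Inside parse_optim_states() in the checkpoint's copy of zero_to_fp32.py:
state_dicts.append(torch.load(f, map_location=torch.device("cpu")))  # was: torch.load(f)

With that change: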

Detected checkpoint of type zero stage 3, world_size: 4
Traceback (most recent call last):
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 151, in <module>
    convert_zero_chkpt_to_fp32_consolid_state_dict(args.checkpoint_dir, args.output_file)
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 122, in convert_zero_chkpt_to_fp32_consolid_state_dict
    tuple(fp32_flat_groups[i].narrow(0,
  File "/home/mulcar/gpt-neox/checkpoints-sl-small/./zero_to_fp32.py", line 122, in <genexpr>
    tuple(fp32_flat_groups[i].narrow(0,
RuntimeError: start (27432576) + length (6168576) exceeds dimension size (33570816).
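Re-doing the arithmetic from that message, the requested slice runs roughly 30k elements past the end of the flattened fp32 group:

# Numbers copied from the RuntimeError above.
start, length, dim_size = 27432576, 6168576, 33570816
print(start + length)             # 33601152
print(start + length - dim_size)  # 30336 elements past the end of the flat group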

I have also attempted to convert with DeepSpeed's latest script, but I got an error that it's not a model state checkpoint, so I figure the differences between the current DeeperSpeed and the latest DeepSpeed are too great for it to be of any use. Please advise on how I could salvage the checkpoint, i.e. actually use the model for inference/evaluation, other than training again from scratch.
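For reference, upstream DeepSpeed exposes that same conversion as a library call, which is roughly what its script wraps; a sketch for completeness (it rejects this checkpoint with the error described above, and the path is a placeholder):

# Requires a recent upstream DeepSpeed; shown for reference only,
# since it refuses this DeeperSpeed-era ZeRO stage 3 checkpoint.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoints-sl-small")  # placeholder checkpoint directory
torch.save(state_dict, "pytorch_model_fp32.bin")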

MatejUlcar · Mar 20 '22 15:03

Hmmm. We haven’t really been using ZeRO 3, as our testing indicates that it’s a big step up in complexity for a minimal increase in performance. It’s possible we broke something without realizing it.

I'll have to train a model and play with it a bit. Can you post the config file? Alternatively, would you be able to share the trained model, by any chance? Hopefully we can get this resolved.

StellaAthena · Mar 20 '22 16:03

I've figured it's less bother to train from scratch without ZeRO 3, seeing how the GPUs were under-utilized in the original run. Either way, I can share the trained model: https://drive.google.com/drive/folders/13Z4g4eGFd33yhI2HWj4EM4G6lGw2W24D?usp=sharing The config files are included in the archive; the model config is below:

{
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe-parallel-size": 0,
   "model-parallel-size": 1,

   # model settings
   "num-layers": 12,
   "hidden-size": 768,
   "num-attention-heads": 12,
   "seq-length": 2048,
   "max-position-embeddings": 2048,
   "norm": "layernorm",
   "pos-emb": "rotary",
   "no-weight-tying": true,
    # this should provide some speedup but takes a while to build, set to true if desired
   "scaled-upper-triang-masked-softmax-fusion": false,
   "train-iters": 80000,

   # optimizer settings
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 1.0e-5,
       #"freeze_step": 5000,
       "betas": [0.9, 0.999],
       #"cuda_aware": false,
       #"comm_backend_name": "nccl"
     }
   },

   "zero_optimization": {
    "stage": 3,
    "allgather_partitions": True,
    "allgather_bucket_size": 100000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 100000000,
    "contiguous_gradients": True,
    "cpu_offload": False
  },
  "zero_allow_untested_optimizer": true,

   # batch / data settings
   "train_micro_batch_size_per_gpu": 8,
   "gradient_accumulation_steps": 4,
   "data-impl": "mmap",
   "split": "949,50,1",

   # activation checkpointing
   "checkpoint-activations": true,
   "checkpoint-num-layers": 1,
   "partition-activations": true,
   "synchronize-each-layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight-decay": 0.05,
   "hidden-dropout": 0.1,
   "attention-dropout": 0.1,

   # precision settings
   "fp16": { 
     "enabled": true,
     "loss_scale": 0,
     "loss_scale_window": 500,
     "hysteresis": 2,
     "min_loss_scale": 1,
   },

   # lr decay settings
   "lr-decay-iters": 80000,
   "lr-decay-style": "cosine",
   "warmup": 0.01,
  
   # misc. training settings
   "distributed-backend": "nccl",
   #"save-interval": 10000,
   #"eval-interval": 1000,
   "save-interval": 500,
   "eval-interval": 100,
   "eval-iters": 10,

   # logging
   #"log-interval": 100,
   "log-interval": 10,
   "steps_per_print": 10,
   "keep-last-n-checkpoints": 4,
   "wall_clock_breakdown": true,

  # sparse attention
  #"attention_config": [[["local", "global"], "all"]],
}

MatejUlcar · Apr 06 '22 21:04

We do not currently support ZeRO 3, which seems to be the core source of your issue. Closing for now.

StellaAthena · Sep 18 '22 15:09