
[BUG] DeepSpeed zero_to_fp32.py script ignores some layers while creating FP32 checkpoints from DS ZeRO checkpoints.

Open rohitdwivedula opened this issue 2 years ago • 13 comments

Problem: Converting DeepSpeed ZeRO checkpoints to PyTorch state_dicts with the zero_to_fp32.py script leaves one layer out of the generated state_dict. I'm training a GPT2-like model, and the lm_head (linear layer) is not being included in the converted checkpoint for some reason.

Description

I'm trying to train a GPT2-like model using DeepSpeed (and some code from huggingface/transformers). The model looks something like this:

CustomGPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=50257, bias=False)
)

During training, every 0.2 epochs I save the model by calling model_engine.save_checkpoint(savedir) on all ranks, which is exactly what the DeepSpeed documentation tells users to do. The generated checkpoint directory looks something like this:

├── global_step12
│   ├── mp_rank_00_model_states.pt
│   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
│   ├── zero_pp_rank_1_mp_rank_00_optim_states.pt
│   ├── zero_pp_rank_2_mp_rank_00_optim_states.pt
│   └── zero_pp_rank_3_mp_rank_00_optim_states.pt
├── latest
└── zero_to_fp32.py
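
For context, the saving step described above looks roughly like this inside the training loop (a minimal sketch; model_engine is the engine returned by deepspeed.initialize, while data_loader, save_interval, savedir, and the loss computation are placeholders):

# model_engine comes from deepspeed.initialize(...) earlier in the script
for step, batch in enumerate(data_loader):        # data_loader: placeholder training loader
    loss = model_engine(batch)                    # placeholder: forward pass that returns the loss
    model_engine.backward(loss)                   # DeepSpeed handles scaling and gradient partitioning
    model_engine.step()                           # optimizer step (and LR schedule, if configured)
    if step % save_interval == 0:                 # stand-in for "every 0.2 epochs"
        # save_checkpoint must be called on every rank: each rank writes its own
        # zero_pp_rank_*_optim_states.pt shard, while the model states file is written once
        model_engine.save_checkpoint(savedir)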

After the training is done, I try running:

python zero_to_fp32.py . pytorch_model.bin

Next, when I load this checkpoint in PyTorch, the weights for the lm_head layer are missing:

>>> import torch
>>> state_dict = torch.load("pytorch_model.bin")
>>> state_dict.keys()
odict_keys(['transformer.h.0.attn.bias', 'transformer.h.0.attn.masked_bias', 'transformer.h.1.attn.bias', 'transformer.h.1.attn.masked_bias', 'transformer.wte.weight', 'transformer.wpe.weight', 'transformer.h.0.ln_1.weight', 'transformer.h.0.ln_1.bias', 'transformer.h.0.attn.c_attn.weight', 'transformer.h.0.attn.c_attn.bias', 'transformer.h.0.attn.c_proj.weight', 'transformer.h.0.attn.c_proj.bias', 'transformer.h.0.ln_2.weight', 'transformer.h.0.ln_2.bias', 'transformer.h.0.mlp.c_fc.weight', 'transformer.h.0.mlp.c_fc.bias', 'transformer.h.0.mlp.c_proj.weight', 'transformer.h.0.mlp.c_proj.bias', 'transformer.h.1.ln_1.weight', 'transformer.h.1.ln_1.bias', 'transformer.h.1.attn.c_attn.weight', 'transformer.h.1.attn.c_attn.bias', 'transformer.h.1.attn.c_proj.weight', 'transformer.h.1.attn.c_proj.bias', 'transformer.h.1.ln_2.weight', 'transformer.h.1.ln_2.bias', 'transformer.h.1.mlp.c_fc.weight', 'transformer.h.1.mlp.c_fc.bias', 'transformer.h.1.mlp.c_proj.weight', 'transformer.h.1.mlp.c_proj.bias', 'transformer.ln_f.weight', 'transformer.ln_f.bias'])

From the above code snippet, we see that the lm_head layer is missing from the keys in the state_dict generated by using zero_to_fp32.py.
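
A quick way to confirm exactly which parameters were dropped is to diff the keys in the raw DeepSpeed model-states file against the converted state_dict (a small sketch based on the directory layout shown above; the global_step12 tag will differ per run):

import torch

# Keys the DeepSpeed engine saved (full module state_dict)
ds_keys = set(torch.load("global_step12/mp_rank_00_model_states.pt")["module"].keys())
# Keys in the FP32 state_dict produced by zero_to_fp32.py
fp32_keys = set(torch.load("pytorch_model.bin").keys())
print(ds_keys - fp32_keys)   # expected to show {'lm_head.weight'}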

More interesting info

Now, when a job runs on N GPUs, there appear to be N+1 checkpoint files (this example was run on 4 GPUs):

  • mp_rank_00_model_states.pt
  • zero_pp_rank_0_mp_rank_00_optim_states.pt
  • zero_pp_rank_1_mp_rank_00_optim_states.pt
  • zero_pp_rank_2_mp_rank_00_optim_states.pt
  • zero_pp_rank_3_mp_rank_00_optim_states.pt

Loading the first file and inspecting its contents shows that the lm_head weights are present in the DeepSpeed checkpoint, which suggests that something is wrong with the zero_to_fp32.py script.

>>> import torch
>>> model_states = torch.load("mp_rank_00_model_states.pt")
>>> "lm_head.weight" in model_states['module'].keys()
True

Reproducing this bug

  1. A minimal working example of this problem can be found in this GitHub Gist.
  2. Download the .py and .json files from the gist and install the Python dependencies from dependencies.txt.
  3. Run deepspeed minimal_reproducible_ds.py.
  4. Checkpoints are saved within a results_* directory. Navigate into the directory and run the zero_to_fp32.py script.
  5. The generated state_dict/.bin file will not contain the weights of the lm_head.

Expected Behavior

The state_dict created by DeepSpeed's zero_to_fp32.py script should contain the weights of every layer; no layers should be omitted.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/gandiva/rohitd/.venv/lib/python3.6/site-packages/torch']
torch version .................... 1.10.2+cu102
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed install path ........... ['/home/gandiva/rohitd/.venv/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.6.1, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.10, cuda 10.2, hip 0.0

System info

  • OS: Ubuntu 18.04.6 LTS
  • 4 * 16GB v100 GPUs
  • Python 3.6.9

Additional information

I understand that there is already a HuggingFace + DeepSpeed integration if you use HuggingFace's Trainer class. However, all of HuggingFace's modules (including the GPT2LMHead class used here) are just subclasses of torch.nn.Module, which, I would assume, means they should also work in this code sample (which uses the DeepSpeed API directly).

rohitdwivedula avatar Apr 19 '22 07:04 rohitdwivedula

Hi @rohitdwivedula @tjruwase, I ran into a similar problem and I think I have found where it comes from.

This kind of problem can appear when parameter sharing (weight tying) is used.

The missing lm_head.weight shares its parameter with the word embedding weight ('transformer.wte.weight' here), a commonly used trick in NLP. So there is actually only one real parameter, held by 'transformer.wte.weight'; lm_head.weight merely holds a reference to it.

When converting a checkpoint, DeepSpeed recovers the fp32 parameters from the optimizer state, which holds only the true parameters, so the parameters that merely hold references are not recovered.

The simplest workaround for now is to load the converted checkpoint and manually set the missing parameters back to the weights they are tied to, e.g. ckpt['state_dict']['lm_head.weight'] = ckpt['state_dict']['transformer.wte.weight']
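
Applied to the flat state_dict from the original report (the pytorch_model.bin produced by zero_to_fp32.py has no 'state_dict' wrapper), the patch might look like this; the output filename is an assumption:

import torch

state_dict = torch.load("pytorch_model.bin")
# lm_head.weight is tied to the word embedding, so restore the reference
# that the conversion dropped.
state_dict["lm_head.weight"] = state_dict["transformer.wte.weight"]
torch.save(state_dict, "pytorch_model_patched.bin")   # hypothetical output path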

richarddwang avatar May 12 '22 02:05 richarddwang

That does seem to explain this behavior, @richarddwang. I checked the base HuggingFace/OpenAI checkpoints (as well as some of my own checkpoints from smaller models trained without DeepSpeed) to see whether the two layers are indeed the same, and it looks like they are:

>>> import torch
>>> from transformers import GPT2LMHeadModel
>>> model = GPT2LMHeadModel.from_pretrained('gpt2')
>>> torch.allclose(model.lm_head.state_dict()['weight'], model.transformer.wte.state_dict()['weight'])
True

Thanks a lot for the help!

rohitdwivedula avatar May 12 '22 07:05 rohitdwivedula

@richarddwang, thanks for sharing your great analysis and workaround. It seems we need to extend zero checkpointing to better handle parameter sharing.

tjruwase avatar May 16 '22 12:05 tjruwase

Are there any updates or solutions for this now?

ZeyiLiao avatar Sep 27 '22 21:09 ZeyiLiao

@ZeyiLiao, apologies we have not had bandwidth to work on this.

tjruwase avatar Sep 27 '22 22:09 tjruwase

@tjruwase, yeah, totally understood. So do we have any ad-hoc solutions for the error: Missing key(s) in state_dict: "encoder.embed_tokens.weight", "decoder.embed_tokens.weight", "lm_head.weight"?

I tried the workaround from @richarddwang's comment above (manually setting the missing tied weights back, e.g. ckpt['state_dict']['lm_head.weight'] = ckpt['state_dict']['transformer.wte.weight']), but I cannot even find transformer.wte.weight in the checkpoint after using https://pytorch-lightning.readthedocs.io/en/stable/advanced/model_parallel.html#collating-single-file-checkpoint-for-deepspeed-zero-stage-3

ZeyiLiao avatar Sep 27 '22 22:09 ZeyiLiao

Also experiencing this issue with Stage 2 training and PyTorch Lightning. A large number of keys are missing when converting back from the DeepSpeed checkpoint, though I can see them when using @rohitdwivedula's suggestion.

alexanderswerdlow avatar Oct 27 '22 01:10 alexanderswerdlow

I think this issue needs revisiting, @tjruwase. This is very much needed for a lot of transformer models.

mayank31398 avatar Feb 25 '23 15:02 mayank31398

I'm experiencing a similar issue where calling zero_to_fp32.py after training a partially frozen model results in all frozen layers being dropped.

While a workaround similar to @richarddwang's could work, this behavior is rather unintuitive imho.
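
One possible patch here, assuming ZeRO stage 1/2 where mp_rank_00_model_states.pt still contains the full module state_dict (as shown earlier in this thread), is to backfill the dropped keys from that file; the paths and output name below are placeholders:

import torch

converted = torch.load("pytorch_model.bin")                    # output of zero_to_fp32.py
raw = torch.load("global_step12/mp_rank_00_model_states.pt")   # raw DeepSpeed model states (tag will differ)
for name, tensor in raw["module"].items():
    if name not in converted:
        # frozen and tied parameters have no optimizer state, so the conversion
        # skips them; copy them over from the model-states file instead
        converted[name] = tensor.float()
torch.save(converted, "pytorch_model_patched.bin")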

linhdvu14 avatar Mar 06 '23 05:03 linhdvu14

Apologies for the delay on this. We are revisiting this issue.

tjruwase avatar Mar 06 '23 11:03 tjruwase

I encountered a similar issue. I continued training a bloom-560m model and converted a saved checkpoint with zero_to_fp32.py. But when I tried to reload the converted checkpoint without DeepSpeed:

import torch
from transformers import BloomForCausalLM, BloomConfig

configuration = BloomConfig.from_pretrained('/apdcephfs/share_916081/tingchenfu/PLM/bloom-560m')
model = BloomForCausalLM(configuration)
model.load_state_dict(torch.load('/apdcephfs/share_916081/tingchenfu/work1/dump/debug/8step_ckpt/pytorch_model.bin'), strict=True)

not only is the lm_head missing, but there is also a parameter shape mismatch:

Traceback (most recent call last):
  File "./code/foo.py", line 14, in <module>  model.load_state_dict(torch.load('/apdcephfs/share_916081/tingchenfu/work1/dump/debug/8step_ckpt/pytorch_model.bin'),strict=True)
  File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1482, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for BloomForCausalLM:
        Missing key(s) in state_dict: "lm_head.weight". 
        size mismatch for transformer.word_embeddings.weight: copying a param with shape torch.Size([250680, 1024]) from checkpoint, the shape in current model is torch.Size([250880, 1024]).

I tried to inspect the actual parameter shapes saved in the checkpoint with:

import torch
model_state = torch.load('/apdcephfs/share_916081/tingchenfu/work1/dump/debug/8step_ckpt/pytorch_model/zero_pp_rank_0_mp_rank_00_model_states.pt')
print(model_state['module']['transformer.word_embeddings.weight'].shape)

But I only get torch.Size([0]). This is not surprising, and I guess it is due to some internal ZeRO mechanism for saving GPU and CPU memory. My question is: where and how can I find the embedding weights of the missing 200 tokens (250680 vs. 250880)?

TingchenFu avatar Mar 12 '23 06:03 TingchenFu

@TingchenFu The size mismatch looks a bit weird to me; I have not seen that before. The following is how I load it. It's a bit unclean, but it works fine:

# tied weights are not loaded by DeepSpeed using the above method https://github.com/microsoft/DeepSpeed/issues/1896
if self.model_name.startswith("bigscience/bloom"):
    state["model.lm_head.weight"] = state["model.transformer.word_embeddings.weight"]
elif self.model_name.startswith("google/flan"):
    state["model.encoder.embed_tokens.weight"] = state["model.shared.weight"]
    state["model.decoder.embed_tokens.weight"] = state["model.shared.weight"]

self.load_state_dict(state)

or you can do:

self.load_state_dict(state, strict=False)
self.model.tie_weights()

I have personally not tried the second solution yet. But I have a feeling it should work. tie_weights() is an internal HF transformers method.
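
For reference, the second option spelled out with plain transformers (untested, as noted above; the checkpoint path is a placeholder) might look like:

import torch
from transformers import BloomForCausalLM, BloomConfig

config = BloomConfig.from_pretrained("bigscience/bloom-560m")
model = BloomForCausalLM(config)
state = torch.load("pytorch_model.bin")        # placeholder: converted checkpoint
model.load_state_dict(state, strict=False)     # lm_head.weight will be reported as missing
model.tie_weights()                            # re-tie lm_head to the word embeddings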

mayank31398 avatar Mar 12 '23 07:03 mayank31398

Thanks! @mayank31398
Sorry for the late response. I just tried your recipe and it works:

import torch
from transformers import BloomForCausalLM,BloomConfig
configuration = BloomConfig.from_pretrained('/apdcephfs/share_916081/tingchenfu/PLM/bloom-560m')
model = BloomForCausalLM(configuration)
reloaded = torch.load('/apdcephfs/share_916081/tingchenfu/work1/dump/debug/16step_ckpt/pytorch_model.bin')
reloaded["lm_head.weight"] = reloaded["transformer.word_embeddings.weight"]
model.load_state_dict(reloaded,strict=True)
print("ok")

TingchenFu avatar Mar 14 '23 05:03 TingchenFu

Hi @TingchenFu, @mayank31398, @linhdvu14, @ZeyiLiao, @alexanderswerdlow, @rohitdwivedula, could you please try this PR to see if it fixes this issue.

ShijieZZZZ avatar Mar 17 '23 18:03 ShijieZZZZ

I used zero_to_fp32.py to save float32 LoRA weights, but the results are wrong.

However, using model.save_pretrained from HF with float16, the results are correct (judging by generation ability).

Does anybody know why?

I have tried converting the fp32 weights to fp16, but I still do not get normal results.

lucasjinreal avatar Jun 20 '23 03:06 lucasjinreal