
[WIP] [fp32 checkpoint] very early experiments with extracting fp32 params


I just started to look at how to adapt zero_to_fp32.py to extract fp32 weights from optimizer states.
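
For the record, the rough idea (a minimal sketch, not zero_to_fp32.py's actual logic; the file glob and the `single_partition_of_fp32_groups` key are assumptions based on the ZeRO-2 checkpoint layout at the time of writing and may differ between DeepSpeed versions):

```python
import glob
import torch

# Hedged sketch: under ZeRO-2, each rank's *_optim_states.pt file holds a
# flat fp32 master copy of that rank's parameter partition. The key names
# below are assumptions, not a stable DeepSpeed API.
ckpt_dir = "checkpoints/gpt2/global_step100"
flat_partitions = []
for path in sorted(glob.glob(f"{ckpt_dir}/*_optim_states.pt")):
    osd = torch.load(path, map_location="cpu")["optimizer_state_dict"]
    # one flat 1-D fp32 tensor per param group on this rank
    flat_partitions.extend(osd["single_partition_of_fp32_groups"])
```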

I will park this for now, since it was said today that fp16 weights are sufficient, but if we need to resume down the road, this PR will already provide a good starting point.

```bash
# convert fp32
./zero_to_fp32.py checkpoints/gpt2/global_step100 checkpoints/gpt2/pytorch.bin

# convert fp16
PYTHONPATH=/hf/Megatron-DeepSpeed-master:/hf/Megatron-DeepSpeed-microsoft python tools/convert_checkpoint/deepspeed_to_transformers.py \
--input_folder /hf/Megatron-DeepSpeed-master/checkpoints/gpt2/global_step100 \
--output_folder /hf/Megatron-DeepSpeed-master/checkpoints/gpt2/
```

This happened to be a PP=2/TP=1 checkpoint (with pipeline parallelism, parameters are keyed by the layer's index in the flattened pipeline module, hence the `3.` prefixes below).

Then compare the two state dicts:

```python
import torch
from pprint import pprint

# fp32 state dict produced by zero_to_fp32.py
f1 = "/hf/Megatron-DeepSpeed-master/checkpoints/gpt2/pytorch.bin"
# fp16 state dict produced by deepspeed_to_transformers.py
f2 = "/hf/Megatron-DeepSpeed-master/checkpoints/gpt2/pytorch_model.bin"
sd1 = torch.load(f1)
sd2 = torch.load(f2)

len(sd1.keys())  # 14
len(sd2.keys())  # 33

pprint(list(sd1.keys()))
pprint(list(sd2.keys()))
```

The first pprint gives sd1's 14 keys, the second sd2's 33 keys:

```
['tied_modules.embed.word_embeddings.weight',
 'tied_modules.embed.position_embeddings.weight',
 '3.input_layernorm.weight',
 '3.input_layernorm.bias',
 '3.self_attention.query_key_value.weight',
 '3.self_attention.query_key_value.bias',
 '3.self_attention.dense.weight',
 '3.self_attention.dense.bias',
 '3.post_attention_layernorm.weight',
 '3.post_attention_layernorm.bias',
 '3.mlp.dense_h_to_4h.weight',
 '3.mlp.dense_h_to_4h.bias',
 '3.mlp.dense_4h_to_h.weight',
 '3.mlp.dense_4h_to_h.bias']
['transformer.wte.weight',
 'transformer.wpe.weight',
 'transformer.h.0.ln_1.weight',
 'transformer.h.0.ln_1.bias',
 'transformer.h.0.attn.bias',
 'transformer.h.0.attn.masked_bias',
 'transformer.h.0.attn.c_attn.weight',
 'transformer.h.0.attn.c_attn.bias',
 'transformer.h.0.attn.c_proj.weight',
 'transformer.h.0.attn.c_proj.bias',
 'transformer.h.0.ln_2.weight',
 'transformer.h.0.ln_2.bias',
 'transformer.h.0.mlp.c_fc.weight',
 'transformer.h.0.mlp.c_fc.bias',
 'transformer.h.0.mlp.c_proj.weight',
 'transformer.h.0.mlp.c_proj.bias',
 'transformer.h.1.ln_1.weight',
 'transformer.h.1.ln_1.bias',
 'transformer.h.1.attn.bias',
 'transformer.h.1.attn.masked_bias',
 'transformer.h.1.attn.c_attn.weight',
 'transformer.h.1.attn.c_attn.bias',
 'transformer.h.1.attn.c_proj.weight',
 'transformer.h.1.attn.c_proj.bias',
 'transformer.h.1.ln_2.weight',
 'transformer.h.1.ln_2.bias',
 'transformer.h.1.mlp.c_fc.weight',
 'transformer.h.1.mlp.c_fc.bias',
 'transformer.h.1.mlp.c_proj.weight',
 'transformer.h.1.mlp.c_proj.bias',
 'transformer.ln_f.weight',
 'transformer.ln_f.bias',
 'lm_head.weight']
```
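
The renaming between the two schemes is mostly mechanical. Here is a hedged sketch of the mapping for the key pairs above (not the logic of deepspeed_to_transformers.py; the layer-index offset of 3 is an inference specific to this pipeline layout, and since HF GPT-2 uses `Conv1D`, the attention/MLP weights would also need transposing, which is omitted here):

```python
import re

# Hypothetical rename helper for the key pairs listed above.
SUFFIX_MAP = {
    "input_layernorm": "ln_1",
    "self_attention.query_key_value": "attn.c_attn",
    "self_attention.dense": "attn.c_proj",
    "post_attention_layernorm": "ln_2",
    "mlp.dense_h_to_4h": "mlp.c_fc",
    "mlp.dense_4h_to_h": "mlp.c_proj",
}

def rename(key: str) -> str:
    if key == "tied_modules.embed.word_embeddings.weight":
        return "transformer.wte.weight"
    if key == "tied_modules.embed.position_embeddings.weight":
        return "transformer.wpe.weight"
    m = re.match(r"^(\d+)\.(.+)\.(weight|bias)$", key)
    if m:
        idx, middle, kind = m.groups()
        # assumption: pipeline indices 0-2 hold the embedding/tied modules,
        # so transformer layer i lives at pipeline index i + 3
        return f"transformer.h.{int(idx) - 3}.{SUFFIX_MAP[middle]}.{kind}"
    return key
```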

You get very different results for sd1 with a TP=2/PP=1 checkpoint.

So we have to work out two different cases: TP>1 and PP>1.
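
For the TP>1 case, the standard approach under Megatron's column/row-parallel conventions would be roughly the following (a sketch only; the key substrings are guesses based on the names above, and fused QKV layouts can complicate the simple concat):

```python
import torch

def merge_tp_shards(shards: list, key: str) -> torch.Tensor:
    """Merge one parameter's tensor-parallel shards. Hedged sketch based on
    Megatron's column/row-parallel conventions; verify per parameter."""
    if key.endswith("bias") and ("dense.bias" in key or "4h_to_h.bias" in key):
        return shards[0]  # row-parallel biases are replicated across ranks
    if any(s in key for s in ("query_key_value", "dense_h_to_4h", "word_embeddings")):
        return torch.cat(shards, dim=0)  # column-parallel: split on output dim
    if any(s in key for s in ("self_attention.dense", "dense_4h_to_h")):
        return torch.cat(shards, dim=1)  # row-parallel: split on input dim
    return shards[0]  # layernorms etc. are replicated across TP ranks
```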

Z1 uses the same partitioning as Z2, so we can already reuse a lot of the original code I wrote for Z2.
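
Concretely, once the per-rank flat fp32 partitions are gathered (as in the sketch at the top), the Z1/Z2 reassembly is the same: concatenate the partitions and slice by each parameter's shape. A minimal sketch, assuming registration order and ignoring the alignment padding that the real code has to strip:

```python
import math
import torch

def unflatten(flat_partitions, param_shapes):
    """Rebuild named fp32 tensors from concatenated ZeRO-1/2 partitions.
    Assumes params appear in registration order with no padding between
    them; real ZeRO checkpoints need padding bookkeeping on top of this."""
    flat = torch.cat(flat_partitions)
    state_dict, offset = {}, 0
    for name, shape in param_shapes.items():
        numel = math.prod(shape)
        state_dict[name] = flat[offset:offset + numel].view(shape).clone()
        offset += numel
    return state_dict
```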

So I stopped here for now.

If someone would like to work on this, I think it'd be a useful feature to have.

stas00 (Sep 21, 2021)