[BUG] Can't continue training from GPT-345M checkpoint with TransformerEngine - RuntimeError: Error(s) in loading state_dict for ParallelTransformer
Describe the bug
While running examples/pretrain_gpt.sh from the GPT-345M checkpoint, I encounter the following error:
[rank0]: RuntimeError: Error(s) in loading state_dict for ParallelTransformer:
[rank0]: Missing key(s) in state_dict: "layers.0.self_attention.layernorm_qkv.layer_norm_weight", ...
[rank0]: Unexpected key(s) in state_dict: "layers.0.input_norm.weight", ...
To Reproduce
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
unzip megatron_lm_345m_v0.0.zip
Run examples/pretrain_gpt.sh. The --attention-softmax-in-fp32 argument is added (the script does not work otherwise).
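For reference, the run boils down to an invocation roughly like this (a sketch with placeholder vocab/data paths, not the exact contents of the script):
torchrun --nproc_per_node=1 pretrain_gpt.py \
    --num-layers 24 --hidden-size 1024 --num-attention-heads 16 \
    --seq-length 1024 --max-position-embeddings 1024 \
    --micro-batch-size 4 --lr 0.00015 --train-iters 500000 \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file gpt2-vocab.json --merge-file gpt2-merges.txt \
    --data-path my-gpt2_text_document \
    --attention-softmax-in-fp32 \
    --save checkpoints/gpt_345m \
    --load megatron_lm_345m_v0.0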
I also tried a Llama 2 checkpoint; the error is similar.
However, the script runs successfully:
- From scratch, and it continues running from the locally created checkpoint.
- With --transformer-impl local from the provided GPT-345M checkpoint, but that implementation is deprecated and, per my understanding, will not work with Llama models.
Expected behavior
examples/pretrain_gpt.sh should run fine from the GPT-345M checkpoint on the latest release without any modifications.
Stack trace/logs: https://gist.github.com/arktoswb/7830a87d514fd53cdad17882128d5122
Environment:
- Megatron-LM commit ID https://github.com/nvidia/Megatron-LM/commit/c3677e09aa4e2eec37048307bd795928b8f8324a
- PyTorch version
$ python -c "import torch; print(torch.__version__)"
2.3.0+cu121
- CUDA version
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
- NCCL version
$ python -c "import torch; print(torch.cuda.nccl.version())"
(2, 20, 5)
- TransformerEngine
I have tried both stable and release_v1.1 (related: https://github.com/NVIDIA/Megatron-LM/issues/577)
The same error occurs inside the Docker container PyTorch Release 23.04: https://gist.github.com/arktoswb/d5835a666e7fcf9bfa3d7ff59173299c
Apparently, TransformerEngine is supported with model_type = 'mcore'. So, in order to continue training from the GPT-345M checkpoint:
- Convert:
python3 tools/checkpoint/convert.py --model-type GPT --loader megatron --saver megatron
- Run training with --use-mcore-models (see the sketch below)
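For example (a sketch; the directory names are placeholders, and the extra training flags go into the arguments of examples/pretrain_gpt.sh):
# 1. Convert the downloaded legacy checkpoint:
python3 tools/checkpoint/convert.py \
    --model-type GPT \
    --loader megatron \
    --saver megatron \
    --load-dir megatron_lm_345m_v0.0 \
    --save-dir checkpoints/gpt_345m_converted
# 2. Continue training from the converted checkpoint by adding:
#    --use-mcore-models --load checkpoints/gpt_345m_converted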
I will close this issue, but I suggest editing the example scripts and README to make this clearer.
Hello, thanks for suggesting convert.py, but I still have some problems. Could you please help me take a look at the issue I encountered when using it?
Here is my command:
python3 tools/checkpoint/convert.py --model-type GPT --loader megatron --saver megatron --load-dir models/megatron_lm_345m_v0.0 --save-dir models/convert/gpt2 --megatron-path /home/zyz/code/Megatron-LM-core_v0.6.0
Which reports:
File "/home/zyz/code/Megatron-LM-core_v0.6.0/tools/checkpoint/loader_megatron.py", line 70, in _load_checkpoint
    margs, checkpoint_args = load_args_from_checkpoint(margs, exit_on_missing_checkpoint=True)
TypeError: cannot unpack non-iterable Namespace object
Can you show me all the code you are using here?
Thanks again!
Yeah, there are multiple problems with that:
- --loader megatron: you are loading a legacy Megatron model
- --saver megatron: you are saving a legacy Megatron model
A Megatron Core model is saved with --saver mcore (see the example below).
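For example, your command with the saver changed (same paths as in your message; a sketch I have not re-run):
python3 tools/checkpoint/convert.py \
    --model-type GPT \
    --loader megatron \
    --saver mcore \
    --load-dir models/megatron_lm_345m_v0.0 \
    --save-dir models/convert/gpt2 \
    --megatron-path /home/zyz/code/Megatron-LM-core_v0.6.0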
As for loading, you can make edits to the code and hardcode several model parameters to make it work:
- I believe the error in question occurs because load_args_from_checkpoint returns just args under some conditions instead of returning args, checkpoint_args. You can bypass it by returning args, args.
- Once that is fixed, you will hit more errors because the loader does not know all the parameters of the model that it needs. Add them:
# Fallback values hardcoded for the 345M GPT-2 model (24 layers, hidden size 1024, 16 heads)
check_for_arg('num_layers', 24)
check_for_arg('hidden_size', 1024)
check_for_arg('num_attention_heads', 16)
check_for_arg('max_position_embeddings', 1024)
check_for_arg('seq_length', 1024)
check_for_arg('tokenizer_type', 'GPT2BPETokenizer')
# Validate margs.
margs = validate_args(margs)
margs.use_mcore_models = False
margs.transformer_impl = args.loader_transformer_impl
check_for_arg('tensor_model_parallel_size')
check_for_arg('pipeline_model_parallel_size')
check_for_arg('position_embedding_type')
check_for_arg('iteration', 666)
check_for_arg('padded_vocab_size', 50304)
check_for_arg('bert_binary_head')
check_for_arg('disable_bias_linear', False)
check_for_arg('params_dtype')
check_for_arg('swiglu', False)
But honestly, you will have an easier experience loading the Llama 2 7B model. NeMo is an even better experience: it is in better shape and also supports Llama 3.
Thank you very much for your response, which has been very helpful.
Hi @arktoswb, does convert.py support converting a pretrained checkpoint with PP=1, TP=1 to PP>1 and/or TP>1? I want to finetune Mamba 8B from a pretrained checkpoint, but one GPU cannot meet the memory requirements.
Yes, I believe it does.
I stopped working with Megatron months ago, so I am not the best person to ask this question.
Thanks for replying! I did not find a flag in convert.py to specify PP and TP. Do you have any clues on this, or do you know anyone who might?
From https://github.com/NVIDIA/Megatron-LM?tab=readme-ov-file#evaluation-and-tasks: use the flags --target-tensor-parallel-size and --target-pipeline-parallel-size.
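For example, something along these lines for a GPT checkpoint (placeholder paths and target sizes; I have not tried this with Mamba):
python3 tools/checkpoint/convert.py \
    --model-type GPT \
    --loader megatron \
    --saver mcore \
    --load-dir checkpoints/model_tp1_pp1 \
    --save-dir checkpoints/model_tp2_pp2 \
    --target-tensor-parallel-size 2 \
    --target-pipeline-parallel-size 2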
Thanks! It took some effort to solve the import error, but in the end I encountered another error that says, for all layers:
RuntimeError: Error(s) in loading state_dict for ParallelTransformer:
Missing key(s) in state_dict: "layers.0.input_norm.weight", "layers.0.self_attention.query_key_value.weight", "layers.0.self_attention.dense.wei ...
I think the problem is with how to enable the distributed model for Mamba specifically. I opened another GitHub issue on this.
But thanks for directing me to convert.py!
Update: convert.py does not support Mamba at the moment, but hybrid_conversion.py does.
Have you ever tried loading an HF model and continuing to train it?