
[BUG]"Unexpected key(s) in state_dict" while loading Llama-megatron checkpoint.

Open mxjmtxrm opened this issue 1 year ago • 4 comments

Hi, I tried to finetune the Llama2-7b-chat model using Megatron. I downloaded the HF checkpoint and converted it to a GPT Megatron checkpoint following https://github.com/NVIDIA/Megatron-LM/blob/fe1640a3cc4866e015bfdb6449f0d1943d2243cb/docs/llama_mistral.md?plain=1#L73. The command I used is:

python tools/checkpoint/convert.py \
    --model-type GPT \
    --loader llama_mistral \
    --saver megatron \
    --target-tensor-parallel-size 1 \
    --checkpoint-type hf \
    --model-size llama2-7Bf \
    --load-dir Llama-2-7b-chat-hf \
    --save-dir ./Llama-2-7b-chat-pp1 \
    --tokenizer-model Llama-2-7b-chat-hf/tokenizer.model

Then I tried to train the Llama model:

#!/bin/bash

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE 
    --nnodes $NUM_NODES 
    --master_addr $MASTER_ADDR 
    --master_port $MASTER_PORT
)

GPT_MODEL_ARGS=(
    --num-layers ${NUM_LAYERS} 
    --hidden-size ${HIDDEN_SIZE} 
    --num-attention-heads ${NUM_HEAD} 
    --ffn-hidden-size ${FFN_HIDDEN_SIZE} 
    --position-embedding-type rope 
    --max-position-embeddings ${MAX_POSITION_EMBEDDINGS} 
    --seq-length 4096 
    --max-position-embeddings 4096 
)

TRAINING_ARGS=(
    --micro-batch-size 1 
    --global-batch-size 32 
    --train-iters 50 
    --weight-decay 0.1 
    --adam-beta1 0.9 
    --adam-beta2 0.95 
    --init-method-std 0.006 
    --clip-grad 1.0 
    --bf16
    --lr 6.0e-5 
    --lr-decay-style cosine 
    --min-lr 6.0e-6
    --lr-warmup-fraction .001 
    --lr-decay-iters 30 
    --no-load-rng 
    --no-load-optim
    --exit-on-missing-checkpoint
    --use-checkpoint-args 
    --untie-embeddings-and-output-weights 
    --use-rotary-position-embeddings
    --use-flash-attn 
    --no-position-embedding
    --no-masked-softmax-fusion
    --attention-softmax-in-fp32
)

MODEL_PARALLEL_ARGS=(
	--tensor-model-parallel-size 1
	--pipeline-model-parallel-size 1
)

DATA_ARGS=(
    --data-path $DATA_PATH 
    --split 949,50,1
    --tokenizer-model ${TOKENIZER_PATH}
    --data-cache-path ./data_cache 
    --tokenizer-type Llama2Tokenizer
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 1
    --save-interval 5 
    --eval-interval 5 
    --save $CHECKPOINT_PATH 
    --load $CHECKPOINT_PATH 
    --eval-iters 10
    --tensorboard-dir $TENSORBOARD_LOGS_PATH 
)

torchrun ${DISTRIBUTED_ARGS[@]} pretrain_gpt.py \
    ${GPT_MODEL_ARGS[@]} \
    ${TRAINING_ARGS[@]} \
    ${MODEL_PARALLEL_ARGS[@]} \
    ${DATA_ARGS[@]} \
    ${EVAL_AND_LOGGING_ARGS[@]}

I ran into the following error:

RuntimeError: Error(s) in loading state_dict for GPTModel:
	Missing key(s) in state_dict: "embedding.word_embeddings.weight", "decoder.layers.0.self_attention.linear_proj.weight",...
Unexpected key(s) in state_dict: "language_model".

How can I solve this problem?

mxjmtxrm avatar Sep 11 '24 09:09 mxjmtxrm

@mxjmtxrm, our instructions could be clearer in these docs regarding the compatibility between the converter's --saver arg and the training model format. There are two model formats, legacy (a.k.a. 'megatron') and mcore. In the docs and in your command above, --saver megatron saves to the legacy format, but during training the default format is mcore unless otherwise specified. There are two options for your issue:

  • If possible for your use case, save to the newer mcore format by specifying --saver mcore during conversion (see the example conversion command after this list).
  • If you are tied to using the legacy/megatron format, then in your training script, you must add the arg --use-legacy-models to train using the legacy format (rather than the default mcore format).
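
For reference, a sketch of the first option based on the conversion command in the issue description; only --saver changes, and the -mcore suffix on --save-dir is just an illustrative name to keep the two formats apart:

python tools/checkpoint/convert.py \
    --model-type GPT \
    --loader llama_mistral \
    --saver mcore \
    --target-tensor-parallel-size 1 \
    --checkpoint-type hf \
    --model-size llama2-7Bf \
    --load-dir Llama-2-7b-chat-hf \
    --save-dir ./Llama-2-7b-chat-pp1-mcore \
    --tokenizer-model Llama-2-7b-chat-hf/tokenizer.model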

Let me know if you have any questions.

lmcafee-nvidia avatar Sep 18 '24 18:09 lmcafee-nvidia

When I reproduced this issue and added --use-legacy-models to the script, the following error occurred:

File "/workspace/Megatron-LM/megatron/training/arguments.py", line 576, in validate_args
    raise RuntimeError('--use-dist-ckpt is not supported in legacy models.')
RuntimeError: --use-dist-ckpt is not supported in legacy models.

carolove avatar Sep 27 '24 07:09 carolove

This error occurred because the default value for --ckpt-format is torch_dist, i.e. the distributed checkpoint format, which is incompatible with the legacy model format. Again, there are two options to fix this:

  • During conversion, does the use of --saver mcore work for your use case? If so, convert your model again with that arg, and then training should run fine.
  • Again, if you are indeed tied to using the legacy model format (i.e., must convert using --saver megatron), then try using both of these args when launching training (see the sketch after this list):
    • --use-legacy-models
    • --ckpt-format torch
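
For reference, a minimal sketch of the second option, appended to the TRAINING_ARGS array from the launch script in the issue description (everything else stays as posted):

TRAINING_ARGS+=(
    --use-legacy-models    # match the legacy format produced by --saver megatron
    --ckpt-format torch    # the default torch_dist format is incompatible with legacy models
)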

lmcafee-nvidia avatar Sep 27 '24 18:09 lmcafee-nvidia

OK, thank you for the reply. I will test it next Monday.

carolove avatar Sep 28 '24 03:09 carolove

@lmcafee-nvidia I tried to use --saver mcore, but it hit another error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/megatron/tools/checkpoint/saver_mcore.py", line 555, in save_checkpoint
    model = get_local_model(0, ep_rank, tp_rank)
  File "/workspace/megatron/tools/checkpoint/saver_mcore.py", line 548, in get_local_model
    models[pp_rank][ep_rank][tp_rank] = model_provider(pre_process, post_process).to(md.params_dtype)
  File "/workspace/megatron/pretrain_gpt.py", line 96, in model_provider
    model = GPTModel(
  File "/workspace/megatron/megatron/core/models/gpt/gpt_model.py", line 120, in __init__
    self.decoder = TransformerBlock(
  File "/workspace/megatron/megatron/core/transformer/transformer_block.py", line 204, in __init__
    get_cpu_offload_context(
  File "/workspace/megatron/megatron/core/extensions/transformer_engine.py", line 1079, in get_cpu_offload_context
    context, sync_func = _get_cpu_offload_context(
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/cpu_offload.py", line 508, in get_cpu_offload_context
    cpu_offload_handler = AsyncDoubleBufferGroupOffloadHandler(
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/cpu_offload.py", line 274, in __init__
    self.d2h_stream = torch.cuda.Stream()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/streams.py", line 34, in __new__
    return super().__new__(cls, priority=priority, **kwargs)
RuntimeError: CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

sending transformer layer 18
sending transformer layer 19

TeddLi avatar Oct 27 '24 17:10 TeddLi

I did use --saver megatron, and the conversion completed without any problem.

TeddLi avatar Oct 27 '24 17:10 TeddLi

But I hit the same problem when training the legacy model.

TeddLi avatar Oct 27 '24 17:10 TeddLi

Just to be clear, did your error above happen during conversion or during training? The extra lines at the bottom showing sending transformer layer ... suggest that this is a conversion error, correct?

lmcafee-nvidia avatar Oct 28 '24 17:10 lmcafee-nvidia

@lmcafee-nvidia During conversion.

TeddLi avatar Oct 28 '24 20:10 TeddLi

@lmcafee-nvidia Megatron conversion works, but during training we hit the exact error described in this post. So we changed the conversion type to --saver mcore, but that conversion couldn't finish. We are kind of stuck there.

TeddLi avatar Oct 28 '24 20:10 TeddLi

@lmcafee-nvidia Just another update: we also tried the two flags you suggested, --use-legacy-models and --ckpt-format torch. Neither of the solutions you provided works for us; it still hits the state_dict error:

[rank4]: Traceback (most recent call last):                                                                                                                                               
[rank4]:   File "/workspace/megatron/./pretrain_gpt.py", line 265, in <module>                                                                                                            
[rank4]:     pretrain(                                                                                                                                                                    
[rank4]:   File "/workspace/megatron/megatron/training/training.py", line 301, in pretrain                                                                                                
[rank4]:     model, optimizer, opt_param_scheduler = setup_model_and_optimizer(                                                                                                           
[rank4]:   File "/workspace/megatron/megatron/training/training.py", line 680, in setup_model_and_optimizer                                                                               
[rank4]:     args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(                                                                                                 
[rank4]:   File "/workspace/megatron/megatron/training/checkpointing.py", line 1152, in load_checkpoint                                                                                   
[rank4]:     model[0].load_state_dict(state_dict['model'], strict=strict)                                                                                                                 
[rank4]:   File "/workspace/megatron/megatron/legacy/model/gpt_model.py", line 122, in load_state_dict                                                                                    
[rank4]:     self.language_model.load_state_dict(state_dict, strict=strict)                                                                                                               
[rank4]:   File "/workspace/megatron/megatron/legacy/model/language_model.py", line 633, in load_state_dict                                                                               
[rank4]:     self.encoder.load_state_dict(state_dict_, strict=strict)                                                                                                                     
[rank4]:   File "/workspace/megatron/megatron/legacy/model/transformer.py", line 1803, in load_state_dict                                                                                 
[rank4]:     super().load_state_dict(state_dict_, strict)                                                                                                                                 
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2584, in load_state_dict                                                                       
[rank4]:     raise RuntimeError(                                                                                                                                                          
[rank4]: RuntimeError: Error(s) in loading state_dict for ParallelTransformer:                                                                                                            
[rank4]:        Missing key(s) in state_dict: "layers.0.self_attention.layernorm_qkv.layer_norm_weight",

TeddLi avatar Nov 02 '24 19:11 TeddLi

@TeddLi You should use spawn as the start method for torch multiprocessing, otherwise the CUDA context cannot be properly set up. A simple way to fix it is to add torch.multiprocessing.set_start_method('spawn') in https://github.com/NVIDIA/Megatron-LM/blob/main/tools/checkpoint/convert.py before invoking subprocesses.

zshCuanNi avatar Nov 04 '24 13:11 zshCuanNi

@lmcafee-nvidia I believe the missing torch.multiprocessing.set_start_method('spawn') call in https://github.com/NVIDIA/Megatron-LM/blob/main/tools/checkpoint/convert.py is a bug.

To be more specific, when initializing an mcore model, the model_provider in the mcore saver and loader uses TransformerEngine as the backend implementation. During this process, some CUDA streams are allocated via torch.cuda.Stream() in https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/extensions/transformer_engine.py and in some internal code of the TransformerEngine package. This requires the CUDA context to be properly set up. As mentioned in https://github.com/pytorch/pytorch/issues/2517, adding mp.set_start_method('spawn') is a workaround to ensure that the CUDA context is initialized correctly for each subprocess.
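
For illustration, a minimal standalone sketch of the pattern (this is not the actual convert.py code; it only shows setting the start method before launching a worker that touches CUDA):

import torch
import torch.multiprocessing as mp

def worker():
    # Creating a CUDA stream in the child requires a cleanly initialized CUDA context.
    stream = torch.cuda.Stream()
    print("created stream in child process:", stream)

if __name__ == "__main__":
    # Must be called before any subprocess is started; with 'spawn', each child
    # sets up its own CUDA context instead of inheriting a forked one.
    mp.set_start_method("spawn")
    p = mp.Process(target=worker)
    p.start()
    p.join()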

zshCuanNi avatar Nov 05 '24 08:11 zshCuanNi

@zshCuanNi, thanks for your suggestion regarding spawn, though in internal testing we have not encountered any issues related to this. Perhaps something is different in our environment setups, but I'm not sure.

@TeddLi @zshCuanNi, would it be possible for you to set up a reproducible example for each of your issues, using publicly available checkpoints such as the ones listed here: https://github.com/NVIDIA/Megatron-LM?tab=readme-ov-file#downloading-checkpoints. If you could do that and provide your launch command, that would help me look into this. Thanks.

lmcafee-nvidia avatar Nov 06 '24 15:11 lmcafee-nvidia

@TeddLi You should use spawn as the start method for torch multiprocessing, otherwise the CUDA context cannot be properly set up. A simple way to fix it is to add torch.multiprocessing.set_start_method('spawn') in https://github.com/NVIDIA/Megatron-LM/blob/main/tools/checkpoint/convert.py before invoking subprocesses.

@zshCuanNi Thanks for the insightful response. Would you mind if I send you a DM? I sent a message to your email ending with @pku.edu.cn.

TeddLi avatar Nov 09 '24 16:11 TeddLi

Let's keep the discussion on GitHub for now. Did you consider making a reproducible example? If you set up a script based on a public checkpoint, I can try to debug your issue.

lmcafee-nvidia avatar Nov 13 '24 18:11 lmcafee-nvidia

Let's keep the discussion on GitHub for now. Did you consider making a reproducible example? If you set up a script based on a public checkpoint, I can try to debug your issue.

Yes, I will make a reproducible example. I just used a random Llama2 checkpoint; I will update this page later.

TeddLi avatar Nov 13 '24 19:11 TeddLi

@lmcafee-nvidia Sure! Below is a reproducible example. I tested it on the up-to-date main branch of Megatron-LM, on an H100 80G on Azure cloud, with torch==2.5.1+cu124 and transformer-engine==1.13.0.dev0. I used the Llama3-8B checkpoint downloaded from Hugging Face and ran the script below:

python tools/checkpoint/convert.py \
    --model-type GPT \
    --model-size llama3-8B \
    --loader llama_mistral \
    --saver mcore \
    --checkpoint-type hf \
    --load-dir ${HF_FORMAT_DIR} \
    --save-dir ${MG_FORMAT_DIR} \
    --tokenizer-model ${HF_FORMAT_DIR} \
    --target-tensor-parallel-size ${tp} \
    --target-pipeline-parallel-size ${pp}

I encountered the error below:

Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "PATH/TO/Megatron-LM/tools/checkpoint/saver_mcore.py", line 555, in save_checkpoint
    model = get_local_model(0, ep_rank, tp_rank)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/TO/Megatron-LM/tools/checkpoint/saver_mcore.py", line 548, in get_local_model
    models[pp_rank][ep_rank][tp_rank] = model_provider(pre_process, post_process).to(md.params_dtype)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATH/TO/Megatron-LM/pretrain_gpt.py", line 96, in model_provider
    model = GPTModel(
            ^^^^^^^^^
  File "PATH/TO/Megatron-LM/megatron/core/models/gpt/gpt_model.py", line 120, in __init__
    self.decoder = TransformerBlock(
                   ^^^^^^^^^^^^^^^^^
  File "PATH/TO/Megatron-LM/megatron/core/transformer/transformer_block.py", line 204, in __init__
    get_cpu_offload_context(
  File "PATH/TO/Megatron-LM/megatron/core/extensions/transformer_engine.py", line 1075, in get_cpu_offload_context
    context, sync_func = _get_cpu_offload_context(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/transformer_engine/pytorch/cpu_offload.py", line 504, in get_cpu_offload_context
    cpu_offload_handler = AsyncDoubleBufferGroupOffloadHandler(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/transformer_engine/pytorch/cpu_offload.py", line 314, in __init__
    self.d2h_stream = torch.cuda.Stream()
                      ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/cuda/streams.py", line 35, in __new__
    return super().__new__(cls, priority=priority, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

zshCuanNi avatar Nov 14 '24 08:11 zshCuanNi

Thanks @zshCuanNi for your script. I'll run it and get back to you.

lmcafee-nvidia avatar Dec 10 '24 16:12 lmcafee-nvidia

@zshCuanNi , I didn't see any errors when I ran your conversion command above on the Llama 3 8B model.

I tested your conversion script with a few different NGC containers from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch. The containers closest to the torch/transformer_engine versions you mentioned are:

  • 24.10-py3 : torch==2.5.0a0+e000cf0ad9.nv24.10, transformer_engine==1.11.0+4df8488
  • 24.11-py3 : torch==2.6.0a0+df5bbc09d1.nv24.11, transformer_engine==1.12.0+7f2afaa

I wasn't able to test your exact versions because I could neither 1) pip install them, nor 2) find those exact tags on their GitHub repos.

Nevertheless, everything is working smoothly on my end. Maybe you could try an NGC container?
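
If it helps, a typical way to launch one of those containers (the mount path is just an example):

docker run --gpus all -it --rm \
    -v /path/to/Megatron-LM:/workspace/Megatron-LM \
    nvcr.io/nvidia/pytorch:24.10-py3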

lmcafee-nvidia avatar Dec 10 '24 21:12 lmcafee-nvidia

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Feb 09 '25 18:02 github-actions[bot]

@TeddLi You should use spawn as the start method for torch multiprocessing, otherwise the CUDA context cannot be properly set up. A simple way to fix it is to add torch.multiprocessing.set_start_method('spawn') in https://github.com/NVIDIA/Megatron-LM/blob/main/tools/checkpoint/convert.py before invoking subprocesses.

I also hit this CUDA stream error. Adding torch.multiprocessing.set_start_method('spawn') solved it for me. Thanks!

qijiaxing avatar Jul 01 '25 02:07 qijiaxing

Closing as resolved. As verified in the comments above, the conversion works correctly with the proper parameters.

Issue: --saver megatron creates legacy format checkpoints, but the training script expects Megatron Core format, causing state_dict key mismatches.

Solution: Use --saver core during conversion

sbhavani avatar Oct 12 '25 19:10 sbhavani