Error When Converting Qwen3-30B-A3B-Base Model from HuggingFace to Megatron-Core Format
Description
When attempting to convert a Qwen3-30B-A3B-Base model from HuggingFace format to Megatron-Core format using the converter_hf_to_mcore.py script, the conversion fails with a CUDA stream creation error in the Transformer Engine's CPU offloading module.
Environment
- Python: 3.10 (conda environment: verl-megatron)
- CUDA: 12.3 (nvcc V12.3.103)
- PyTorch: 2.x with CUDA 12.4 support
- GPU: 8x NVIDIA A100-SXM4-80GB
- Transformer Engine: 2.2.0 (as per official requirements)
Steps to Reproduce
- Set up the environment with the required dependencies
- Run the conversion script:
HF_MODEL_PATH=/home/xxx/local_model_weights/Qwen3-30B-A3B-Base
DIST_CKPT_PATH="/home/xxx/local_model_weights/qwen3_30b_dist_ckpt"
python scripts/converter_hf_to_mcore.py --hf_model_path $HF_MODEL_PATH --output_path $DIST_CKPT_PATH
Expected Behavior
The script should successfully convert the HuggingFace model to Megatron-Core format and save it to the specified output path.
Actual Behavior
The script fails with the following error:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/xxx/code/verl/scripts/converter_hf_to_mcore.py", line 235, in <module>
[rank0]: convert_hf_to_mcore(args.hf_model_path, args.output_path, args.use_cpu_initialization, args.test)
[rank0]: File "/home/xxx/code/verl/scripts/converter_hf_to_mcore.py", line 192, in convert_hf_to_mcore
[rank0]: model = get_model(
[rank0]: File "/home/xxx/code/verl/verl/utils/megatron_utils.py", line 82, in get_model
[rank0]: model = model_provider_func(pre_process=pre_process, post_process=post_process)
[rank0]: File "/home/xxx/code/verl/scripts/converter_hf_to_mcore.py", line 182, in megatron_model_provider
[rank0]: parallel_model = init_mcore_model(
[rank0]: File "/home/xxx/code/verl/verl/models/mcore/registry.py", line 160, in init_mcore_model
[rank0]: return initializer.initialize(pre_process=pre_process, post_process=post_process, share_embeddings_and_output_weights=share_embeddings_and_output_weights, value=value, **extra_kwargs)
[rank0]: File "/home/zjc/code/verl/verl/models/mcore/model_initializer.py", line 151, in initialize
[rank0]: model = super().initialize(**kwargs)
[rank0]: File "/home/xxx/code/verl/verl/models/mcore/model_initializer.py", line 71, in initialize
[rank0]: model = GPTModel(
[rank0]: File "/root/anaconda3/envs/verl-megatron/lib/python3.10/site-packages/megatron/core/models/gpt/gpt_model.py", line 169, in __init__
[rank0]: self.decoder = TransformerBlock(
[rank0]: File "/root/anaconda3/envs/verl-megatron/lib/python3.10/site-packages/megatron/core/transformer/transformer_block.py", line 242, in __init__
[rank0]: get_cpu_offload_context(
[rank0]: File "/root/anaconda3/envs/verl-megatron/lib/python3.10/site-packages/megatron/core/extensions/transformer_engine.py", line 1329, in get_cpu_offload_context
[rank0]: context, sync_func = _get_cpu_offload_context(
[rank0]: File "/root/anaconda3/envs/verl-megatron/lib/python3.10/site-packages/transformer_engine/pytorch/cpu_offload.py", line 595, in get_cpu_offload_context
[rank0]: cpu_offload_handler = AsyncDoubleBufferGroupOffloadHandler(
[rank0]: File "/root/anaconda3/envs/verl-megatron/lib/python3.10/site-packages/transformer_engine/pytorch/cpu_offload.py", line 328, in __init__
[rank0]: self.d2h_stream = torch.cuda.Stream()
[rank0]: File "/root/anaconda3/envs/verl-megatron/lib/python3.10/site-packages/torch/cuda/streams.py", line 34, in __new__
[rank0]: return super().__new__(cls, priority=priority, **kwargs)
[rank0]: RuntimeError: CUDA error: operation not supported
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Additional Information
- CUDA environment appears to be correctly configured - running a simple CUDA stream creation test works fine:
import torch
stream = torch.cuda.Stream() # This works without error
- Attempted workarounds that did NOT resolve the issue:
  - Setting environment variables to disable CPU offloading:
    export NVTE_CPU_OFFLOAD_SIZE=0
    export TRANSFORMER_ENGINE_NO_CPU_OFFLOAD=1
  - Using the --use_cpu_initialization flag
Any guidance on resolving this issue would be greatly appreciated. Thank you!
I cannot reproduce it. Could you try it with the official Docker image to rule out an environment issue?
Me too. I tried setting
HAVE_TE = False
get_cpu_offload_context = None
in the try/except block at the beginning of Megatron-LM/megatron/core/transformer/transformer_block.py. The layers are then sent and received normally, but at the end, calling model.state_dict() fails, raising RuntimeError: CUDA error: operation not supported again. I guess it might be an NVIDIA driver version problem.
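For anyone trying the same thing, the patch looks roughly like this. The surrounding import guard is paraphrased and may differ between Megatron-LM versions; only the two overriding assignments come from the workaround described above (the import path is the one shown in the traceback):

# megatron/core/transformer/transformer_block.py (sketch, not exact upstream code)
try:
    # Transformer Engine integration is normally imported here, including
    # get_cpu_offload_context, which creates the CUDA stream that fails.
    from megatron.core.extensions.transformer_engine import get_cpu_offload_context
    HAVE_TE = True
except ImportError:
    HAVE_TE = False
    get_cpu_offload_context = None

# Workaround: force the non-TE code path regardless of the import result.
HAVE_TE = False
get_cpu_offload_context = None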
I also hit this CUDA stream error. Adding one line, torch.multiprocessing.set_start_method('spawn'), resolved it for me.
See here: https://github.com/NVIDIA/Megatron-LM/issues/1132#issuecomment-3021533022
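In case it helps, a minimal sketch of where that call can go, assuming it is added at the top of the converter script's entry point before any CUDA work or worker processes are started (the rest of the script is unchanged):

# scripts/converter_hf_to_mcore.py, start of the __main__ block (sketch)
import torch

if __name__ == "__main__":
    # CUDA cannot be re-initialized in subprocesses created with the default
    # "fork" start method, so set "spawn" before anything touches the GPU.
    # (force=True can be added if another library already set a start method.)
    torch.multiprocessing.set_start_method("spawn")
    # ... existing argument parsing and convert_hf_to_mcore(...) call ...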
@jinyouzhi I used the official Docker image verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2 to convert Qwen2.5-7b-Instruct and hit the same error.
Overridden TF init config: {'num_layers': 28, 'hidden_size': 3584, 'num_attention_heads': 28, 'num_query_groups': 4, 'ffn_hidden_size': 18944, 'attention_dropout': 0.0, 'hidden_dropout': 0.0, 'kv_channels': None, 'layernorm_epsilon': 1e-06, 'activation_func': <function silu at 0x7f9f9effa440>, 'normalization': 'RMSNorm', 'gated_linear_unit': True, 'pipeline_dtype': torch.bfloat16, 'params_dtype': torch.bfloat16, 'bf16': True, 'tensor_model_parallel_size': 1, 'pipeline_model_parallel_size': 1, 'expert_model_parallel_size': 1, 'expert_tensor_parallel_size': 1, 'virtual_pipeline_model_parallel_size': None, 'context_parallel_size': 1, 'overlap_p2p_comm': False, 'batch_p2p_comm': False, 'sequence_parallel': False, 'variable_seq_lengths': True, 'masked_softmax_fusion': True, 'moe_token_dispatcher_type': 'alltoall', 'use_cpu_initialization': False, 'add_bias_linear': False, 'add_qkv_bias': True, 'qk_layernorm': False}
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/verl/scripts/converter_hf_to_mcore.py", line 303, in <module>
[rank0]: convert_hf_to_mcore(args.hf_model_path, args.output_path, args.use_cpu_initialization, args.test, args.trust_remote_code)
[rank0]: File "/workspace/verl/scripts/converter_hf_to_mcore.py", line 258, in convert_hf_to_mcore
[rank0]: model = get_model(
[rank0]: File "/workspace/verl/verl/utils/megatron_utils.py", line 82, in get_model
[rank0]: model = model_provider_func(pre_process=pre_process, post_process=post_process)
[rank0]: File "/workspace/verl/scripts/converter_hf_to_mcore.py", line 248, in megatron_model_provider
[rank0]: parallel_model = init_mcore_model(
[rank0]: File "/workspace/verl/verl/models/mcore/registry.py", line 164, in init_mcore_model
[rank0]: return initializer.initialize(pre_process=pre_process, post_process=post_process, share_embeddings_and_output_weights=share_embeddings_and_output_weights, value=value, **extra_kwargs)
[rank0]: File "/workspace/verl/verl/models/mcore/model_initializer.py", line 71, in initialize
[rank0]: model = GPTModel(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/megatron/core/models/gpt/gpt_model.py", line 132, in __init__
[rank0]: self.embedding = LanguageModelEmbedding(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/megatron/core/models/common/embeddings/language_model_embedding.py", line 53, in __init__
[rank0]: self.word_embeddings = tensor_parallel.VocabParallelEmbedding(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/megatron/core/tensor_parallel/layers.py", line 231, in __init__
[rank0]: torch.empty(
[rank0]: RuntimeError: CUDA error: operation not supported
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank0]:[W723 14:55:52.846129107 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())