Error When Converting Qwen3-30B-A3B-Base Model from HuggingFace to Megatron-Core Format
Description
When attempting to convert a Qwen3-30B-A3B-Base model from HuggingFace format to Megatron-Core format using the converter_hf_to_mcore.py script, the conversion fails with a CUDA stream creation error in the Transformer Engine's CPU offloading module.
Environment
- Python: 3.10 (conda environment: verl-megatron)
- CUDA: 12.3 (nvcc V12.3.103)
- PyTorch: 2.x with CUDA 12.4 support
- GPU: 8x NVIDIA A100-SXM4-80GB
- Transformer Engine: 2.2.0 (as per official requirements)
Steps to Reproduce
- Set up the environment with the required dependencies
- Run the conversion script:
HF_MODEL_PATH=/home/xxx/local_model_weights/Qwen3-30B-A3B-Base
DIST_CKPT_PATH="/home/xxx/local_model_weights/qwen3_30b_dist_ckpt"
python scripts/converter_hf_to_mcore.py --hf_model_path $HF_MODEL_PATH --output_path $DIST_CKPT_PATH
Expected Behavior
The script should successfully convert the HuggingFace model to Megatron-Core format and save it to the specified output path.
Actual Behavior
The script fails with the following error:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/xxx/code/verl/scripts/converter_hf_to_mcore.py", line 235, in <module>
[rank0]: convert_hf_to_mcore(args.hf_model_path, args.output_path, args.use_cpu_initialization, args.test)
[rank0]: File "/home/xxx/code/verl/scripts/converter_hf_to_mcore.py", line 192, in convert_hf_to_mcore
[rank0]: model = get_model(
[rank0]: File "/home/xxx/code/verl/verl/utils/megatron_utils.py", line 82, in get_model
[rank0]: model = model_provider_func(pre_process=pre_process, post_process=post_process)
[rank0]: File "/home/xxx/code/verl/scripts/converter_hf_to_mcore.py", line 182, in megatron_model_provider
[rank0]: parallel_model = init_mcore_model(
[rank0]: File "/home/xxx/code/verl/verl/models/mcore/registry.py", line 160, in init_mcore_model
[rank0]: return initializer.initialize(pre_process=pre_process, post_process=post_process, share_embeddings_and_output_weights=share_embeddings_and_output_weights, value=value, **extra_kwargs)
[rank0]: File "/home/zjc/code/verl/verl/models/mcore/model_initializer.py", line 151, in initialize
[rank0]: model = super().initialize(**kwargs)
[rank0]: File "/home/xxx/code/verl/verl/models/mcore/model_initializer.py", line 71, in initialize
[rank0]: model = GPTModel(
[rank0]: File "/root/anaconda3/envs/verl-megatron/lib/python3.10/site-packages/megatron/core/models/gpt/gpt_model.py", line 169, in __init__
[rank0]: self.decoder = TransformerBlock(
[rank0]: File "/root/anaconda3/envs/verl-megatron/lib/python3.10/site-packages/megatron/core/transformer/transformer_block.py", line 242, in __init__
[rank0]: get_cpu_offload_context(
[rank0]: File "/root/anaconda3/envs/verl-megatron/lib/python3.10/site-packages/megatron/core/extensions/transformer_engine.py", line 1329, in get_cpu_offload_context
[rank0]: context, sync_func = _get_cpu_offload_context(
[rank0]: File "/root/anaconda3/envs/verl-megatron/lib/python3.10/site-packages/transformer_engine/pytorch/cpu_offload.py", line 595, in get_cpu_offload_context
[rank0]: cpu_offload_handler = AsyncDoubleBufferGroupOffloadHandler(
[rank0]: File "/root/anaconda3/envs/verl-megatron/lib/python3.10/site-packages/transformer_engine/pytorch/cpu_offload.py", line 328, in __init__
[rank0]: self.d2h_stream = torch.cuda.Stream()
[rank0]: File "/root/anaconda3/envs/verl-megatron/lib/python3.10/site-packages/torch/cuda/streams.py", line 34, in __new__
[rank0]: return super().__new__(cls, priority=priority, **kwargs)
[rank0]: RuntimeError: CUDA error: operation not supported
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Additional Information
- CUDA environment appears to be correctly configured - running a simple CUDA stream creation test works fine:
import torch
stream = torch.cuda.Stream() # This works without error
- Attempted workarounds that did NOT resolve the issue:
  - Setting environment variables to disable CPU offloading:
    export NVTE_CPU_OFFLOAD_SIZE=0
    export TRANSFORMER_ENGINE_NO_CPU_OFFLOAD=1
  - Using the --use_cpu_initialization flag
Any guidance on resolving this issue would be greatly appreciated. Thank you!
I cannot reproduce it. Could you try it with the official Docker image to rule out an environment issue?
Me too. I tried setting
HAVE_TE = False
get_cpu_offload_context = None
in the try/except block at the beginning of Megatron-LM/megatron/core/transformer/transformer_block.py. The layers are then sent and received normally, but at the end, calling model.state_dict() fails, raising RuntimeError: CUDA error: operation not supported again. I guess it might be an NVIDIA driver version problem.
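For anyone trying the same thing, the patch looks roughly like this. The surrounding import guard is paraphrased and may differ between Megatron-LM versions; only the two overriding assignments come from the workaround described above (the import path is the one shown in the traceback):

# megatron/core/transformer/transformer_block.py (sketch, not exact upstream code)
try:
    # Transformer Engine integration is normally imported here, including
    # get_cpu_offload_context, which creates the CUDA stream that fails.
    from megatron.core.extensions.transformer_engine import get_cpu_offload_context
    HAVE_TE = True
except ImportError:
    HAVE_TE = False
    get_cpu_offload_context = None

# Workaround: force the non-TE code path regardless of the import result.
HAVE_TE = False
get_cpu_offload_context = None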
I also hit this CUDA stream error. Adding one line, torch.multiprocessing.set_start_method('spawn'), resolved it for me.
See here: https://github.com/NVIDIA/Megatron-LM/issues/1132#issuecomment-3021533022
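In case it helps, a minimal sketch of where that call can go, assuming it is added at the top of the converter script's entry point before any CUDA work or worker processes are started (the rest of the script is unchanged):

# scripts/converter_hf_to_mcore.py, start of the __main__ block (sketch)
import torch

if __name__ == "__main__":
    # CUDA cannot be re-initialized in subprocesses created with the default
    # "fork" start method, so set "spawn" before anything touches the GPU.
    # (force=True can be added if another library already set a start method.)
    torch.multiprocessing.set_start_method("spawn")
    # ... existing argument parsing and convert_hf_to_mcore(...) call ...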
@jinyouzhi I used the official Docker image verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2 to convert Qwen2.5-7b-Instruct and hit the same error.
Overridden TF init config: {'num_layers': 28, 'hidden_size': 3584, 'num_attention_heads': 28, 'num_query_groups': 4, 'ffn_hidden_size': 18944, 'attention_dropout': 0.0, 'hidden_dropout': 0.0, 'kv_channels': None, 'layernorm_epsilon': 1e-06, 'activation_func': <function silu at 0x7f9f9effa440>, 'normalization': 'RMSNorm', 'gated_linear_unit': True, 'pipeline_dtype': torch.bfloat16, 'params_dtype': torch.bfloat16, 'bf16': True, 'tensor_model_parallel_size': 1, 'pipeline_model_parallel_size': 1, 'expert_model_parallel_size': 1, 'expert_tensor_parallel_size': 1, 'virtual_pipeline_model_parallel_size': None, 'context_parallel_size': 1, 'overlap_p2p_comm': False, 'batch_p2p_comm': False, 'sequence_parallel': False, 'variable_seq_lengths': True, 'masked_softmax_fusion': True, 'moe_token_dispatcher_type': 'alltoall', 'use_cpu_initialization': False, 'add_bias_linear': False, 'add_qkv_bias': True, 'qk_layernorm': False}
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/verl/scripts/converter_hf_to_mcore.py", line 303, in <module>
[rank0]: convert_hf_to_mcore(args.hf_model_path, args.output_path, args.use_cpu_initialization, args.test, args.trust_remote_code)
[rank0]: File "/workspace/verl/scripts/converter_hf_to_mcore.py", line 258, in convert_hf_to_mcore
[rank0]: model = get_model(
[rank0]: File "/workspace/verl/verl/utils/megatron_utils.py", line 82, in get_model
[rank0]: model = model_provider_func(pre_process=pre_process, post_process=post_process)
[rank0]: File "/workspace/verl/scripts/converter_hf_to_mcore.py", line 248, in megatron_model_provider
[rank0]: parallel_model = init_mcore_model(
[rank0]: File "/workspace/verl/verl/models/mcore/registry.py", line 164, in init_mcore_model
[rank0]: return initializer.initialize(pre_process=pre_process, post_process=post_process, share_embeddings_and_output_weights=share_embeddings_and_output_weights, value=value, **extra_kwargs)
[rank0]: File "/workspace/verl/verl/models/mcore/model_initializer.py", line 71, in initialize
[rank0]: model = GPTModel(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/megatron/core/models/gpt/gpt_model.py", line 132, in __init__
[rank0]: self.embedding = LanguageModelEmbedding(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/megatron/core/models/common/embeddings/language_model_embedding.py", line 53, in __init__
[rank0]: self.word_embeddings = tensor_parallel.VocabParallelEmbedding(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/megatron/core/tensor_parallel/layers.py", line 231, in __init__
[rank0]: torch.empty(
[rank0]: RuntimeError: CUDA error: operation not supported
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank0]:[W723 14:55:52.846129107 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())