
Stuck while training InternVL3.5-30B-A3B

Open shuoyinn opened this issue 3 months ago • 8 comments

Hi, thanks for open-sourcing this! I have a question I'd appreciate help with.

When training InternVL3.5-30B-A3B with ZeRO-3 on 4 nodes x 8 H20 GPUs, the job hangs indefinitely. The log is shown in the screenshot below.

[screenshot: training log]

GPU utilization stays at 100% with only about 10 GB of memory in use, so training has clearly not started. [screenshot: GPU status]

In the same environment, training the 4B dense model works fine. Based on my experience with the dense model, the log above should be followed by the training-step records, but instead the run just hangs.

Whether packing is enabled or not, the run reaches the log above and then hangs; I also tried ZeRO stage 2 and it gets stuck as well. How can I resolve this?

Training InternVL3_5-GPT-OSS-20B-A4B-Preview works fine with --use_custom_flash_attn True as in the provided script; the hanging InternVL3.5-30B-A3B run also follows its script, with --use_custom_flash_attn False. Does the InternVL3.5-30B-A3B MoE training script under the internvl3_5_qwen3 directory need any additional special configuration, hyperparameters, or environment setup?
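For anyone debugging a hang like this, here is a minimal sketch of the standard PyTorch/NCCL diagnostics one could collect (nothing here is InternVL-specific; py-spy and the environment variables are generic tools, and <training_process_pid> is a placeholder):

export NCCL_DEBUG=INFO                 # print NCCL init/collective activity per rank
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra checks for mismatched collectives across ranks
# relaunch the training script with these variables set, then on a stuck node:
python3.10 -m pip install py-spy
py-spy dump --pid <training_process_pid>   # shows the Python stack where the rank is blocked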

shuoyinn avatar Sep 24 '25 17:09 shuoyinn

@shuoyinn Hi shuoyinn.

If you manage to solve this, could you let me know? As far as I understand, it's something related to Qwen3MoE on DeepSpeed; with FSDP it might be possible to train, but I was not able to get it working.

vladimiralbrekhtccr avatar Oct 23 '25 11:10 vladimiralbrekhtccr

@vladimiralbrekhtccr

I've tried many times and still cannot solve it, but I'll share the fix here if I find one.

It's strange: if DeepSpeed really didn't work for Qwen3MoE as you mentioned, I would expect to hit the same issue with Qwen3-VL-30B-A3B-Instruct, yet I can train Qwen3-VL-30B-A3B-Instruct successfully via the official qwen3vl code: https://github.com/QwenLM/Qwen3-VL.

My guess is that the requirements file internvl_chat_gpt_oss/requirements.txt provided by the authors is actually only meant for the gpt-oss model (20B), not the Qwen3MoE model (30B), because I can run InternVL3.5-20B-A4B training successfully.
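For anyone who wants to check that guess, a quick way to compare the pinned versions against what is actually installed (assuming the requirements file uses ordinary pkg==version pins) is:

grep -E 'torch|transformers|deepspeed|accelerate|flash' internvl_chat_gpt_oss/requirements.txt    # what the repo pins
python3.10 -m pip list --format=freeze | grep -E 'torch|transformers|deepspeed|accelerate|flash'  # what is installed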

shuoyinn avatar Oct 23 '25 12:10 shuoyinn

@shuoyinn Hi Shuoyinn. Thanks for the response. I tried to run Qwen3-VL-30B-A3B-Instruct via the official qwen3vl code but was not successful. If you could share how you trained it, that would be very helpful. I'm struggling with this training under DeepSpeed stage 3, and I found a few related issues where people said DeepSpeed doesn't support Qwen3-VL-30B-A3B-Instruct with stage 3. I wonder if I'm doing something wrong. I understand DeepSpeed stage 2 works well if you have enough GPUs, but I'm GPU poor, so I want to train with stage 3. 😢

Have a good day. 🤗

vladimiralbrekhtccr avatar Oct 29 '25 09:10 vladimiralbrekhtccr

The environment I used to train Qwen3VL-30B-A3B is based on the recommended requirements, though there are some differences. My CUDA version is 12.4:

python3.10 -m pip install torch==2.6.0 torchvision==0.21.0 deepspeed==0.17.1
python3.10 -m pip install /mnt/bn/shuoyinnas-hl/whls/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl 
python3.10 -m pip install transformers==4.57.0
python3.10 -m pip install triton==3.2.0
python3.10 -m pip install accelerate==1.7.0
python3.10 -m pip install torchcodec==0.2 
python3.10 -m pip install --upgrade bitsandbytes

python3.10 -m pip install datasets
python3.10 -m pip install decord
python3.10 -m pip install tensorboard 

About DeepSpeed, I use ZeRO-2, and the config file is my own customized version based on the ones used in the LLaMA-Factory, InternVL, and Qwen-VL repositories (I cannot remember the details of how I modified them).

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "round_robin_gradients": true
  }
}

As for InternVL3.5-30B-A3B, I still cannot get the training to start 😢. Maybe I'd better try LLaMA-Factory next.

shuoyinn avatar Oct 29 '25 12:10 shuoyinn

Huge thanks, Shuo Yin. I was struggling a lot with training from the original Qwen3-VL repo. Today I tried LLaMA-Factory and everything worked perfectly with DeepSpeed stage 3. Not sure what was causing the problems in the original repo. Good luck with InternVL3.5-30B-A3B.

If I work on this model later and manage to train it with DeepSpeed stage 2, I'll let you know how. 🍀

vladimiralbrekhtccr avatar Oct 29 '25 14:10 vladimiralbrekhtccr

Hello shuoyinn,

Can you run InternVL3.5-30B-A3B training successfully with LLaMA-Factory now?

zzzzzigzag avatar Nov 04 '25 07:11 zzzzzigzag

Hi, with LLaMA-Factory I can run InternVL3.5-30B-A3B under DeepSpeed ZeRO-3. But the GPU usage looks strange: only about 10 GB out of 90 GB on each H20 (4 nodes). Meanwhile, training is very slow (though I'm sure it actually started, because the step count keeps increasing). That speed is really not acceptable for me. By the way, ZeRO-2 runs out of memory (OOM).

I suspect it comes down to the DeepSpeed config file, but I haven't managed to modify it into a suitable one :(
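In case it is useful as a starting point, here is a minimal ZeRO-3 sketch in the same style as the ZeRO-2 config above. It only uses standard DeepSpeed options, but the bucket sizes and stage3_* thresholds are illustrative guesses, not a verified working config for this model:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

Lowering stage3_max_live_parameters and stage3_max_reuse_distance trades speed for memory, so these values would need tuning per setup.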

shuoyinn avatar Nov 04 '25 12:11 shuoyinn

Hi @Weiyun1025, could you please take a look at this when you have time?

Is the Python environment for fine-tuning InternVL3.5-30B-A3B also the one specified in internvl_chat_gpt_oss/requirements.txt?

Also, which CUDA version did you use in your actual training environment? I've tried cu124 and cu126 and both hit the hang described above. Should I use cu128 instead?
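For reference, the commands I use to report the exact versions on my side (generic, nothing repo-specific):

nvidia-smi | head -n 5    # driver version and the highest CUDA version the driver supports
nvcc --version            # CUDA toolkit version, if installed
python3.10 -c "import torch; print(torch.__version__, torch.version.cuda)"   # CUDA build torch was compiled against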

shuoyinn avatar Nov 09 '25 14:11 shuoyinn