Training hangs when fine-tuning Kimi-VL with multi-image data
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
- `llamafactory` version: 0.9.4.dev0
- Platform: Linux-5.10.134-008.16.kangaroo.al8.x86_64-x86_64-with-glibc2.35
- Python version: 3.10.18
- PyTorch version: 2.6.0+cu124 (GPU)
- Transformers version: 4.52.4
- Datasets version: 3.6.0
- Accelerate version: 1.7.0
- PEFT version: 0.15.2
- TRL version: 0.9.6
- GPU type: NVIDIA A800-SXM4-80GB
- GPU number: 6
- GPU memory: 79.35GB
- DeepSpeed version: 0.16.9
- Git commit: ff415d9998180b5a68bbfdda3309ec04b472fb49
- Default data directory: not detected
Reproduction
When fine-tuning with the multi-image samples from the mllm demo, training hangs with GPU utilization stuck at 100%; after removing the multi-image samples, training runs normally.
Others
No response
Can you share your training scripts? I remember that we have tested this model on the mllm_demo dataset.
@Kuangdd01 Thanks for your reply, this is the YAML I used:
```yaml
### model
model_name_or_path: /mnt/workspace/yangyunhao/Kimi-VL-A3B-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: true
freeze_language_model: false
deepspeed: /mnt/workspace/yangyunhao/LLaMA-Factory-main/examples/deepspeed/ds_z3_config.json  # 8xh20 gpu

### dataset
dataset: mllm_demo, identity
template: kimi_vl
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 1
dataloader_num_workers: 4

### output
output_dir: saves/kimi-vl/full
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
```
I used ZeRO-3. I have successfully tested training on the alpaca demo, identity, and llava datasets, but when using the multi-image data in the mllm demo, GPU utilization hits 100% and training gets stuck.
Sorry for the late reply, I have reproduced this issue. It is a common issue when using dsz3 with a MoE model, see for example https://github.com/deepspeedai/DeepSpeed/issues/5066.
To avoid the gradient-disagreement issue under dsz3, I suggest you use dsz2, or try feeding some fake input (with gradient) to the unused experts, like this.
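The core of that trick, as a minimal self-contained toy sketch (the `ToyMoE`/`router` names below are illustrative, not the Kimi-VL or DeepSeek code): run every expert on every rank even when the router assigns it no tokens, so all ranks build the same backward graph.

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Toy top-1 MoE layer that only demonstrates the fake-input trick."""

    def __init__(self, dim: int = 8, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim, bias=False) for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts, bias=False)

    def forward(self, x):                      # x: (tokens, dim)
        top1 = self.router(x).argmax(dim=-1)   # (tokens,)
        y = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                y[mask] = expert(x[mask])
            else:
                # zeroed dummy input: adds exactly zero for a bias-free layer,
                # but the unused expert still enters the autograd graph
                y[0:1] = y[0:1] + expert(x[0:1] * 0.0)
        return y

moe = ToyMoE()
moe(torch.randn(5, 8)).sum().backward()  # every expert now has a .grad tensor
```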
Thanks, I'll try that
Have you fixed this issue under deepspeed zero3 mode? Please share some experience if possible. Much appreciated!
dsz2 seems to have the same problem.
dsz2 can work, but I hit OOM on 8xA800. I tried using fake gradients to solve the dsz3 problem, but it still gets stuck when using mixed image and text data.
Could you please share your method for feeding the fake gradients when using dsz3?
BTW, we have added fake images into the pure text batch. It still gets stuck in gradient_sync? 😭 @yyhycx
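Roughly, the collator-side idea looks like the sketch below (illustrative only; `add_dummy_image` and the exact tensor shapes are placeholders, not the actual LLaMA-Factory code):

```python
import torch

def add_dummy_image(batch: dict, dtype=torch.bfloat16) -> dict:
    """If a batch carries no images, attach a tiny all-zero image so the vision
    tower still runs (and its parameters enter the backward graph) on this rank."""
    if batch.get("pixel_values") is None:
        # a single 2x2 grid of 14x14 patches, following the (patches, C, H, W)
        # pixel_values / image_grid_hws layout seen in the debug logs below
        batch["pixel_values"] = torch.zeros(4, 3, 14, 14, dtype=dtype)
        batch["image_grid_hws"] = torch.tensor([[2, 2]], dtype=torch.long)
        # the model side must then keep these dummy features out of the language
        # stream (e.g. scale them by zero), since input_ids has no image tokens
    return batch
```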
@Kuangdd01 This is the DeepseekV3MoE code I modified; I am not sure whether it is correct:
```python
class DeepseekV3MoE(nn.Module):
    """
    A mixed expert module containing shared experts.
    """

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.num_experts_per_tok = config.num_experts_per_tok
        if hasattr(config, "ep_size") and config.ep_size > 1:
            assert config.ep_size == dist.get_world_size()
            self.ep_size = config.ep_size
            self.experts_per_rank = config.n_routed_experts // config.ep_size
            self.ep_rank = dist.get_rank()
            self.experts = nn.ModuleList(
                [
                    (
                        DeepseekV3MLP(
                            config, intermediate_size=config.moe_intermediate_size
                        )
                        if i >= self.ep_rank * self.experts_per_rank
                        and i < (self.ep_rank + 1) * self.experts_per_rank
                        else None
                    )
                    for i in range(config.n_routed_experts)
                ]
            )
        else:
            self.ep_size = 1
            self.experts_per_rank = config.n_routed_experts
            self.ep_rank = 0
            self.experts = nn.ModuleList(
                [
                    DeepseekV3MLP(
                        config, intermediate_size=config.moe_intermediate_size
                    )
                    for i in range(config.n_routed_experts)
                ]
            )
        self.gate = MoEGate(config)
        if config.n_shared_experts is not None:
            intermediate_size = config.moe_intermediate_size * config.n_shared_experts
            self.shared_experts = DeepseekV3MLP(
                config=config, intermediate_size=intermediate_size
            )

    def forward(self, hidden_states):
        identity = hidden_states
        orig_shape = hidden_states.shape
        topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        if self.training:
            flat_topk_idx = topk_idx.view(-1)
            hidden_states = hidden_states.repeat_interleave(
                self.num_experts_per_tok, dim=0
            )
            # y = torch.empty_like(hidden_states)
            # use zeros so the fake-expert branch below can safely add into y
            y = torch.zeros_like(hidden_states)
            for i, expert in enumerate(self.experts):
                mask = flat_topk_idx == i
                token_indices = mask.nonzero(as_tuple=True)[0]
                if token_indices.numel() == 0:
                    # no token routed to this expert on this rank: run it on a
                    # zeroed dummy token so it still enters the backward graph
                    # and gradient sync stays consistent across ranks
                    virtual_input = hidden_states[0:1]
                    fake_output = expert(virtual_input * 0)
                    y[0:1] += fake_output  # adds zeros, changes nothing numerically
                else:
                    y[mask] = expert(hidden_states[mask])
            y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)
            y = y.to(hidden_states.dtype).view(*orig_shape)
            if aux_loss is not None:
                y = AddAuxiliaryLoss.apply(y, aux_loss)
        else:
            y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        if self.config.n_shared_experts is not None:
            y = y + self.shared_experts(identity)
        return y
```
Training proceeds normally when using only image data or only text data, but it gets stuck when the two are mixed. The forward phase appears to be normal.
@Kuangdd01 I found that in the batch that caused the hang, image_grid_hws differed across ranks. Could this be the problem?
```
===== DEBUG: Input Keys and Shapes =====  Current Step: 0
input_ids:      shape=torch.Size([2, 1208]),         dtype=torch.int64,    device=cuda:0
attention_mask: shape=torch.Size([2, 1208]),         dtype=torch.int64,    device=cuda:0
labels:         shape=torch.Size([2, 1208]),         dtype=torch.int64,    device=cuda:0
pixel_values:   shape=torch.Size([1472, 3, 14, 14]), dtype=torch.bfloat16, device=cuda:0
image_grid_hws: shape=torch.Size([1, 2]),            dtype=torch.int64,    device=cuda:0

===== DEBUG: Input Keys and Shapes =====  Current Step: 0
input_ids:      shape=torch.Size([2, 1096]),         dtype=torch.int64,    device=cuda:1
attention_mask: shape=torch.Size([2, 1096]),         dtype=torch.int64,    device=cuda:1
labels:         shape=torch.Size([2, 1096]),         dtype=torch.int64,    device=cuda:1
pixel_values:   shape=torch.Size([1380, 3, 14, 14]), dtype=torch.bfloat16, device=cuda:1
image_grid_hws: shape=torch.Size([1, 2]),            dtype=torch.int64,    device=cuda:1

===== DEBUG: Input Keys and Shapes =====  Current Step: 0
input_ids:      shape=torch.Size([2, 1280]),         dtype=torch.int64,    device=cuda:7
attention_mask: shape=torch.Size([2, 1280]),         dtype=torch.int64,    device=cuda:7
labels:         shape=torch.Size([2, 1280]),         dtype=torch.int64,    device=cuda:7
pixel_values:   shape=torch.Size([2944, 3, 14, 14]), dtype=torch.bfloat16, device=cuda:7
image_grid_hws: shape=torch.Size([2, 2]),            dtype=torch.int64,    device=cuda:7
```
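For reference, a small helper along the following lines can gather and compare the per-rank shapes (a debugging sketch using `torch.distributed.all_gather_object`, not LLaMA-Factory code):

```python
import torch.distributed as dist

def report_rank_shapes(batch: dict) -> None:
    """Gather each rank's tensor shapes on rank 0 to spot per-rank differences
    (e.g. image_grid_hws / pixel_values) in the step right before the hang."""
    shapes = {k: tuple(v.shape) for k, v in batch.items() if hasattr(v, "shape")}
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, shapes)
    if dist.get_rank() == 0:
        for rank, rank_shapes in enumerate(gathered):
            print(f"rank {rank}: {rank_shapes}")
```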
I don't think it is the root cause. Can you confirm which step raises this issue?
My dataset has samples containing 1 or 2 images. When training under dsz2, it gets stuck. Training machine: 32*A100.
Can you use py-spy to locate the issue? I can't reproduce it with dsz2 locally.
Understood. Attached is the captured log. Kindly advise if additional details are required. I can also provide the complete SVG flame graph from a 60-second profile capture if that would help the analysis.
```
  %Own      %Total    OwnTime   TotalTime  Function (filename)
 3200.00%  3200.00%   3.20s     3.20s      accept (socket.py)
  490.00%   590.00%   0.490s    0.590s     forward (modules/transformers_modules/modeling_kimi_vl.py)
  100.00%   100.00%   0.100s    0.100s     _conv_forward (torch/nn/modules/conv.py)
    0.00%   590.00%   0.000s    0.590s     forward (peft/peft_model.py)
    0.00%   590.00%   0.000s    0.590s     run_sft (llamafactory/train/sft/workflow.py)
    0.00%   590.00%   0.000s    0.590s     _call_impl (torch/nn/modules/module.py)
    0.00%  3200.00%   0.000s    3.20s      _bootstrap_inner (threading.py)
    0.00%   590.00%   0.000s    0.590s     call (accelerate/utils/operations.py)
    0.00%   590.00%   0.000s    0.590s     train (transformers/trainer.py)
    0.00%  3200.00%   0.000s    3.20s      _bootstrap (threading.py)
    0.00%  3200.00%   0.000s    3.20s      _serve (multiprocessing/resource_sharer.py)
    0.00%   590.00%   0.000s    0.590s     _training_function (llamafactory/train/tuner.py)
    0.00%   590.00%   0.000s    0.590s     forward (torch/nn/parallel/distributed.py)
    0.00%   590.00%   0.000s    0.590s     _wrapped_call_impl (torch/nn/modules/module.py)
    0.00%  3200.00%   0.000s    3.20s      accept (multiprocessing/connection.py)
    0.00%   590.00%   0.000s    0.590s     _run_ddp_forward (torch/nn/parallel/distributed.py)
    0.00%   590.00%   0.000s    0.590s     forward (peft/tuners/tuners_utils.py)
    0.00%   100.00%   0.000s    0.100s     forward (torch/nn/modules/conv.py)
    0.00%   590.00%   0.000s    0.590s     _extract_image_features (modules/transformers_modules/modeling_kimi_vl.py)
    0.00%   590.00%   0.000s    0.590s     forward (accelerate/utils/operations.py)
    0.00%  3200.00%   0.000s    3.20s      run (threading.py)
    0.00%   590.00%   0.000s    0.590s     decorate_autocast (torch/amp/autocast_mode.py)
    0.00%   590.00%   0.000s    0.590s     compute_loss (llamafactory/train/sft/trainer.py)
    0.00%   590.00%   0.000s    0.590s     compute_loss (transformers/trainer.py)
    0.00%   590.00%   0.000s    0.590s
```
UPDATE: DSZ2 works well. My earlier setting was wrong.
I hit the same problem: 8*H20, dsz3, DPO training on the rlhf-v dataset. With dsz3, training hangs and GPU utilization stays at 100%.
When I switch back to dsz2, since I am using DPO, loading the ref_model for Kimi-VL then raises an OOM error.
Is there a good solution?
Is this issue still there?
yes
One more question: the dataset setting here is `dataset: mllm_demo, identity`, and those datasets are tiny, only a handful of samples, right?