
Training hangs when fine-tuning Kimi-VL with multi-image data

Open yyhycx opened this issue 5 months ago • 19 comments

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.4.dev0
  • Platform: Linux-5.10.134-008.16.kangaroo.al8.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.18
  • PyTorch version: 2.6.0+cu124 (GPU)
  • Transformers version: 4.52.4
  • Datasets version: 3.6.0
  • Accelerate version: 1.7.0
  • PEFT version: 0.15.2
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800-SXM4-80GB
  • GPU number: 6
  • GPU memory: 79.35GB
  • DeepSpeed version: 0.16.9
  • Git commit: ff415d9998180b5a68bbfdda3309ec04b472fb49
  • Default data directory: not detected

Reproduction

When fine-tuning with the multi-image data in the mllm demo, training hangs and GPU utilization stays at 100%; after removing the multi-image data, training runs normally.

Others

No response

yyhycx avatar Jul 25 '25 10:07 yyhycx

Can you share your training scripts? I remember that we have tested this model on the mllm_demo dataset.

Kuangdd01 avatar Jul 26 '25 06:07 Kuangdd01

@Kuangdd01 Thanks for your reply, this is the YAML I used:

### model
model_name_or_path: /mnt/workspace/yangyunhao/Kimi-VL-A3B-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: true
freeze_language_model: false
deepspeed: /mnt/workspace/yangyunhao/LLaMA-Factory-main/examples/deepspeed/ds_z3_config.json # 8xh20 gpu

### dataset
dataset: mllm_demo, identity
template: kimi_vl
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 1
dataloader_num_workers: 4

### output
output_dir: saves/kimi-vl/full
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

I used ZeRO-3. I've successfully trained on the alpaca demo, identity, and llava datasets, but I run into an issue where GPU utilization hits 100% and training gets stuck when using the multi-image data in the mllm demo.

yyhycx avatar Jul 28 '25 02:07 yyhycx

Sorry for the late reply, I have reproduced this issue. It is a common issue when using dsz3 for a MoE model; see, for example, https://github.com/deepspeedai/DeepSpeed/issues/5066.

To avoid the gradient disagreement issue when using dsz3, I suggest you use dsz2, or try to feed some fake input (with gradient) to some experts, like this.
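A minimal sketch of the fake-input idea, assuming a DeepSeek-style per-expert routing loop (the helper below is hypothetical and only illustrates the trick; it is not LLaMA-Factory or model code):

import torch
from torch import nn

def route_with_dummy_experts(experts: nn.ModuleList,
                             hidden_states: torch.Tensor,
                             flat_topk_idx: torch.Tensor) -> torch.Tensor:
    # Dispatch tokens to their selected experts; experts that receive no tokens
    # on this rank are still run on a dummy token whose contribution is scaled
    # by zero, so every expert joins the autograd graph on every rank and the
    # gradient reduction under ZeRO stays in sync.
    y = torch.zeros_like(hidden_states)
    for i, expert in enumerate(experts):
        mask = flat_topk_idx == i
        if mask.any():
            y[mask] = expert(hidden_states[mask]).to(y.dtype)
        else:
            # contributes exactly 0 to the output but still creates a
            # gradient path through this expert's parameters
            y = y + expert(hidden_states[0:1]).sum() * 0.0
    return y

The weighting by topk_weight and the shared-experts branch are left out here; only the per-expert dispatch changes.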

Kuangdd01 avatar Jul 28 '25 08:07 Kuangdd01

Thanks, I'll try that

yyhycx avatar Jul 29 '25 02:07 yyhycx

Thanks, I'll try that

Have you fixed this issue under DeepSpeed ZeRO-3 mode? Please share your experience if possible. Much appreciated!

MooMoo-Yang avatar Aug 03 '25 07:08 MooMoo-Yang

Sorry for the late reply, I have reproduced this issue. It is a common issue when using dsz3 for a MoE model; see, for example, deepspeedai/DeepSpeed#5066.

To avoid the gradient disagreement issue when using dsz3, I suggest you use dsz2, or try to feed some fake input (with gradient) to some experts, like this.

dsz2 seems to have the same problem.

MooMoo-Yang avatar Aug 04 '25 15:08 MooMoo-Yang

Sorry for the late reply, I have reproduced this issue. It is a common issue when using dsz3 for a MoE model; see, for example, deepspeedai/DeepSpeed#5066. To avoid the gradient disagreement issue when using dsz3, I suggest you use dsz2, or try to feed some fake input (with gradient) to some experts, like this.

dsz2 seems to have the same problem.

dsz2 can work, but I hit OOM on 8xA800. I tried using fake gradients to work around the dsz3 problem, but training still gets stuck when using mixed image and text data.

yyhycx avatar Aug 07 '25 08:08 yyhycx

Could you please share your method for feeding the fake gradients when using dsz3? BTW, we have added fake images into the pure text batch. It still gets stuck in gradient_sync? 😭 @yyhycx

Kuangdd01 avatar Aug 07 '25 08:08 Kuangdd01

@Kuangdd01 This is the DeepseekV3MoE code I modified; I'm not sure whether it is correct:

# excerpt from the model's remote-code modeling file; DeepseekV3MLP, MoEGate
# and AddAuxiliaryLoss are defined in the same file
import torch
import torch.distributed as dist
from torch import nn


class DeepseekV3MoE(nn.Module):
    """
    A mixed expert module containing shared experts.
    """

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.num_experts_per_tok = config.num_experts_per_tok

        if hasattr(config, "ep_size") and config.ep_size > 1:
            assert config.ep_size == dist.get_world_size()
            self.ep_size = config.ep_size
            self.experts_per_rank = config.n_routed_experts // config.ep_size
            self.ep_rank = dist.get_rank()
            self.experts = nn.ModuleList(
                [
                    (
                        DeepseekV3MLP(
                            config, intermediate_size=config.moe_intermediate_size
                        )
                        if i >= self.ep_rank * self.experts_per_rank
                        and i < (self.ep_rank + 1) * self.experts_per_rank
                        else None
                    )
                    for i in range(config.n_routed_experts)
                ]
            )
        else:
            self.ep_size = 1
            self.experts_per_rank = config.n_routed_experts
            self.ep_rank = 0
            self.experts = nn.ModuleList(
                [
                    DeepseekV3MLP(
                        config, intermediate_size=config.moe_intermediate_size
                    )
                    for i in range(config.n_routed_experts)
                ]
            )
        self.gate = MoEGate(config)
        if config.n_shared_experts is not None:
            intermediate_size = config.moe_intermediate_size * config.n_shared_experts
            self.shared_experts = DeepseekV3MLP(
                config=config, intermediate_size=intermediate_size
            )

    def forward(self, hidden_states):
        identity = hidden_states
        orig_shape = hidden_states.shape
        topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        if self.training:
            flat_topk_idx = topk_idx.view(-1)
            hidden_states = hidden_states.repeat_interleave(
                self.num_experts_per_tok, dim=0
            )
            # y = torch.empty_like(hidden_states)
            y = torch.zeros_like(hidden_states)
            for i, expert in enumerate(self.experts):
                mask = flat_topk_idx == i
                token_indices = mask.nonzero(as_tuple=True)[0]
                if token_indices.numel() == 0:
                    # no token routed to this expert on this rank: run it on a
                    # zeroed dummy token so its parameters still join the
                    # autograd graph and gradient sync stays aligned across ranks
                    virtual_input = hidden_states[0:1]
                    fake_output = expert(virtual_input * 0)
                    y[0:1] += fake_output
                else:
                    y[mask] = expert(hidden_states[mask])

            y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)
            y = y.to(hidden_states.dtype).view(*orig_shape)
            if aux_loss is not None:
                y = AddAuxiliaryLoss.apply(y, aux_loss)
        else:
            y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        if self.config.n_shared_experts is not None:
            y = y + self.shared_experts(identity)
        return y

Training proceeds normally when using only image data or only text data, but it gets stuck when using both. The forward pass appears to be normal.

yyhycx avatar Aug 07 '25 09:08 yyhycx

@Kuangdd01 I found that in the batch that causes the hang, image_grid_hws differs across ranks. Could this be the problem?

===== DEBUG: Input Keys and Shapes ===== Current Step: 0
input_ids: shape=torch.Size([2, 1208]), dtype=torch.int64, device=cuda:0
attention_mask: shape=torch.Size([2, 1208]), dtype=torch.int64, device=cuda:0
labels: shape=torch.Size([2, 1208]), dtype=torch.int64, device=cuda:0
pixel_values: shape=torch.Size([1472, 3, 14, 14]), dtype=torch.bfloat16, device=cuda:0
image_grid_hws: shape=torch.Size([1, 2]), dtype=torch.int64, device=cuda:0

===== DEBUG: Input Keys and Shapes ===== Current Step: 0
input_ids: shape=torch.Size([2, 1096]), dtype=torch.int64, device=cuda:1
attention_mask: shape=torch.Size([2, 1096]), dtype=torch.int64, device=cuda:1
labels: shape=torch.Size([2, 1096]), dtype=torch.int64, device=cuda:1
pixel_values: shape=torch.Size([1380, 3, 14, 14]), dtype=torch.bfloat16, device=cuda:1
image_grid_hws: shape=torch.Size([1, 2]), dtype=torch.int64, device=cuda:1

===== DEBUG: Input Keys and Shapes ===== Current Step: 0
input_ids: shape=torch.Size([2, 1280]), dtype=torch.int64, device=cuda:7
attention_mask: shape=torch.Size([2, 1280]), dtype=torch.int64, device=cuda:7
labels: shape=torch.Size([2, 1280]), dtype=torch.int64, device=cuda:7
pixel_values: shape=torch.Size([2944, 3, 14, 14]), dtype=torch.bfloat16, device=cuda:7
image_grid_hws: shape=torch.Size([2, 2]), dtype=torch.int64, device=cuda:7
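A quick way to compare these shapes in one place is to gather them across ranks before the forward pass (a debugging sketch that assumes torch.distributed has already been initialized by the trainer; the helper name is made up):

import torch.distributed as dist

def log_cross_rank_shapes(batch: dict) -> None:
    # collect {key: shape} from every rank and print them on rank 0,
    # so mismatches such as differing image_grid_hws are easy to spot
    shapes = {k: tuple(v.shape) for k, v in batch.items() if hasattr(v, "shape")}
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, shapes)
    if dist.get_rank() == 0:
        for rank, rank_shapes in enumerate(gathered):
            print(f"rank {rank}: {rank_shapes}")

Per-rank differences in sequence length and pixel_values are expected with variable-resolution multi-image batches; the check mainly helps confirm which batch every rank was holding when the hang started.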

yyhycx avatar Aug 07 '25 12:08 yyhycx

I don't think that is the root cause. Can you confirm at which step this issue occurs?

Kuangdd01 avatar Aug 08 '25 08:08 Kuangdd01

Sorry for the late reply, I have reproduced this issue. It is a common issue when using dsz3 for a MoE model; see, for example, deepspeedai/DeepSpeed#5066. To avoid the gradient disagreement issue when using dsz3, I suggest you use dsz2, or try to feed some fake input (with gradient) to some experts, like this.

dsz2 seems to have the same problem.

dsz2 can work, but I hit OOM on 8xA800. I tried using fake gradients to work around the dsz3 problem, but training still gets stuck when using mixed image and text data.

My dataset has samples containing 1 or 2 images. When training under dsz2, it gets stuck. Training machine: 32×A100.

MooMoo-Yang avatar Aug 08 '25 10:08 MooMoo-Yang

My dataset has samples containing 1 or 2 images. When training under dsz2, it gets stuck. Training machine: 32×A100.

Can you use py-spy to locate the issue? I can't reproduce it with dsz2 locally.
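For reference, a typical way to capture this from the stuck process (the PID below is a placeholder for the hung training rank):

py-spy dump --pid <training_pid>                                  # print the current Python stack once
py-spy record --pid <training_pid> --duration 60 -o profile.svg   # record a 60-second flame graph

Running py-spy dump on every rank usually shows which call (often a collective) the hanging ranks are waiting in.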

Kuangdd01 avatar Aug 09 '25 14:08 Kuangdd01

My dataset has samples containing 1 or 2 images. When training under dsz2, it gets stuck. Training machine: 32×A100.

Can you use py-spy to locate the issue? I can't reproduce it with dsz2 locally.

Understood. Attached is the captured log. Kindly advise if additional details are required, much appreciated. I can also provide the complete SVG flame graph from a 60-second profile capture if that would help with the analysis.

%Own      %Total    OwnTime  TotalTime  Function (filename)
3200.00%  3200.00%  3.20s    3.20s      accept (socket.py)
490.00%   590.00%   0.490s   0.590s     forward (modules/transformers_modules/modeling_kimi_vl.py)
100.00%   100.00%   0.100s   0.100s     _conv_forward (torch/nn/modules/conv.py)
0.00%     590.00%   0.000s   0.590s     forward (peft/peft_model.py)
0.00%     590.00%   0.000s   0.590s     run_sft (llamafactory/train/sft/workflow.py)
0.00%     590.00%   0.000s   0.590s     _call_impl (torch/nn/modules/module.py)
0.00%     3200.00%  0.000s   3.20s      _bootstrap_inner (threading.py)
0.00%     590.00%   0.000s   0.590s     call (accelerate/utils/operations.py)
0.00%     590.00%   0.000s   0.590s     train (transformers/trainer.py)
0.00%     3200.00%  0.000s   3.20s      _bootstrap (threading.py)
0.00%     3200.00%  0.000s   3.20s      _serve (multiprocessing/resource_sharer.py)
0.00%     590.00%   0.000s   0.590s     _training_function (llamafactory/train/tuner.py)
0.00%     590.00%   0.000s   0.590s     forward (torch/nn/parallel/distributed.py)
0.00%     590.00%   0.000s   0.590s     _wrapped_call_impl (torch/nn/modules/module.py)
0.00%     3200.00%  0.000s   3.20s      accept (multiprocessing/connection.py)
0.00%     590.00%   0.000s   0.590s     _run_ddp_forward (torch/nn/parallel/distributed.py)
0.00%     590.00%   0.000s   0.590s     forward (peft/tuners/tuners_utils.py)
0.00%     100.00%   0.000s   0.100s     forward (torch/nn/modules/conv.py)
0.00%     590.00%   0.000s   0.590s     _extract_image_features (modules/transformers_modules/modeling_kimi_vl.py)
0.00%     590.00%   0.000s   0.590s     forward (accelerate/utils/operations.py)
0.00%     3200.00%  0.000s   3.20s      run (threading.py)
0.00%     590.00%   0.000s   0.590s     decorate_autocast (torch/amp/autocast_mode.py)
0.00%     590.00%   0.000s   0.590s     compute_loss (llamafactory/train/sft/trainer.py)
0.00%     590.00%   0.000s   0.590s     compute_loss (transformers/trainer.py)
0.00%     590.00%   0.000s   0.590s     (train.py)
0.00%     590.00%   0.000s   0.590s     run_exp (llamafactory/train/tuner.py)
0.00%     590.00%   0.000s   0.590s     main (train.py)
0.00%     590.00%   0.000s   0.590s     training_step (transformers/trainer.py)
0.00%     590.00%   0.000s   0.590s     _inner_training_loop (transformers/trainer.py)

MooMoo-Yang avatar Aug 11 '25 02:08 MooMoo-Yang

Sorry for the late reply, I have reproduced this issue. It is a common issue when using dsz3 for a MoE model; see, for example, deepspeedai/DeepSpeed#5066. To avoid the gradient disagreement issue when using dsz3, I suggest you use dsz2, or try to feed some fake input (with gradient) to some experts, like this.

dsz2 seems to have the same problem.

dsz2 can work, but I hit OOM on 8xA800. I tried using fake gradients to work around the dsz3 problem, but training still gets stuck when using mixed image and text data.

My dataset has samples containing 1 or 2 images. When training under dsz2, it gets stuck. Training machine: 32×A100.

UPDATE: dsz2 works well; the earlier hang was caused by a configuration error on my side.

MooMoo-Yang avatar Aug 14 '25 10:08 MooMoo-Yang

I ran into the same problem: 8×H20, dsz3, DPO training on the rlhf-v dataset. With dsz3, training hangs and GPU utilization stays at 100%.

When I switch back to dsz2, since DPO also loads a ref_model alongside Kimi-VL, I get an OOM error while loading the ref_model.

Is there a good solution for this?

zzfoutofspace avatar Aug 26 '25 13:08 zzfoutofspace

Is this issue still there?

mertunsall avatar Sep 15 '25 01:09 mertunsall

Is this issue still there?

yes

MooMoo-Yang avatar Sep 15 '25 09:09 MooMoo-Yang

One question: the datasets here, dataset: mllm_demo, identity, are very small, only a handful of samples, aren't they??

sunnysky29 avatar Nov 07 '25 03:11 sunnysky29