Training hangs when fine-tuning Kimi-VL with multi-image data
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
- `llamafactory` version: 0.9.4.dev0
- Platform: Linux-5.10.134-008.16.kangaroo.al8.x86_64-x86_64-with-glibc2.35
- Python version: 3.10.18
- PyTorch version: 2.6.0+cu124 (GPU)
- Transformers version: 4.52.4
- Datasets version: 3.6.0
- Accelerate version: 1.7.0
- PEFT version: 0.15.2
- TRL version: 0.9.6
- GPU type: NVIDIA A800-SXM4-80GB
- GPU number: 6
- GPU memory: 79.35GB
- DeepSpeed version: 0.16.9
- Git commit: ff415d9998180b5a68bbfdda3309ec04b472fb49
- Default data directory: not detected
Reproduction
When fine-tuning with the multi-image samples from the mllm demo, training hangs with GPU utilization stuck at 100%; after removing the multi-image samples, training runs normally.
Others
No response
Can you share your training scripts? I remember that we have tested this model on the mllm_demo dataset.
@Kuangdd01 Thanks for your reply, this is the YAML I used:
```yaml
### model
model_name_or_path: /mnt/workspace/yangyunhao/Kimi-VL-A3B-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: true
freeze_language_model: false
deepspeed: /mnt/workspace/yangyunhao/LLaMA-Factory-main/examples/deepspeed/ds_z3_config.json  # 8xh20 gpu

### dataset
dataset: mllm_demo, identity
template: kimi_vl
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 1
dataloader_num_workers: 4

### output
output_dir: saves/kimi-vl/full
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
```
I used ZeRO-3. I have successfully tested training on the alpaca demo, identity, and llava datasets, but when using the multi-image data in the mllm demo, GPU utilization hits 100% and training gets stuck.
Sorry for the late reply, I have reproduced this issue. It is a common issue when using dsz3 with a MoE model, see for example https://github.com/deepspeedai/DeepSpeed/issues/5066.
To avoid the gradient-disagreement issue under dsz3, I suggest you use dsz2, or try feeding some fake input (with gradient) to the unused experts, like this.
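The core of that trick, as a minimal self-contained toy sketch (the `ToyMoE`/`router` names below are illustrative, not the Kimi-VL or DeepSeek code): run every expert on every rank even when the router assigns it no tokens, so all ranks build the same backward graph.

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Toy top-1 MoE layer that only demonstrates the fake-input trick."""

    def __init__(self, dim: int = 8, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim, bias=False) for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts, bias=False)

    def forward(self, x):                      # x: (tokens, dim)
        top1 = self.router(x).argmax(dim=-1)   # (tokens,)
        y = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                y[mask] = expert(x[mask])
            else:
                # zeroed dummy input: adds exactly zero for a bias-free layer,
                # but the unused expert still enters the autograd graph
                y[0:1] = y[0:1] + expert(x[0:1] * 0.0)
        return y

moe = ToyMoE()
moe(torch.randn(5, 8)).sum().backward()  # every expert now has a .grad tensor
```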
Thanks, I'll try that
Have you fixed this issue under deepspeed zero3 mode? Please share some experience if possible. Much appreciated!
dsz2 seems to have the same problem.
dsz2 can work, but I hit OOM on 8xA800. I tried using fake gradients to solve the dsz3 problem, but it still gets stuck when using mixed image and text data.
Could you please share your method for feeding the fake gradients when using dsz3?
BTW, we have added fake images into the pure text batch. It still gets stuck in gradient_sync? 😭 @yyhycx
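Roughly, the collator-side idea looks like the sketch below (illustrative only; `add_dummy_image` and the exact tensor shapes are placeholders, not the actual LLaMA-Factory code):

```python
import torch

def add_dummy_image(batch: dict, dtype=torch.bfloat16) -> dict:
    """If a batch carries no images, attach a tiny all-zero image so the vision
    tower still runs (and its parameters enter the backward graph) on this rank."""
    if batch.get("pixel_values") is None:
        # a single 2x2 grid of 14x14 patches, following the (patches, C, H, W)
        # pixel_values / image_grid_hws layout seen in the debug logs below
        batch["pixel_values"] = torch.zeros(4, 3, 14, 14, dtype=dtype)
        batch["image_grid_hws"] = torch.tensor([[2, 2]], dtype=torch.long)
        # the model side must then keep these dummy features out of the language
        # stream (e.g. scale them by zero), since input_ids has no image tokens
    return batch
```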
@Kuangdd01 This is the DeepseekV3MoE code I modified; I am not sure whether it is correct:
```python
class DeepseekV3MoE(nn.Module):
    """
    A mixed expert module containing shared experts.
    """

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.num_experts_per_tok = config.num_experts_per_tok
        if hasattr(config, "ep_size") and config.ep_size > 1:
            assert config.ep_size == dist.get_world_size()
            self.ep_size = config.ep_size
            self.experts_per_rank = config.n_routed_experts // config.ep_size
            self.ep_rank = dist.get_rank()
            self.experts = nn.ModuleList(
                [
                    (
                        DeepseekV3MLP(
                            config, intermediate_size=config.moe_intermediate_size
                        )
                        if i >= self.ep_rank * self.experts_per_rank
                        and i < (self.ep_rank + 1) * self.experts_per_rank
                        else None
                    )
                    for i in range(config.n_routed_experts)
                ]
            )
        else:
            self.ep_size = 1
            self.experts_per_rank = config.n_routed_experts
            self.ep_rank = 0
            self.experts = nn.ModuleList(
                [
                    DeepseekV3MLP(
                        config, intermediate_size=config.moe_intermediate_size
                    )
                    for i in range(config.n_routed_experts)
                ]
            )
        self.gate = MoEGate(config)
        if config.n_shared_experts is not None:
            intermediate_size = config.moe_intermediate_size * config.n_shared_experts
            self.shared_experts = DeepseekV3MLP(
                config=config, intermediate_size=intermediate_size
            )

    def forward(self, hidden_states):
        identity = hidden_states
        orig_shape = hidden_states.shape
        topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        if self.training:
            flat_topk_idx = topk_idx.view(-1)
            hidden_states = hidden_states.repeat_interleave(
                self.num_experts_per_tok, dim=0
            )
            # y = torch.empty_like(hidden_states)
            # use zeros so the fake-expert branch below can safely add into y
            y = torch.zeros_like(hidden_states)
            for i, expert in enumerate(self.experts):
                mask = flat_topk_idx == i
                token_indices = mask.nonzero(as_tuple=True)[0]
                if token_indices.numel() == 0:
                    # no token routed to this expert on this rank: run it on a
                    # zeroed dummy token so it still enters the backward graph
                    # and gradient sync stays consistent across ranks
                    virtual_input = hidden_states[0:1]
                    fake_output = expert(virtual_input * 0)
                    y[0:1] += fake_output  # adds zeros, changes nothing numerically
                else:
                    y[mask] = expert(hidden_states[mask])
            y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)
            y = y.to(hidden_states.dtype).view(*orig_shape)
            if aux_loss is not None:
                y = AddAuxiliaryLoss.apply(y, aux_loss)
        else:
            y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        if self.config.n_shared_experts is not None:
            y = y + self.shared_experts(identity)
        return y
```
Training proceeds normally when using only image data or only text data, but it gets stuck when the two are mixed. The forward phase appears to be normal.
@Kuangdd01 I found that in the batch that caused the hang, image_grid_hws differed across ranks. Could this be the problem?
```
===== DEBUG: Input Keys and Shapes =====  Current Step: 0
input_ids:      shape=torch.Size([2, 1208]),         dtype=torch.int64,    device=cuda:0
attention_mask: shape=torch.Size([2, 1208]),         dtype=torch.int64,    device=cuda:0
labels:         shape=torch.Size([2, 1208]),         dtype=torch.int64,    device=cuda:0
pixel_values:   shape=torch.Size([1472, 3, 14, 14]), dtype=torch.bfloat16, device=cuda:0
image_grid_hws: shape=torch.Size([1, 2]),            dtype=torch.int64,    device=cuda:0

===== DEBUG: Input Keys and Shapes =====  Current Step: 0
input_ids:      shape=torch.Size([2, 1096]),         dtype=torch.int64,    device=cuda:1
attention_mask: shape=torch.Size([2, 1096]),         dtype=torch.int64,    device=cuda:1
labels:         shape=torch.Size([2, 1096]),         dtype=torch.int64,    device=cuda:1
pixel_values:   shape=torch.Size([1380, 3, 14, 14]), dtype=torch.bfloat16, device=cuda:1
image_grid_hws: shape=torch.Size([1, 2]),            dtype=torch.int64,    device=cuda:1

===== DEBUG: Input Keys and Shapes =====  Current Step: 0
input_ids:      shape=torch.Size([2, 1280]),         dtype=torch.int64,    device=cuda:7
attention_mask: shape=torch.Size([2, 1280]),         dtype=torch.int64,    device=cuda:7
labels:         shape=torch.Size([2, 1280]),         dtype=torch.int64,    device=cuda:7
pixel_values:   shape=torch.Size([2944, 3, 14, 14]), dtype=torch.bfloat16, device=cuda:7
image_grid_hws: shape=torch.Size([2, 2]),            dtype=torch.int64,    device=cuda:7
```
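For reference, a small helper along the following lines can gather and compare the per-rank shapes (a debugging sketch using `torch.distributed.all_gather_object`, not LLaMA-Factory code):

```python
import torch.distributed as dist

def report_rank_shapes(batch: dict) -> None:
    """Gather each rank's tensor shapes on rank 0 to spot per-rank differences
    (e.g. image_grid_hws / pixel_values) in the step right before the hang."""
    shapes = {k: tuple(v.shape) for k, v in batch.items() if hasattr(v, "shape")}
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, shapes)
    if dist.get_rank() == 0:
        for rank, rank_shapes in enumerate(gathered):
            print(f"rank {rank}: {rank_shapes}")
```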
I don't think it is the root cause. Can you confirm which step raises this issue?
My dataset has samples containing 1 or 2 images. When training under dsz2, it gets stuck. Training machine: 32*A100.
Can you use py-spy to locate the issue? I can't reproduce it with dsz2 locally.
Understood. Attached is the captured log. Kindly advise if additional details are required. I can also provide the complete SVG flame graph from a 60-second profile capture if that would help the analysis.
```
  %Own      %Total    OwnTime   TotalTime  Function (filename)
 3200.00%  3200.00%   3.20s     3.20s      accept (socket.py)
  490.00%   590.00%   0.490s    0.590s     forward (modules/transformers_modules/modeling_kimi_vl.py)
  100.00%   100.00%   0.100s    0.100s     _conv_forward (torch/nn/modules/conv.py)
    0.00%   590.00%   0.000s    0.590s     forward (peft/peft_model.py)
    0.00%   590.00%   0.000s    0.590s     run_sft (llamafactory/train/sft/workflow.py)
    0.00%   590.00%   0.000s    0.590s     _call_impl (torch/nn/modules/module.py)
    0.00%  3200.00%   0.000s    3.20s      _bootstrap_inner (threading.py)
    0.00%   590.00%   0.000s    0.590s     call (accelerate/utils/operations.py)
    0.00%   590.00%   0.000s    0.590s     train (transformers/trainer.py)
    0.00%  3200.00%   0.000s    3.20s      _bootstrap (threading.py)
    0.00%  3200.00%   0.000s    3.20s      _serve (multiprocessing/resource_sharer.py)
    0.00%   590.00%   0.000s    0.590s     _training_function (llamafactory/train/tuner.py)
    0.00%   590.00%   0.000s    0.590s     forward (torch/nn/parallel/distributed.py)
    0.00%   590.00%   0.000s    0.590s     _wrapped_call_impl (torch/nn/modules/module.py)
    0.00%  3200.00%   0.000s    3.20s      accept (multiprocessing/connection.py)
    0.00%   590.00%   0.000s    0.590s     _run_ddp_forward (torch/nn/parallel/distributed.py)
    0.00%   590.00%   0.000s    0.590s     forward (peft/tuners/tuners_utils.py)
    0.00%   100.00%   0.000s    0.100s     forward (torch/nn/modules/conv.py)
    0.00%   590.00%   0.000s    0.590s     _extract_image_features (modules/transformers_modules/modeling_kimi_vl.py)
    0.00%   590.00%   0.000s    0.590s     forward (accelerate/utils/operations.py)
    0.00%  3200.00%   0.000s    3.20s      run (threading.py)
    0.00%   590.00%   0.000s    0.590s     decorate_autocast (torch/amp/autocast_mode.py)
    0.00%   590.00%   0.000s    0.590s     compute_loss (llamafactory/train/sft/trainer.py)
    0.00%   590.00%   0.000s    0.590s     compute_loss (transformers/trainer.py)
    0.00%   590.00%   0.000s    0.590s
```
UPDATE: DSZ2 works well. My earlier setting was wrong.
I hit the same problem: 8*H20, dsz3, DPO training on the rlhf-v dataset. With dsz3, training hangs and GPU utilization stays at 100%.
When I switch back to dsz2, since I am using DPO, loading the ref_model for Kimi-VL then raises an OOM error.
Is there a good solution?
Is this issue still there?
yes
One more question: the dataset setting here is `dataset: mllm_demo, identity`, and those datasets are tiny, only a handful of samples, right?