Yang comments

Strange out-of-memory error

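> > ![Image](https://github.com/user-attachments/assets/c488d122-c36b-441f-99bc-ba0c036ba34e) I get `killed` after roughly 1500-2000 fine-tuning iterations. With 30k training samples there was no problem, but after increasing to 60k this happens. Is this a bug? > > Could it be that memory ran out? Hello, was this issue ever resolved?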
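
One way to test the memory hypothesis raised in this thread is to log the training process's peak resident memory as it runs. A bare `killed` message is often what the Linux OOM killer leaves behind, but that is an assumption here and the thread does not confirm it. A minimal sketch using only the Python standard library (Unix only); the helper names `peak_rss_mib` and `log_memory` and the logging interval are hypothetical, not part of any training framework:

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(step: int, every: int = 100) -> None:
    """Print peak host memory every `every` steps (hypothetical hook)."""
    if step % every == 0:
        print(f"step {step}: peak RSS {peak_rss_mib():.1f} MiB")
```

Calling `log_memory(step)` once per training step would show whether host memory keeps climbing before the process dies. Note that this measures only the calling process, so memory held by data-loader worker processes is not included.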

Training hangs when fine-tuning kimi-vl on multi-image data

> Thanks, I'll try that Have you fixed this issue under deepspeed zero3 mode? Please share some experience if possible. Much appreciated!

> Sorry for the late reply, I have reproduced this issue. It is a common issue when using dsz3 for a moe model, for example, [deepspeedai/DeepSpeed#5066](https://github.com/deepspeedai/DeepSpeed/issues/5066). > > To avoid...

Training hangs when fine-tuning kimi-vl on multi-image data

> > > Sorry for the late reply, I have reproduced this issue. It is a common issue when using dsz3 for a moe model, for example, [deepspeedai/DeepSpeed#5066](https://github.com/deepspeedai/DeepSpeed/issues/5066). > >...

Training hangs when fine-tuning kimi-vl on multi-image data

> > My dataset has samples containing 1/2 images. When training under dsz2, it gets stuck. Training machine: 32*A100 > > Can you use `py-spy` to locate the issue? I...
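
For context, `py-spy dump --pid <PID>` prints the Python stack of a running process from the outside. Where attaching to the process is not possible, the standard library's `faulthandler` module is one alternative; this is a swapped-in technique, not what the thread used. A minimal sketch, in which the helper names `arm_watchdog` and `enable_signal_dump` and the 600-second default are hypothetical:

```python
import faulthandler
import signal
import sys


def arm_watchdog(timeout_s: float = 600.0, out=sys.stderr) -> None:
    """(Re)start a timer that dumps every thread's stack to `out` unless it
    is re-armed or cancelled within timeout_s seconds.

    Calling this once per training step turns it into a simple hang
    detector: a new call replaces the previous timer.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=out, exit=False)


def enable_signal_dump(out=sys.stderr) -> None:
    """Dump all thread stacks on `kill -USR1 <pid>` (Unix only)."""
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=out, all_threads=True)
```

On a multi-GPU job each rank would need its own call, and comparing the dumps across ranks may show which rank is waiting at a different point from the others.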

Training hangs when fine-tuning kimi-vl on multi-image data

> > > > Sorry for the late reply, I have reproduced this issue. It is a common issue when using dsz3 for a moe model, for example, [deepspeedai/DeepSpeed#5066](https://github.com/deepspeedai/DeepSpeed/issues/5066). >...

Training hangs when fine-tuning kimi-vl on multi-image data

> Is this issue still there? yes

Enable logits extraction from vLLM for training

To evaluate the effectiveness of this change, we conducted four comparative experiments: 1. **Synchronous training** using the original training engine's logits. 2. **Asynchronous training** using the original training engine's logits....

Enable logits extraction from vLLM for training

> The async LLM for the agentic RL also needs to support vLLM logits. Updated the corresponding training code. Local debugging shows no errors currently. Please review when you have...