Yang

Results 9 comments of Yang

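> > ![Image](https://github.com/user-attachments/assets/c488d122-c36b-441f-99bc-ba0c036ba34e) During fine-tuning, the process gets killed at around 1500-2000 iterations. It worked fine with 30k training samples, but this happens after increasing to 60k. Is this a bug? > > Could this be running out of memory? Hello, was this issue ever resolved?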

> Thanks, I'll try that Have you fixed this issue under deepspeed zero3 mode? Please share some experience if possible. Much appreciated!

> Sorry for the late reply, I have reproduced this issue. It is a common issue when using dsz3 for a moe model, for example, [deepspeedai/DeepSpeed#5066](https://github.com/deepspeedai/DeepSpeed/issues/5066). > > To avoid...

> > > Sorry for the late reply, I have reproduced this issue. It is a common issue when using dsz3 for a moe model, for example, [deepspeedai/DeepSpeed#5066](https://github.com/deepspeedai/DeepSpeed/issues/5066). > >...

> > My dataset has samples containing 1/2 images. When training under dsz2, it gets stuck. Training machine: 32*A100 > > Can you use `py-spy` to locate the issue? I...

> > > > Sorry for the late reply, I have reproduced this issue. It is a common issue when using dsz3 for a moe model, for example, [deepspeedai/DeepSpeed#5066](https://github.com/deepspeedai/DeepSpeed/issues/5066). >...

To evaluate the effectiveness of this change, we conducted four comparative experiments:

1. **Synchronous training** using the original training engine's logits.
2. **Asynchronous training** using the original training engine's logits....

> The async LLM for the agentic RL also needs to support vLLM logits. Updated the corresponding training code. Local debugging shows no errors currently. Please review when you have...