sile
@suri-kunal Hi, how did you fix the bug? Could you tell me?
@silverriver Hi, when will your image-download script be ready?
I ran into the same error (RuntimeError: still have inflight params). When I use DeepSpeed to train RL on 8 × V100, the bug still occurs. I also...
> Hello @iamsile @vittorio-perera. Could you provide a reproduction script for us to better investigate this issue? Thank you. @HeyangQin This is a full record: https://github.com/microsoft/DeepSpeed/issues/4175. ~~I used HeyangQin/fix_issue_3156 to fix...
> > Hello @iamsile @vittorio-perera. Could you provide a reproduction script for us to better investigate this issue? Thank you > > @HeyangQin This is a full record:#4175. ~I used...
@MAJIN123 Hello, I also hit an OOM error when running step 3 on V100s. How did you end up solving it? I have already turned every setting I can down to the minimum.
@Sleepychord Hello, I may not have been clear. What I mean is: in torch_image.to(self.dtype).to(self.device), what is self.dtype at that point, float16 or float32? Could you print it and take a look?
@Sleepychord Hello, I ran into an error while training a reward model based on visualglm-6b, using the deepspeed_chat framework. The error is as follows: Beginning of Epoch 1/1, Total Micro Batches 30502 Traceback (most recent call last): File "/xxxxx/deepspeed_chat/training/step2_reward_model_finetuning/main.py", line 472, in main() File "/xxxxx/deepspeed_chat/training/step2_reward_model_finetuning/main.py", line 393, in main outputs =...