sile comments

Results: 8 comments by sile

[BUG] ZeRO3 - Getting assert len(self.ckpt_list) > 0 while running validation code during fine tuning

@suri-kunal Hi, how did you fix the bug? Can you tell me?

The URL links are dead. Does anyone have the downloaded images?

@silverriver When will your image download script be ready?

[BUG] RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:

I ran into the same error (RuntimeError: still have inflight params). When I use DeepSpeed to train RL on 8 V100 GPUs, the bug still occurs. I also...

[BUG] RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:

> Hello @iamsile @vittorio-perera. Could you provide a reproduction script for us to better investigate this issue? Thank you @HeyangQin This is a full record: https://github.com/microsoft/DeepSpeed/issues/4175. ~~I used HeyangQin/fix_issue_3156 to fix...

[BUG] RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:

> > Hello @iamsile @vittorio-perera. Could you provide a reproduction script for us to better investigate this issue? Thank you > > @HeyangQin This is a full record: #4175. ~I used...

step 3 : OOM

@MAJIN123 Hello, I also hit OOM when running step 3 on a V100. How did you solve it in the end? I have already turned every adjustable setting down to its minimum.

What exactly does self.dtype refer to in modeling_chatglm.py?

@Sleepychord Hello, I may not have expressed myself clearly. What I mean is: in torch_image.to(self.dtype).to(self.device), what is self.dtype at that point, float16 or float32? Could you print it and check?

What exactly does self.dtype refer to in modeling_chatglm.py?

@Sleepychord Hello, I hit an error while training a reward model based on visualglm-6b, using the deepspeed_chat framework. The error is as follows: Beginning of Epoch 1/1, Total Micro Batches 30502 Traceback (most recent call last): File "/xxxxx/deepspeed_chat/training/step2_reward_model_finetuning/main.py", line 472, in main() File "/xxxxx/deepspeed_chat/training/step2_reward_model_finetuning/main.py", line 393, in main outputs =...