ldh127

Results: 8 comments of ldh127

Yes, but I think this code is for the ds_to_universal model params, not for merging multiple optim files into one file; can it process merging the DeepSpeed multi-GPU optim files into...
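
For context, a minimal sketch of the model-param consolidation being discussed, assuming a standard DeepSpeed ZeRO checkpoint layout (the paths are hypothetical): it merges the sharded model weights into one fp32 state dict, which is distinct from merging the per-rank optimizer files.

```python
# Minimal sketch, assuming a standard DeepSpeed ZeRO checkpoint directory.
# This consolidates the sharded *model* parameters into a single fp32 state
# dict; it does not merge the per-rank optimizer state files.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "output/checkpoint-500"  # hypothetical Trainer checkpoint folder
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)
torch.save(state_dict, "pytorch_model_fp32.bin")
```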

> Hi @ldh127 - can you please be more specific, share more about what you are trying to do and what errors you are hitting? Yes, I use the transformers Trainer...

> > Yes, but I think this code is for the ds_to_universal model params, not for merging multiple optim files into one file; can it process merging the DeepSpeed multi-GPU...

> @tjruwase Hi 😳 sorry to bother you again. I tried ZeRO-2 and got a ZeRO-2 checkpoint, but it seems the Accelerate + DeepSpeed checkpoint structure is a bit different from the [Universal Checkpoint examples](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing) in...
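
For reference, a hedged sketch of running DeepSpeed's ds_to_universal.py converter on such a checkpoint; the folder layout below is a hypothetical placeholder for an Accelerate + DeepSpeed Trainer run:

```python
# Minimal sketch: convert a DeepSpeed ZeRO checkpoint into a universal
# checkpoint via the ds_to_universal.py script from the DeepSpeed repo.
# All paths below are hypothetical placeholders.
import subprocess

subprocess.run(
    [
        "python", "deepspeed/checkpoint/ds_to_universal.py",
        "--input_folder", "output/checkpoint-500/global_step500",
        "--output_folder", "output/checkpoint-500/global_step500_universal",
    ],
    check=True,
)
```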

> > > > I ran into this problem too: LoRA + ZeRO-2, and it hung on both runs with GPU utilization at 99%. Then I added output_router_logits=True in AutoConfig and it worked; I'm not sure whether that was the cause.
> > >
> > > Could you explain in a bit more detail? Is setting output_router_logits=True in the model config enough, or did you change anything else?
> >
> > Yes, setting output_router_logits=True in the model config is enough; nothing else changed.
>
> Why does it still not work when I try it on my side? Could you paste your config parameters and such?

I added it too and it still doesn't work. Has this problem been solved?
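
The fix described above would look roughly like this; the model name is a hypothetical MoE example (the thread does not name one). Setting output_router_logits=True makes the model return router logits so the auxiliary load-balancing loss is computed during training:

```python
# Minimal sketch, assuming a transformers MoE model such as Mixtral
# (hypothetical choice; the thread does not name the model).
# output_router_logits=True makes forward() return the router logits, which
# the model needs to compute the auxiliary load-balancing loss in training.
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "mistralai/Mixtral-8x7B-v0.1"  # hypothetical example
config = AutoConfig.from_pretrained(model_name, output_router_logits=True)
model = AutoModelForCausalLM.from_pretrained(model_name, config=config)
```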

> In trainer.py, line 2555:
>
> ```python
> elif self.is_local_process_zero():
>     # Clean up the remaining staging checkpoint folders on other nodes
>     if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
>         ...
> ```
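
For context, a minimal sketch (not the actual Trainer source) of the save-to-staging-then-rename pattern this snippet belongs to; the helper name and staging-folder naming scheme below are hypothetical:

```python
# Minimal sketch of the staging pattern: write the checkpoint into a temporary
# "staging" folder, then rename it to the final path, so readers never see a
# half-written checkpoint. Helper name and naming scheme are hypothetical.
import os
from typing import Callable

def save_checkpoint(output_dir: str, write_files: Callable[[str], None]) -> None:
    staging_output_dir = output_dir + ".staging"  # hypothetical naming scheme
    os.makedirs(staging_output_dir, exist_ok=True)
    write_files(staging_output_dir)            # write all checkpoint files here
    os.rename(staging_output_dir, output_dir)  # atomic publish on POSIX filesystems
```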