ldh127

Results: 8 comments of ldh127

Yes, but I think this code is for the ds_to_universal model params, not for merging multiple optim files into one file; can it process merging the DeepSpeed multi-GPU optim files into...
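
For context, a minimal sketch of the model-param consolidation being discussed, assuming a standard DeepSpeed ZeRO checkpoint layout (the paths are hypothetical): it merges the sharded model weights into one fp32 state dict, which is distinct from merging the per-rank optimizer files.

```python
# Minimal sketch, assuming a standard DeepSpeed ZeRO checkpoint directory.
# This consolidates the sharded *model* parameters into a single fp32 state
# dict; it does not merge the per-rank optimizer state files.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "output/checkpoint-500"  # hypothetical Trainer checkpoint folder
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)
torch.save(state_dict, "pytorch_model_fp32.bin")
```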

> Hi @ldh127 - can you please be more specific, share more about what you are trying to do and what errors you are hitting? Yes, I use the transformers Trainer...

> > Yes, but I think this code is for the ds_to_universal model params, not for merging multiple optim files into one file; can it process merging the DeepSpeed multi-GPU...

> @tjruwase Hi 😳 sorry to bother you again. I tried ZeRO-2 and got a ZeRO-2 checkpoint, but it seems the Accelerate + DeepSpeed checkpoint structure is a bit different from the [Universal Checkpoint examples](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing) in...
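
For reference, a hedged sketch of running DeepSpeed's ds_to_universal.py converter on such a checkpoint; the folder layout below is a hypothetical placeholder for an Accelerate + DeepSpeed Trainer run:

```python
# Minimal sketch: convert a DeepSpeed ZeRO checkpoint into a universal
# checkpoint via the ds_to_universal.py script from the DeepSpeed repo.
# All paths below are hypothetical placeholders.
import subprocess

subprocess.run(
    [
        "python", "deepspeed/checkpoint/ds_to_universal.py",
        "--input_folder", "output/checkpoint-500/global_step500",
        "--output_folder", "output/checkpoint-500/global_step500_universal",
    ],
    check=True,
)
```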

> > > > I ran into this problem too: LoRA + ZeRO-2, and it hung on both runs with GPU utilization at 99%. Then I added output_router_logits=True in AutoConfig and it worked; I'm not sure whether that was the cause.
> > >
> > > Could you explain in a bit more detail? Is setting output_router_logits=True in the model config enough, or did you change anything else?
> >
> > Yes, setting output_router_logits=True in the model config is enough; nothing else changed.
>
> Why does it still not work when I try it on my side? Could you paste your config parameters and such?

I added it too and it still doesn't work. Has this problem been solved?
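
The fix described above would look roughly like this; the model name is a hypothetical MoE example (the thread does not name one). Setting output_router_logits=True makes the model return router logits so the auxiliary load-balancing loss is computed during training:

```python
# Minimal sketch, assuming a transformers MoE model such as Mixtral
# (hypothetical choice; the thread does not name the model).
# output_router_logits=True makes forward() return the router logits, which
# the model needs to compute the auxiliary load-balancing loss in training.
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "mistralai/Mixtral-8x7B-v0.1"  # hypothetical example
config = AutoConfig.from_pretrained(model_name, output_router_logits=True)
model = AutoModelForCausalLM.from_pretrained(model_name, config=config)
```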

> In trainer.py, line 2555:
>
> ```python
> elif self.is_local_process_zero():
>     # Clean up the remaining staging checkpoint folders on other nodes
>     if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
>         ...
> ```
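
For context, a minimal sketch (not the actual Trainer source) of the save-to-staging-then-rename pattern this snippet belongs to; the helper name and staging-folder naming scheme below are hypothetical:

```python
# Minimal sketch of the staging pattern: write the checkpoint into a temporary
# "staging" folder, then rename it to the final path, so readers never see a
# half-written checkpoint. Helper name and naming scheme are hypothetical.
import os
from typing import Callable

def save_checkpoint(output_dir: str, write_files: Callable[[str], None]) -> None:
    staging_output_dir = output_dir + ".staging"  # hypothetical naming scheme
    os.makedirs(staging_output_dir, exist_ok=True)
    write_files(staging_output_dir)            # write all checkpoint files here
    os.rename(staging_output_dir, output_dir)  # atomic publish on POSIX filesystems
```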