Results 18 comments of KyrieMing

@dashstander Hello, which version of DeepSpeed does this PR require? I am using DeepSpeed 0.8.2 as gpt-neox 2.0 suggests; running deepspeed_to_deepspeed.sh gives the following errors:

```
deepspeed 0.8.2
fatal: not a git...
```

@dashstander I am using this DeepSpeed [PR](https://github.com/EleutherAI/DeeperSpeed/pull/47). When I then run deepspeed_to_deepspeed.sh, this error appears:

```
Convert DeepSpeed Checkpoint to DeepSpeed Checkpoint
args = Namespace(config='/mnt/resources/checkpoints-neox2.0-6b-4096-256GPUs/global_step20000/configs/1-3B.yml', input_folder='/mnt/resources/checkpoints-neox2.0-6b-4096-256GPUs', output_folder='/mnt/resources/checkpoints-neox2.0-6b-4096-256GPUs-6dp', target_dp=48, target_pp=1, target_tp=1)
Converting...
```
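For reference, the Namespace dump above corresponds to roughly this invocation (the script path and flag spellings are assumptions, not taken from the repo; adjust to your checkout):

```shell
python tools/convert/deepspeed_to_deepspeed.py \
    --config /mnt/resources/checkpoints-neox2.0-6b-4096-256GPUs/global_step20000/configs/1-3B.yml \
    --input_folder /mnt/resources/checkpoints-neox2.0-6b-4096-256GPUs \
    --output_folder /mnt/resources/checkpoints-neox2.0-6b-4096-256GPUs-6dp \
    --target_dp 48 --target_pp 1 --target_tp 1
```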

@dashstander @Quentin-Anthony @StellaAthena I compared the code between NeoX 1.0 and NeoX 2.0 and found that when setting pipe-parallel-size=1:

- In NeoX 1.0, is_pipe_parallel is set to True, then the model is...
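The version difference described above can be sketched as follows (a guess at the effective logic, not the actual NeoX source; the function names are mine, and the 2.0 behavior is assumed from the contrast being drawn):

```python
def is_pipe_parallel_v1(pipe_parallel_size: int) -> bool:
    # NeoX 1.0 (as described above): size 1 still counts as pipe-parallel,
    # so the model is built as a PipelineModule.
    return pipe_parallel_size >= 1

def is_pipe_parallel_v2(pipe_parallel_size: int) -> bool:
    # NeoX 2.0 (assumed): only sizes greater than 1 count as pipe-parallel,
    # so size 1 produces a plain (non-pipeline) module.
    return pipe_parallel_size > 1
```

If this is right, it would explain why checkpoints written by the two versions disagree about module layout even with the same pipe-parallel-size=1 config.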

> Hey @sxthunder , thanks so much for looking at this! You do need to use Eleuther's fork of DeepSpeed and in particular [this branch that I referenced in the...

> > I manually transformed native torch checkpoints to pipeline checkpoints; ds_to_ds ran successfully but the loss is still wrong. I assume this ds_to_ds.sh only works for GPT-NeoX 1.0 ckpts or PipelineModule...

Some additional information: GPT-NeoX 1.0 supports elastic finetuning: I pretrained a model on 64 GPUs, and it can be finetuned on any number of GPUs (fewer than 64), which only works fine...
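Following up on the elastic-finetuning point above, here is a minimal pure-Python sketch (all names and shapes are hypothetical, not the actual NeoX/DeepSpeed code) of why pure data parallelism leaves the GPU count flexible: every rank holds an identical full copy of the model, so a checkpoint written at one data-parallel degree restores at any other.

```python
# Hypothetical full state dict; under pure data parallelism (pp=1, tp=1)
# every rank saves and loads this same complete copy.
full_state = {"layer.weight": [[1.0, 2.0], [3.0, 4.0]]}

def load_on_rank(rank: int, world_size: int, state: dict) -> dict:
    # Each rank simply takes the whole checkpoint; the saving run's
    # world size never enters the picture, so no resharding is needed.
    return dict(state)

# A checkpoint "saved at 64 GPUs" restores cleanly on an 8-GPU run.
restored = [load_on_rank(r, world_size=8, state=full_state) for r in range(8)]
```

With pipeline or tensor parallelism, by contrast, each rank holds only a shard, which is why a conversion script like deepspeed_to_deepspeed.sh has to reshape checkpoints explicitly when the target topology changes.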