Megatron-DeepSpeed
Question about downloading checkpoints of 6.3B, 2.5B, 1.3B
Does BigScience also provide the original BLOOM checkpoints (without conversion to Hugging Face 🤗)? I am working on fine-tuning BLOOM (6.3B, 2.5B, 1.3B) and I need those checkpoint files.
In https://github.com/bigscience-workshop/bigscience/tree/master/train/tr1-13B-base, I found some URLs, but they are all offline.
I created 4 repos at https://huggingface.co/bigscience/, and now we can clone them as the directories the data will be output into:
cd $six_ALL_CCFRSCRATCH/checkpoints/tr1-13B
git clone https://huggingface.co/bigscience/tr1-13B-checkpoints checkpoints
git clone https://huggingface.co/bigscience/tr1-13B-tensorboard tensorboard
git clone https://huggingface.co/bigscience/tr1-13B-codecarbon codecarbon
git clone https://huggingface.co/bigscience/tr1-13B-logs logs
You can convert the HF checkpoints back to Megatron-DeepSpeed. See this (a bit hacky) script: https://gist.github.com/malteos/c194368594e16439c101b7bf27195fd1
@malteos Thank you for your answer! However, in your code, I need to specify a DeepSpeed checkpoint:
"checkpoint_dir",
type=str,
help="Path to the DeepSpeed checkpoint directory",
But I do not have a DeepSpeed checkpoint for the 6.3B, 2.5B, or 1.3B models to pass to it.
The script updates the weights of the DeepSpeed checkpoint directly on disk with the weights from an HF checkpoint. So you just need to save an untrained DS checkpoint and update it afterwards. You can use the existing SLURM scripts for that and simply set the number of train steps to 1.
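In case it helps, here is a rough, untested sketch of that update-in-place idea (not the actual gist). It assumes the untrained DeepSpeed checkpoint keeps per-layer weight files named like layer_*-model_00-model_states.pt; the checkpoint path, the HF model id, and the map_to_hf_name helper below are placeholders you would have to fill in yourself:

# Rough sketch of the idea only; the real gist builds the full name mapping.
import glob
import torch
from transformers import AutoModelForCausalLM

ds_checkpoint_dir = "/path/to/untrained/checkpoint/global_step1"  # placeholder path
hf_model_name = "bigscience/bloom-1b1"  # placeholder HF model id

# Load the trained Hugging Face weights once.
hf_state = AutoModelForCausalLM.from_pretrained(hf_model_name).state_dict()

def map_to_hf_name(ds_name, layer_file):
    # Hypothetical helper: translate a Megatron-DeepSpeed parameter name
    # (plus the layer file it came from) into the matching HF parameter name.
    # The real script resolves the layer indices and prefixes for you.
    ...

# Assumption: the per-layer weight files follow this naming scheme on disk.
for path in sorted(glob.glob(f"{ds_checkpoint_dir}/layer_*-model_00-model_states.pt")):
    ds_state = torch.load(path, map_location="cpu")
    for ds_name in list(ds_state.keys()):
        hf_name = map_to_hf_name(ds_name, path)
        if hf_name in hf_state and torch.is_tensor(ds_state[ds_name]) \
                and hf_state[hf_name].shape == ds_state[ds_name].shape:
            # Overwrite the untrained tensor with the trained HF weight.
            ds_state[ds_name] = hf_state[hf_name].clone()
    torch.save(ds_state, path)  # write the updated weights back in place

The point is just that the untrained tensors get replaced in place on disk, so afterwards you can resume fine-tuning from that checkpoint with the usual Megatron-DeepSpeed launch scripts.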