Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
Hello, I used the 176B checkpoint of BLOOM (https://huggingface.co/bigscience/bloom), but had a problem resolving the layer files. Should I download a different type of checkpoint to use with this repo, or what...
Does BigScience also provide the original BLOOM checkpoints (without conversion to Hugging Face 🤗)? I am working on finetuning BLOOM (6.3B, 2.5B, 1.3B) and need those checkpoint files. [issues/315](https://github.com/bigscience-workshop/Megatron-DeepSpeed/issues/315) In https://github.com/bigscience-workshop/bigscience/tree/master/train/tr1-13B-base, I...
I am trying to get multi-node inference working with 4 nodes, each with 4xRTX8000 GPUs (48GB per GPU). `deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom`. The script finishes loading all the checkpoints...
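For a multi-node launch like the one above, DeepSpeed reads the node list from a hostfile in OpenSSH-slots style (`hostname slots=N`). A minimal sketch for 4 nodes with 4 GPUs each; the hostnames are placeholders, and the launch command (shown as a comment) is the one from the report:

```shell
# Hypothetical hostfile for 4 nodes x 4 GPUs; hostnames are placeholders.
cat > hostfile <<'EOF'
node1 slots=4
node2 slots=4
node3 slots=4
node4 slots=4
EOF

# Then launch, as in the report above:
#   deepspeed --hostfile=hostfile \
#     Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py \
#     --name bigscience/bloom

# Sanity check: every node should advertise 4 slots.
grep -c 'slots=4' hostfile
```

Each `slots=N` entry tells the launcher how many worker processes (one per GPU) to spawn on that host.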
## Update The issue turned out to be DeepSpeed's usage of `pretrain_gpt_single_node.sh`. I will make a pull request soon. ## Original Report Please let me know what details I shall...
Following the training script here as a template: https://github.com/bigscience-workshop/bigscience/blob/master/train/tr1-13B-base/tr1-13B-round1.slurm I've trained some models using 2-way tensor parallelism and 4-way pipeline parallelism, which produces a number of checkpoints in directories like...
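With 2-way tensor parallelism and 4-way pipeline parallelism, each checkpoint step is split into TP × PP = 8 model-parallel shards, one per (tensor rank, pipeline rank) pair, which is why the checkpoint directory contains multiple files per step. A minimal sketch of that shard grid (the function name is mine, not from the repo):

```python
def mp_shards(tp: int, pp: int) -> list[tuple[int, int]]:
    """Enumerate (tp_rank, pp_rank) coordinates for every
    model-parallel shard of one checkpoint step."""
    return [(t, p) for p in range(pp) for t in range(tp)]

shards = mp_shards(tp=2, pp=4)
print(len(shards))  # 8 shards per saved step
```

Merging such a checkpoint back into a single model requires concatenating the tensor-parallel splits and stitching the pipeline stages in layer order.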
This PR adapts evaluation to work with Prefix LMs, such as those used for the T0 finetuning experiments. Using the normal eval harness I get the following results: Using `CHECKPOINT_PATH=$six_ALL_CCFRSCRATCH/checkpoints/tr11f-6B3-ml/checkpoints/main/global_step163750` (CKPT prior...
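Evaluating a Prefix LM differs from a causal LM mainly in the attention mask: tokens in the prefix (the input) attend to each other bidirectionally, while tokens after the prefix remain causal. A minimal NumPy sketch of such a mask (the function name is mine, not from the PR):

```python
import numpy as np

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Build a prefix-LM attention mask: 1 = position i may attend to j.
    Prefix positions attend bidirectionally within the prefix;
    later positions attend causally (prefix + earlier targets)."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=int))  # causal base
    mask[:prefix_len, :prefix_len] = 1                      # bidirectional prefix
    return mask

m = prefix_lm_mask(seq_len=5, prefix_len=2)
```

For example, with `prefix_len=2`, position 0 may attend to position 1 (bidirectional), while position 2 still cannot attend to position 4 (causal).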
I have to calculate the ETA for finishing training often enough that I think it should be a feature. How about we log the ETA alongside `elapsed time per iteration`?...
I have a PR open in Microsoft's DeepSpeed repository that parallelizes the task of writing per-layer checkpoint files across data-parallel instances: https://github.com/microsoft/DeepSpeed/pull/1419 On my system, I found that this...
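The idea described above is to spread per-layer checkpoint writes over the data-parallel ranks instead of having one rank write every file. A hedged sketch of one such assignment (simple round-robin; not the PR's actual code):

```python
def layers_for_rank(num_layers: int, dp_world_size: int, dp_rank: int) -> list[int]:
    """Round-robin assignment of per-layer checkpoint files to
    data-parallel ranks, so writes happen concurrently across ranks."""
    return [l for l in range(num_layers) if l % dp_world_size == dp_rank]

# With 8 layers and 4 data-parallel ranks, each rank writes 2 layer files.
for rank in range(4):
    print(rank, layers_for_rank(num_layers=8, dp_world_size=4, dp_rank=rank))
```

Because data-parallel ranks hold identical model replicas, any rank can write any layer's file, so the assignment only needs to be disjoint and complete.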