gpt-neox
An implementation of model-parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries.
**Describe the bug** When starting with `small.yml`, then changing the ZeRO stage to 2 and `cpu_offload` to `true`, I get the following error:

```
RuntimeError: expected input to be on cuda
```
...
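For context, a minimal sketch of the configuration change being described, written as a DeepSpeed-style dict; the key names follow DeepSpeed's `zero_optimization` schema, and everything else in `small.yml` is assumed to keep its defaults:

```python
# Hypothetical reproduction of the reported setup: ZeRO stage 2 with the
# (since-deprecated) cpu_offload flag enabled. Only the relevant section
# is shown; the rest of small.yml is assumed unchanged.
zero_config = {
    "zero_optimization": {
        "stage": 2,           # changed from the stage shipped in small.yml
        "cpu_offload": True,  # offloads optimizer state to CPU memory
    },
}
```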
**Describe the bug** Running the model gives the following warning:

```
[2021-11-20 20:08:18,491] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
```

We should update the way that our code...
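The warning itself names the replacement key. A hedged sketch of the migration, using DeepSpeed's documented `offload_optimizer` form (the exact shape of our configs may differ):

```python
# Deprecated form that triggers the warning above:
old_config = {"zero_optimization": {"stage": 2, "cpu_offload": True}}

# Replacement suggested by the warning: offload_optimizer with an
# explicit target device.
new_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}
```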
The requirements files in ./requirements have historically been strictly pinned to prevent CI Docker images from changing without our prior knowledge. However, this places a burden on users who would...
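For illustration, the difference between a strict pin and the looser bound users might prefer; the package and version numbers here are hypothetical examples, not the repo's actual pins:

```
# strict pin: reproducible CI Docker images
deepspeed==0.3.15

# relaxed bound: easier for users to satisfy alongside other packages
deepspeed>=0.3.15,<0.4
```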
**Describe the bug** The preprocess_data script expects a "text" column in the JSON input regardless of the json-keys passed in the arguments. This is because lmd.Reader(fname).stream_data() expects to have...
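Until the script honors the passed keys, one hedged workaround is to rewrite the input so the desired field appears under "text"; the file names and the source key ("content") below are hypothetical:

```python
import json

# Rewrite each JSONL record so the field we care about is exposed under
# the "text" key that the preprocessing reader currently hardcodes.
with open("data.jsonl") as src, open("data_text_key.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        record["text"] = record.pop("content")  # "content" is an assumed key
        dst.write(json.dumps(record) + "\n")
```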
Getting this error on import of deepspeed. I am currently using torch 1.8.0 and installed the dependencies from requirements.txt as directed. I am also unable to install apex from the link provided.
**Describe the bug** It appears that imbalances in the distillation weights have a significant impact on performance. When I set them all equal to 1, it runs twice as fast...
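For reference, a minimal sketch of a weighted distillation loss in the standard (Hinton-style) form; the function and weight names are assumptions, not the repo's actual distillation code:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      w_ce=1.0, w_kl=1.0, temperature=1.0):
    # Hard-label cross-entropy against the ground truth.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label KL divergence against the teacher's distribution,
    # scaled by T^2 as in standard knowledge distillation.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # The w_ce / w_kl weights are what the issue describes as imbalanced.
    return w_ce * ce + w_kl * kl
```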
**Is your feature request related to a problem? Please describe.** I'm frustrated because I can't use my GeForce MX 250 to train a 13B GPT-NeoX. **Describe the solution you'd like**...
@preethamgali wrote a model distillation framework [here](https://github.com/EleutherAI/distilling), which we should aim to integrate into GPT-NeoX.
**Describe the bug** Loss for RPE position embedding is not going down:

```
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 50
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 50
[2021-05-04 ...
```