gpt-neox
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
Looking deeper into the gpt-j residual implementation, I found a difference in how the layer norm(s) are applied. I don't see the point in applying two separate layer norm modules to...
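A minimal sketch of the two variants being compared, in plain PyTorch (the `attn`, `mlp`, and `hidden` arguments are placeholders, not the actual NeoX modules): a GPT-J style parallel residual where attention and MLP either share one LayerNorm or each get their own.

```python
import torch
import torch.nn as nn

class ParallelResidualBlock(nn.Module):
    """Illustrative only, not the NeoX implementation: a parallel-residual
    block where the attention and MLP branches read the same residual stream."""

    def __init__(self, hidden, attn, mlp, shared_ln: bool = True):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.ln_1 = nn.LayerNorm(hidden)
        # With a shared norm (GPT-J style) both branches see ln_1(x);
        # otherwise the MLP branch gets its own, separately-learned norm.
        self.ln_2 = self.ln_1 if shared_ln else nn.LayerNorm(hidden)

    def forward(self, x):
        return x + self.attn(self.ln_1(x)) + self.mlp(self.ln_2(x))
```

The only difference between the two configurations is whether `ln_1` and `ln_2` are the same module object or two modules with independent parameters.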
Some of our code is fairly underdocumented, to say the least. Where possible, it would be good to:
- Add input / output type hints to all functions
- Add docstrings...
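A small sketch of the requested style, using a hypothetical helper function (not something in the codebase), with type hints on inputs and outputs plus a docstring:

```python
from typing import List

def count_tokens(documents: List[str]) -> int:
    """Return the total whitespace-separated token count across `documents`.

    Hypothetical helper, shown only to illustrate the requested
    type-hint + docstring convention.
    """
    return sum(len(doc.split()) for doc in documents)
```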
**Describe the bug** When running `tools/preprocess_data.py` to tokenize my dataset, I was confused why the generated `.bin` and `.idx` files were empty. It turns out that `lm_dataformat`, the library which...
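A possible pre-flight check, sketched under the assumption that `lm_dataformat` is installed and `data.jsonl` stands in for your input file: confirm the reader actually yields documents before spending time on tokenization, rather than discovering empty `.bin`/`.idx` files afterwards.

```python
# Sanity check: does lm_dataformat yield any documents from the input?
from lm_dataformat import Reader

reader = Reader("data.jsonl")  # hypothetical input path
n_docs = sum(1 for _ in reader.stream_data())
if n_docs == 0:
    raise RuntimeError("lm_dataformat yielded no documents; "
                       "the resulting .bin/.idx files would be empty")
print(f"found {n_docs} documents")
```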
Rebased version of https://github.com/EleutherAI/gpt-neox/pull/466. Tested only on 20B.
**Is your feature request related to a problem? Please describe.** Would be good to remove the megatron tensor parallelism code from NeoX, and [OSLO](https://github.com/tunib-ai/oslo) currently has support for this, and...
This is not meant to be merged directly. I just wanted to give an example of changes that you need to make to run on AMD GPUs (tested with rocm-4.5.2)....
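One place where such changes typically branch is detecting whether the installed PyTorch is a ROCm (HIP) build, e.g. to skip CUDA-only fused-kernel extensions. A sketch (the flag name `IS_ROCM` and the print messages are illustrative, not from the PR):

```python
import torch

# torch.version.hip is None on CUDA builds and a version string on ROCm builds.
IS_ROCM = torch.version.hip is not None

if IS_ROCM:
    print(f"ROCm build detected (HIP {torch.version.hip}); skipping CUDA-only extensions")
else:
    print(f"CUDA build detected (CUDA {torch.version.cuda})")
```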
**Is your feature request related to a problem? Please describe.** Training very large networks takes a lot of time and requires a lot of resources that are unavailable to many small...
Thank you for open-sourcing such a great repo for the community! Your work is really helping our team train large pretrained models :) In our experiment, we find...
…mizer/blob/master/torch_optimizer/shampoo.py
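For reference, the referenced Shampoo implementation can be exercised standalone via the `torch-optimizer` package; the sketch below is illustrative usage (model, data, and the `lr` value are placeholders, not the PR's settings or NeoX's integration).

```python
import torch
import torch_optimizer

# Toy model and data, used only to show the optimizer's step cycle.
model = torch.nn.Linear(16, 4)
optimizer = torch_optimizer.Shampoo(model.parameters(), lr=1e-2)

x, y = torch.randn(8, 16), torch.randn(8, 4)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```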