Sean Naren issues

Results: 11 issues of Sean Naren

Add performance section to README

We should validate the speed difference between Megatron and this repo with the measured results. I think we can get away with reporting values from the profiler branch.

Fuse MLP in attention mechanism

Due to https://github.com/facebookresearch/xformers/issues/286 we cannot currently fuse the bias/gelu/activation into a single kernel using Triton. This means we're just using a standard [MLP](https://github.com/SeanNaren/min-LLM/blob/main/model.py#L113-L118) and are probably taking a perf hit....

enhancement

Improve DeepSpeed Stage 3 Throughput

On 8 A100 with [this](https://github.com/SeanNaren/min-LLM/blob/main/train.py#L174-L190) deepspeed config, below is the measured TFLOPs:

```
deepspeed --num_gpus 8 train.py --batch_size_per_gpu 36
```

```
Estimates: 129.32TFLOPs
Avg Iteration Time: 8.01s
```

Within the...
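For context, the figures quoted above can be turned into a rough samples-per-second number. This is only a sketch: the helper name `samples_per_second` is hypothetical, and it assumes one optimizer step per iteration (no gradient accumulation), which the issue text does not state.

```python
def samples_per_second(num_gpus, batch_size_per_gpu, iteration_time_s, grad_accum_steps=1):
    """Rough throughput from the global batch size and the average iteration time.

    grad_accum_steps=1 is an assumption; the issue does not say whether
    gradient accumulation was used.
    """
    global_batch = num_gpus * batch_size_per_gpu * grad_accum_steps
    return global_batch / iteration_time_s


# Values taken from the command and log lines quoted in the issue.
throughput = samples_per_second(num_gpus=8, batch_size_per_gpu=36, iteration_time_s=8.01)
print(f"{throughput:.2f} samples/s")
```

Tracking this number alongside the TFLOPs estimate makes it easier to compare runs that change the per-GPU batch size.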

Fix model initialisation

We're currently relying on the minGPT/microGPT initialization; however, this might need to be modified, especially considering we're using ZeRO Stage 3. Some investigation will be required to understand what the...

Using FSDP

This branch is my attempt to squeeze in the largest model I can with BlockSparse vs standard dot-product attention + FSDP, with optimal from-scratch training throughput....

High Level Plan for the Journey!

I'd like to document my current thinking of how I'll get to a final set of pre-trained weights, for a large(ish) transformer model. The plan will probably need multiple edits,...

Ignore padding tokens when using Translation Task + `padding='max_length'`

## 🐛 Bug

When using the Translation Task, we need to ensure that we skip padding tokens within the loss calculation. Currently we do not replace the padding with -100,...

bug / fix

help wanted
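The padding fix described in this issue can be illustrated with a minimal sketch. It is written in plain Python so it runs without any dependencies; the function name `mask_padding` and the example batch are hypothetical and not taken from the repository. The value -100 matches the default `ignore_index` of PyTorch's `torch.nn.CrossEntropyLoss`, so positions set to it do not contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Return a copy of the label batch with every padding id replaced by ignore_index."""
    return [
        [ignore_index if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]


# Hypothetical target batch padded to max_length=5, assuming pad_token_id=0.
batch = [[12, 7, 3, 0, 0], [5, 9, 0, 0, 0]]
masked = mask_padding(batch, pad_token_id=0)
print(masked)
```

One caveat with masking by id: if the tokenizer reuses a real token (for example the end-of-sequence token) as its padding token, those real positions would be masked as well, so masking from the attention mask may be safer in that case.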
