Sean Naren

Results 11 issues of Sean Naren

We should validate the speed difference between Megatron and this repo with the measured results. I think we can get away with reporting values from the profiler branch.

Due to https://github.com/facebookresearch/xformers/issues/286 we cannot currently fuse the bias/gelu/activation into a single kernel using triton. This means we just use a standard [MLP](https://github.com/SeanNaren/min-LLM/blob/main/model.py#L113-L118) and are probably taking a perf hit....

enhancement

On 8 A100 with [this](https://github.com/SeanNaren/min-LLM/blob/main/train.py#L174-L190) deepspeed config, below is the measured TFLOPs:

```
deepspeed --num_gpus 8 train.py --batch_size_per_gpu 36
```

```
Estimates: 129.32TFLOPs
Avg Iteration Time: 8.01s
```

Within the...

We're currently relying on the minGPT/microGPT initialization; however, this might need to be modified, especially considering we're using ZeRO Stage 3. Some investigation will be required to understand what the...

This branch is my attempt to squeeze in the largest model I can with BlockSparse vs standard dot product Attention + FSDP, with optimal training-from-scratch throughput....

I'd like to document my current thinking on how I'll get to a final set of pre-trained weights for a large(ish) transformer model. The plan will probably need multiple edits,...

## 🐛 Bug

When using the Translation Task, we need to ensure that we skip padding tokens within the loss calculation. Currently we do not replace the padding with -100,...

bug / fix
help wanted
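The masking step the Translation Task bug report asks for can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the fix from the issue: `mask_padding_labels` and `pad_token_id` are hypothetical names, and the labels are plain nested lists rather than tensors. The value -100 matches the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so labels set to it are skipped by the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding_labels(labels, pad_token_id, ignore_index=IGNORE_INDEX):
    """Return a copy of `labels` with every padding id replaced by `ignore_index`.

    `labels` is a batch of token-id sequences (list of lists of ints).
    The input batch is left unmodified.
    """
    return [
        [ignore_index if token == pad_token_id else token for token in row]
        for row in labels
    ]


# Hypothetical batch: two target sequences padded with id 0 to length 5.
batch = [[5, 8, 2, 0, 0], [7, 2, 0, 0, 0]]
masked = mask_padding_labels(batch, pad_token_id=0)
```

One caveat of this sketch: it masks every occurrence of the padding id, so it assumes the padding id never appears as a genuine target token.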

# What does this PR do ?

Currently style checking is done on Jenkins. This is problematic for users who do not have access to Jenkins and cannot see the...

# What does this PR do ?

Adds Exponential Moving Average (EMA) support. We've seen promising results in reducing training times to reach convergence.

Things to do:
- [x] Capture...

enhancement
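For readers unfamiliar with EMA of model weights, the general idea behind the EMA PR can be sketched as follows. This is a minimal framework-free illustration, not the PR's implementation: `ema_update` and the `decay` value are hypothetical, and the weights are plain floats rather than tensors. After each optimizer step, the averaged copy moves a small fraction of the way towards the current weights.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Return the new EMA weights after one training step.

    Each averaged weight keeps `decay` of its previous value and takes
    `1 - decay` of the current model weight.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Hypothetical two-parameter model; decay=0.5 is chosen only to keep the
# arithmetic easy to follow (practical values are much closer to 1).
ema = [1.0, 2.0]
current = [3.0, 4.0]
ema = ema_update(ema, current, decay=0.5)
```

The averaged weights are typically used only for evaluation or checkpointing, while training continues on the raw weights.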

# What does this PR do ?

- Sync AMI evaluation dataset annotations with https://github.com/BUTSpeechFIT/AMI-diarization-setup. This follows the standardization by pyannote
- Add training dataset

**Collection**:
- ASR

# Changelog...

enhancement