Shane A issues

Results: 8 issues of Shane A

FSDP Overlap Investigation

The problem is that on LUMI, FSDP does not overlap computation and communication as it should. Evidence comes from this profiler trace: ![Image](https://github.com/allenai/LLM/assets/920638/9d5d4437-adc7-485d-97c3-4cf71643808f) It may be noteworthy that the NCCL GPU...

Update Llama config to use Llama block and RoPE lower precision

Updating the Llama config to use Llama block and RoPE lower precision, to match the behavior of bf16-autocast Llama more closely.

[Storage Cleaner] Unsharding improvements

This is the first of a few PRs with refactored versions of changes I have been using locally for checkpoint management. This PR focusses on my changes to the storage...

Add OLMo 1.7-7b README + Config

Add script for comparing model state

This PR adds a script for comparing the model state of two different OLMo models/checkpoints. The script helped reveal that one of the LUMI containers was not loading/saving checkpoints correctly.
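As a rough illustration of what a model-state comparison can look like, here is a minimal sketch. It is not the script from the PR: the function `compare_state`, the tolerance `tol`, and the parameter names are all hypothetical, and each state is assumed to be a plain mapping from parameter name to a flat list of floats rather than real tensors.

```python
from typing import Dict, List, Mapping, Sequence


def compare_state(
    a: Mapping[str, Sequence[float]],
    b: Mapping[str, Sequence[float]],
    tol: float = 1e-6,
) -> Dict[str, List[str]]:
    """Compare two flattened state mappings (parameter name -> values).

    Hypothetical helper for illustration only; not the PR's implementation.
    """
    report: Dict[str, List[str]] = {
        # Parameters present in one state but missing from the other.
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "size_mismatch": [],
        "value_mismatch": [],
    }
    for name in sorted(set(a) & set(b)):
        va, vb = a[name], b[name]
        if len(va) != len(vb):
            report["size_mismatch"].append(name)
        elif any(abs(x - y) > tol for x, y in zip(va, vb)):
            # At least one element differs by more than the tolerance.
            report["value_mismatch"].append(name)
    return report


# Made-up parameter names and values, purely for demonstration.
state_a = {"wte.weight": [0.1, 0.2], "ln_f.weight": [1.0]}
state_b = {"wte.weight": [0.1, 0.25], "ln_f.weight": [1.0], "extra.bias": [0.0]}
report = compare_state(state_a, state_b)
```

A report of this kind separates missing parameters from numerical differences, which is the sort of signal that could point at a checkpoint that was saved or loaded incorrectly.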

Add regression tests for training

Issue: If we make a backward-incompatible change or introduce a regression, we don't have a mechanism to catch it. Also, if we start running jobs on a new platform we...

Update LUMI scripts

Since LUMI updated its hardware, I've updated the LUMI scripts and added a demo script. Note: py-spy doesn't work in the new container because it is no longer being maintained...

Add OLMoE checkpoints and run config

This PR adds the intermediate checkpoints and the run config for the OLMoE model. Code for OLMoE is not in main yet and will be a separate PR (probably by...