Shane A


On LUMI, FSDP does not overlap computation with communication as it should. The evidence is this profiler trace: ![Image](https://github.com/allenai/LLM/assets/920638/9d5d4437-adc7-485d-97c3-4cf71643808f) It may be noteworthy that the NCCL GPU...

Updates the Llama config to use the Llama block and lower-precision RoPE, to match the behavior of bf16-autocast Llama more closely.

This is the first of a few PRs with refactored versions of changes I have been using locally for checkpoint management. This PR focusses on my changes to the storage...

This PR adds a script for comparing the model state of two different OLMo models/checkpoints. This helped reveal that one of the LUMI containers was not loading/saving checkpoints correctly.
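The kind of comparison described above can be sketched roughly as follows. This is not the script from the PR: the function name `compare_states`, the report keys, and the tolerances are all hypothetical, and each parameter is represented as a flat list of floats rather than a real tensor so the example runs with only the standard library.

```python
import math
from typing import Dict, List, Sequence


def compare_states(
    state_a: Dict[str, Sequence[float]],
    state_b: Dict[str, Sequence[float]],
    rel_tol: float = 1e-6,
    abs_tol: float = 1e-8,
) -> Dict[str, List[str]]:
    """Report which parameters differ between two model states.

    Hypothetical stand-in: each state maps a parameter name to a flat
    sequence of floats instead of a framework tensor.
    """
    keys_a, keys_b = set(state_a), set(state_b)
    report: Dict[str, List[str]] = {
        "only_in_a": sorted(keys_a - keys_b),  # parameters missing from b
        "only_in_b": sorted(keys_b - keys_a),  # parameters missing from a
        "length_mismatch": [],                 # same name, different size
        "value_mismatch": [],                  # same size, different values
    }
    for name in sorted(keys_a & keys_b):
        a, b = state_a[name], state_b[name]
        if len(a) != len(b):
            report["length_mismatch"].append(name)
        elif not all(
            math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
            for x, y in zip(a, b)
        ):
            report["value_mismatch"].append(name)
    return report


if __name__ == "__main__":
    # Made-up parameter names and values, for illustration only.
    a = {"wte.weight": [0.1, 0.2], "ln_f.weight": [1.0]}
    b = {"wte.weight": [0.1, 0.25], "ln_f.weight": [1.0], "extra.bias": [0.0]}
    for kind, names in compare_states(a, b).items():
        print(kind, names)
```

A check along these lines, run on a checkpoint before saving and after reloading, is one way a faulty load/save path could show up as unexpected value mismatches.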

Issue: If we make a backward-incompatible change or introduce a regression, we have no mechanism to catch it. Also, if we start running jobs on a new platform we...

LUMI has updated its hardware, so I've updated the LUMI scripts and added a demo script. Note: py-spy doesn't work in the new container because it is no longer being maintained...

This PR adds the intermediate checkpoints and the run config for the OLMoE model. Code for OLMoE is not in main yet and will be a separate PR (probably by...