Pete Walsh

311 comments by Pete Walsh

Here's my attempt to list them, let me know if I'm missing anything:

- [x] SwiGLU Activation instead of ReLU / GeLU. #29 This is defined as `Swish(xW) * xV`...
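For reference, a minimal PyTorch sketch of that `Swish(xW) * xV` definition (the class and projection names here are illustrative, not OLMo's actual module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Computes Swish(x W) * (x V), per the definition above."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff, bias=False)  # gated branch
        self.v = nn.Linear(d_model, d_ff, bias=False)  # linear branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu is Swish with beta = 1
        return F.silu(self.w(x)) * self.v(x)
```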

Scaling logits got overlooked in favor of bigger things, but we should at least have the option implemented. I'll take care of it.

Scaling logits implemented in #239
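For context, a common form of logit scaling divides the LM-head output by `sqrt(d_model)` before the softmax. A minimal sketch under that assumption (the helper name and exact factor are illustrative; #239 has the actual change):

```python
import math
import torch

def scale_logits(logits: torch.Tensor, d_model: int) -> torch.Tensor:
    # Downscale pre-softmax logits so their magnitude stays roughly
    # constant as the model gets wider (illustrative; see #239).
    return logits / math.sqrt(d_model)
```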

Unfortunately this did not work well on LUMI, even when running the LN in full precision. The purple run here is using `AMDLayerNorm` and the beige is using this triton...
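For anyone debugging the same thing, "running the LN in full precision" can be approximated by disabling autocast around the LayerNorm math. A minimal sketch assuming PyTorch autocast; this is not `AMDLayerNorm` or the Triton kernel itself:

```python
import torch
import torch.nn as nn

class FullPrecisionLayerNorm(nn.LayerNorm):
    """LayerNorm forced to compute in fp32, even under autocast."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        orig_dtype = x.dtype
        # Disable autocast locally and upcast the input so all of the
        # normalization math happens in float32, then cast back.
        with torch.autocast(x.device.type, enabled=False):
            out = super().forward(x.float())
        return out.to(orig_dtype)
```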

I'm marking this blocked until we figure out what's wrong with triton on LUMI.

Hey @purefire, it's hard to give specific answers to some of these questions since the training code is flexible enough to work on a wide variety of hardware configurations, yet the...

@austinmw I'm not 100% certain on this, but I think you could probably train the 7B on 8 A100 GPUs with the following flags / config options for memory savings:...
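The exact flags are cut off above, so as a generic illustration only (not the options from that comment): memory savings for a 7B model on 8 GPUs usually come from sharding parameters, gradients, and optimizer state, plus mixed precision, e.g. with PyTorch FSDP:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

def wrap_for_memory_savings(model: torch.nn.Module) -> FSDP:
    # Generic illustration: shard everything across the 8 GPUs and
    # keep parameters in bf16. Requires torch.distributed to be
    # initialized first; activation checkpointing is another common knob.
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    )
```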

Looks like we're averaging about 80 tps (tokens per second) per GPU right now.
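For a rough sense of scale (the cluster size isn't stated here, so this is per GPU only):

```python
# Back-of-envelope: what ~80 tokens/sec/GPU works out to per day.
tps_per_gpu = 80
seconds_per_day = 24 * 60 * 60
print(f"{tps_per_gpu * seconds_per_day:,} tokens/GPU/day")  # 6,912,000
```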

@dirkgr I did add a config option to toggle the fused loss function in [54919e0](https://github.com/allenai/OLMo/pull/443/commits/54919e020e78d70b682b9295236b784eef2d0ed5). It defaults to false to be safe, so it will never run on LUMI.
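A minimal sketch of that kind of opt-in toggle (the fused backend import is an assumption for illustration, not necessarily what the commit wires in):

```python
import torch.nn.functional as F

def build_loss_fn(fused: bool = False):
    # Default to the stock PyTorch loss; only import and use a fused
    # kernel when explicitly enabled, so the safe path is the default.
    if fused:
        # Assumed fused backend; see the linked commit for the real wiring.
        from flash_attn.losses.cross_entropy import CrossEntropyLoss
        return CrossEntropyLoss()
    return F.cross_entropy
```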