Pete Walsh

311 comments by Pete Walsh

Here's my attempt to list them, let me know if I'm missing anything:

- [x] SwiGLU Activation instead of ReLU / GeLU. #29 This is defined as `Swish(xW) * xV`...
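For reference, a minimal PyTorch sketch of that `Swish(xW) * xV` definition (the class and projection names here are illustrative, not OLMo's actual module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Computes Swish(x W) * (x V), per the definition above."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff, bias=False)  # gated branch
        self.v = nn.Linear(d_model, d_ff, bias=False)  # linear branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu is Swish with beta = 1
        return F.silu(self.w(x)) * self.v(x)
```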

Scaling logits got overlooked in favor of bigger things, but we should at least have the option implemented. I'll take care of it.

Scaling logits implemented in #239
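For context, a common form of logit scaling divides the LM-head output by `sqrt(d_model)` before the softmax. A minimal sketch under that assumption (the helper name and exact factor are illustrative; #239 has the actual change):

```python
import math
import torch

def scale_logits(logits: torch.Tensor, d_model: int) -> torch.Tensor:
    # Downscale pre-softmax logits so their magnitude stays roughly
    # constant as the model gets wider (illustrative; see #239).
    return logits / math.sqrt(d_model)
```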

Unfortunately this did not work well on LUMI, even when running the LN in full precision. The purple run here is using `AMDLayerNorm` and the beige is using this triton...
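For anyone debugging the same thing, "running the LN in full precision" can be approximated by disabling autocast around the LayerNorm math. A minimal sketch assuming PyTorch autocast; this is not `AMDLayerNorm` or the Triton kernel itself:

```python
import torch
import torch.nn as nn

class FullPrecisionLayerNorm(nn.LayerNorm):
    """LayerNorm forced to compute in fp32, even under autocast."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        orig_dtype = x.dtype
        # Disable autocast locally and upcast the input so all of the
        # normalization math happens in float32, then cast back.
        with torch.autocast(x.device.type, enabled=False):
            out = super().forward(x.float())
        return out.to(orig_dtype)
```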

I'm marking this blocked until we figure out what's wrong with triton on LUMI.

Hey @purefire, it's hard to give specific answers to some of these questions since the training code is flexible enough to work on a wide variety of hardware configurations, yet the...

@austinmw I'm not 100% certain on this, but I think you could probably train the 7B on 8 A100 GPUs with the following flags / config options for memory savings:...
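The exact flags are cut off above, so as a generic illustration only (not the options from that comment): memory savings for a 7B model on 8 GPUs usually come from sharding parameters, gradients, and optimizer state, plus mixed precision, e.g. with PyTorch FSDP:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

def wrap_for_memory_savings(model: torch.nn.Module) -> FSDP:
    # Generic illustration: shard everything across the 8 GPUs and
    # keep parameters in bf16. Requires torch.distributed to be
    # initialized first; activation checkpointing is another common knob.
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    )
```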

Looks like we're averaging about 80 tps (tokens per second) per GPU right now.
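For a rough sense of scale (the cluster size isn't stated here, so this is per GPU only):

```python
# Back-of-envelope: what ~80 tokens/sec/GPU works out to per day.
tps_per_gpu = 80
seconds_per_day = 24 * 60 * 60
print(f"{tps_per_gpu * seconds_per_day:,} tokens/GPU/day")  # 6,912,000
```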

@dirkgr I did add a config option to toggle the fused loss function in [54919e0](https://github.com/allenai/OLMo/pull/443/commits/54919e020e78d70b682b9295236b784eef2d0ed5). It defaults to false to be safe, so it will never run on LUMI.
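A minimal sketch of that kind of opt-in toggle (the fused backend import is an assumption for illustration, not necessarily what the commit wires in):

```python
import torch.nn.functional as F

def build_loss_fn(fused: bool = False):
    # Default to the stock PyTorch loss; only import and use a fused
    # kernel when explicitly enabled, so the safe path is the default.
    if fused:
        # Assumed fused backend; see the linked commit for the real wiring.
        from flash_attn.losses.cross_entropy import CrossEntropyLoss
        return CrossEntropyLoss()
    return F.cross_entropy
```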