
Question about activation function choice:

Open ElliottDyson opened this issue 1 year ago • 4 comments

Just a point of curiosity really. I noticed in your code that either ReLU or Swish is used. I understand the choice of including ReLU, as it is commonly accepted as being very efficient. However, I am a little more perplexed by the use of Swish, when SwishGLU and Mish have previously shown better performance on transformer architectures.

I'd be very curious to see how this model performs if trained with the activation function set to Mish, given it seems to be the best-performing activation function for transformer encoder feed-forward networks (as far as I'm aware).
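
For reference, a minimal sketch of the two candidates being compared, using PyTorch's built-in implementations (placement inside the Mamba block isn't shown; this is only to pin down the formulas):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4.0, 4.0, steps=9)

# Swish / SiLU: x * sigmoid(x) -- the activation currently used.
silu = F.silu(x)

# Mish: x * tanh(softplus(x)) -- the alternative suggested here.
mish = F.mish(x)

print(torch.stack([x, silu, mish]))
```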

ElliottDyson avatar Feb 02 '24 22:02 ElliottDyson

Good idea! We haven't experimented with it. To be honest I think these differences tend to get washed out with scale, but maybe not.

albertfgu avatar Feb 02 '24 23:02 albertfgu

Case in point: there is a paper where switching the activation function back to ReLU was used as a method to speed up LLaMA. They got worse results on benchmarks, but they fine-tuned the model on 5B tokens, while LLaMA was trained on trillions, so 🤷

It helps that most activation functions amount to "looks like ReLU, but differentiable and slightly different around zero, where most of the values are".

SwishGLU

It also requires extra parameters, so it's not surprising that it performs better.
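
A rough sketch of where those extra parameters come from, assuming the standard SwiGLU feed-forward layout (the dimensions here are placeholders, not anything from the Mamba code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward: SiLU(x W_gate) * (x W_up), then a down projection.
    The extra W_gate is the parameter cost compared to a plain SiLU FFN."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)  # the extra parameters
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```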

Though it would actually be interesting to see whether these activation functions are good places to inject some sort of adapters.

Maykeye avatar Feb 03 '24 18:02 Maykeye

What is the meaning behind them being potentially good places to inject adapters?

In the comparisons I have seen, the dimensionality seems to be chosen so that models are compared at a fixed parameter count, so as not to give an unfair advantage to the GLU variants.

Also, I communicated with someone who is currently doing research on activation functions, and they were able to provide me with one that worked significantly better than Mish, at least with a transformer network on the TinyStories dataset (not that the specific dataset should matter much). The equation is tanh(x1) * (x2)^2, in case you plan on testing different activation functions on the network.
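
Purely as an illustration, here is one way that two-branch activation could be wired into a GLU-style feed-forward block; the module name and projections are hypothetical, not taken from any codebase:

```python
import torch
import torch.nn as nn

class TanhSquareGLU(nn.Module):
    """Hypothetical gated block using tanh(x1) * x2**2, where x1 and x2
    come from two separate projections of the same input."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.proj1 = nn.Linear(d_model, d_hidden, bias=False)  # produces x1
        self.proj2 = nn.Linear(d_model, d_hidden, bias=False)  # produces x2
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = self.proj1(x), self.proj2(x)
        return self.down(torch.tanh(x1) * x2.pow(2))
```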

ElliottDyson avatar Feb 03 '24 18:02 ElliottDyson

What is the meaning behind them being good places to inject adapters?

Long story: arXiv:1902.00751. Short story: where LoRA replaces XW with XW + XAB and sees itself more as a replacement for a matrix, adapters replace any block f(X) with f(X) + g(XA)B (so they apply a non-linearity after the down projection, unlike LoRA; in both cases the original model is frozen and you only train a small number of new parameters).
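
A minimal sketch of that adapter form, assuming `base` is some frozen block and using GELU as a stand-in for the non-linearity g (both choices are placeholders, not the paper's exact recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Wraps a frozen block f so the output becomes f(X) + g(X A) B,
    with the non-linearity g applied after the down projection A."""
    def __init__(self, base: nn.Module, d_model: int, d_bottleneck: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # original model stays frozen
        self.down = nn.Linear(d_model, d_bottleneck)   # A
        self.up = nn.Linear(d_bottleneck, d_model)     # B
        nn.init.zeros_(self.up.weight)                 # adapter starts as a no-op
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(F.gelu(self.down(x)))
```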

So I was wondering if it's possible to easily replace SiLU with something like SiLU(x) + trainable(SwiGLU(x)) and see how it performs after training. Well, apparently the code really wants the activation function to be SiLU in many places. And while the fast path and the special causal_conv_1d can be disabled, SiLU also seems to be baked into the CUDA selective-scan kernel.
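
Roughly what I had in mind, as a pure-PyTorch sketch (it would only apply to the slow, non-fused path, and the exact tensor layout inside the Mamba block is glossed over; the bottleneck size is an arbitrary placeholder):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiLUPlusGatedBranch(nn.Module):
    """SiLU(x) plus a small trainable SwiGLU-style branch, intended as a
    drop-in replacement for the activation in the non-fused code path."""
    def __init__(self, d: int, d_bottleneck: int = 16):
        super().__init__()
        self.gate = nn.Linear(d, d_bottleneck, bias=False)
        self.up = nn.Linear(d, d_bottleneck, bias=False)
        self.down = nn.Linear(d_bottleneck, d, bias=False)
        nn.init.zeros_(self.down.weight)  # starts out as plain SiLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.silu(x) + self.down(F.silu(self.gate(x)) * self.up(x))
```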

Maykeye avatar Feb 03 '24 19:02 Maykeye