mamba
Question about activation function choice:
Just a point of curiosity, really. I noticed in your code that either ReLU or Swish is used. I understand the choice of including ReLU, as it is commonly accepted as being very efficient. However, I'm a little more perplexed by the use of Swish, when SwishGLU or Mish have previously shown better performance on transformer architectures.
I'd be very curious to see how this model performs if trained with the activation function set to Mish, given it seems to be the best-performing activation function for transformer encoder feed-forward networks (as far as I'm aware).
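For reference, Swish (SiLU) is x * sigmoid(x) and Mish is x * tanh(softplus(x)), and both ship with PyTorch, so swapping them into a feed-forward block is cheap to try. A minimal, generic sketch of what I mean (not taken from the mamba code, just an illustration):

```python
# Illustrative only: a transformer-style FFN where the activation is a plug-in
# choice. ReLU, SiLU (Swish with beta=1) and Mish all come with PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

ACTIVATIONS = {
    "relu": F.relu,   # max(0, x)
    "silu": F.silu,   # x * sigmoid(x), a.k.a. Swish-1
    "mish": F.mish,   # x * tanh(softplus(x))
}

class FFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, activation: str = "silu"):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        self.act = ACTIVATIONS[activation]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))
```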
Good idea! We haven't experimented with it. To be honest I think these differences tend to get washed out with scale, but maybe not.
Case in point: switching the activation function back to ReLU was used as one of the methods to speed up Llama. They got worse results on benchmarks, but they fine-tuned the model on 5B tokens while Llama was trained on trillions, so 🤷
It helps that most activation functions boil down to "looks like ReLU, but differentiable and slightly different around zero, where most of the values are".
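As a quick sanity check of that claim (standard PyTorch calls, nothing mamba-specific), the three functions nearly coincide away from zero and only really differ around it:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-5.0, -1.0, -0.1, 0.0, 0.1, 1.0, 5.0])
for name, fn in [("relu", F.relu), ("silu", F.silu), ("mish", F.mish)]:
    print(f"{name:>4}:", [round(v, 3) for v in fn(x).tolist()])
# Away from zero they track each other closely (at x=5: relu=5.0,
# silu≈4.967, mish≈5.000); the differences concentrate near zero
# (at x=-0.1: relu=0.0, silu≈-0.048, mish≈-0.057).
```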
SwishGLU
It also adds extra parameters, so it's not surprising that it comes out better (a quick parameter count is sketched after this reply).
Though it is actually interesting whether these activation functions are good places to inject some sort of adapters.
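To make the parameter point concrete, a rough sketch (illustrative shapes only, not the mamba code) comparing a plain SiLU MLP with a SwiGLU block at the same hidden width; the extra gating projection is where the additional parameters come from:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiluMLP(nn.Module):
    def __init__(self, d: int, h: int):
        super().__init__()
        self.up = nn.Linear(d, h, bias=False)
        self.down = nn.Linear(h, d, bias=False)
    def forward(self, x):
        return self.down(F.silu(self.up(x)))

class SwiGLU(nn.Module):
    def __init__(self, d: int, h: int):
        super().__init__()
        self.up = nn.Linear(d, h, bias=False)     # value branch
        self.gate = nn.Linear(d, h, bias=False)   # extra gating branch
        self.down = nn.Linear(h, d, bias=False)
    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(SiluMLP(512, 2048)), n_params(SwiGLU(512, 2048)))
# 2097152 vs 3145728: ~1.5x the parameters at the same hidden width.
```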
What is the meaning behind them being potentially good places to inject adapters?
In the comparisons I have seen, they choose the dimensionality so that the models are compared at a matched parameter count, so as not to give an unfair advantage to the GLU variants.
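For concreteness, the usual parameter-matching trick (as far as I understand it) is to shrink the GLU variant's hidden width by a factor of 2/3, since it has three projections instead of two:

```python
# Back-of-the-envelope parameter matching for a d=512 model with hidden width 2048.
d, h = 512, 2048
plain_params = 2 * d * h             # up + down projections
swiglu_h = int(2 * h / 3)            # shrink hidden width to ~1365
swiglu_params = 3 * d * swiglu_h     # up + gate + down projections
print(plain_params, swiglu_params)   # 2097152 vs 2096640, roughly matched
```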
Also, I've been in touch with someone who is currently doing research on activation functions, and they were able to provide one that worked significantly better than Mish, at least with a transformer network on the TinyStories dataset, not that the specific dataset should matter much. The equation is tanh(X1) * (X2)^2, in case you plan on testing different activation functions on the network.
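A hedged sketch of that activation, assuming X1 and X2 are meant as the two halves of a GLU-style up-projection (that wiring is my guess; only the formula itself comes from the message above):

```python
import torch
import torch.nn as nn

class TanhSquareGLU(nn.Module):
    # Assumption: x1 and x2 are the two halves of a single up-projection,
    # combined as tanh(x1) * (x2)^2 before the down-projection.
    def __init__(self, d: int, h: int):
        super().__init__()
        self.up = nn.Linear(d, 2 * h, bias=False)   # produces both halves at once
        self.down = nn.Linear(h, d, bias=False)
    def forward(self, x):
        x1, x2 = self.up(x).chunk(2, dim=-1)
        return self.down(torch.tanh(x1) * x2.pow(2))
```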
What is the meaning behind them being good places to inject adapters?
Long story: arxiv:1902.00751.
Short story: LoRA replaces XW with XW + XAB, and sees itself more as a replacement for a weight matrix. Adapters replace any block f(X) with f(X) + g(XA)B, so they apply a non-linearity after the down-projection, unlike LoRA. In both cases the original model is frozen and you only train a small number of new parameters.
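A minimal side-by-side sketch of the two shapes described above (illustrative only, not any particular library's API); W and block stand for the frozen pretrained parts, A and B are the small trainable matrices, and the choice of GELU for g is arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """LoRA: y = x W + x A B, with W frozen."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W.weight.requires_grad_(False)           # frozen pretrained weight
        self.A = nn.Linear(d_in, rank, bias=False)    # trainable down-projection
        self.B = nn.Linear(rank, d_out, bias=False)   # trainable up-projection
        nn.init.zeros_(self.B.weight)                 # start as a no-op
    def forward(self, x):
        return self.W(x) + self.B(self.A(x))

class Adapter(nn.Module):
    """Adapter: y = f(x) + g(x A) B, with the block f frozen."""
    def __init__(self, block: nn.Module, d: int, rank: int = 8):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad_(False)                   # frozen original block
        self.A = nn.Linear(d, rank, bias=False)       # trainable down-projection
        self.B = nn.Linear(rank, d, bias=False)       # trainable up-projection
        nn.init.zeros_(self.B.weight)                 # start as a no-op
    def forward(self, x):
        # non-linearity g applied after the down-projection, unlike LoRA
        return self.block(x) + self.B(F.gelu(self.A(x)))
```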
So I was wondering if it's possible to easily replace SiLU with something like SiLU(x) + trainable(SwiGLU(x)) and see how it performs after training.
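Roughly what I have in mind (my own interpretation, nothing the repo supports out of the box): keep the frozen SiLU path and add a small zero-initialised gated branch on top, so that training starts exactly from the original model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiLUPlusGatedAdapter(nn.Module):
    def __init__(self, d: int, rank: int = 16):
        super().__init__()
        self.up = nn.Linear(d, rank, bias=False)     # value branch
        self.gate = nn.Linear(d, rank, bias=False)   # gating branch (SwiGLU-style)
        self.down = nn.Linear(rank, d, bias=False)
        nn.init.zeros_(self.down.weight)             # output equals plain SiLU(x) at init
    def forward(self, x):
        return F.silu(x) + self.down(F.silu(self.gate(x)) * self.up(x))
```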
Well, apparently the code really wants the activation function to be SiLU in many places. And while the fast path and the special causal_conv1d can be disabled, SiLU also seems to be baked into the selective scan CUDA kernel.