
Comparison to Mish activation

Open kayuksel opened this issue 4 years ago • 7 comments

Mish is currently one of the most popular activation functions, so it would be good if you could also compare against it.

kayuksel avatar Feb 22 '21 21:02 kayuksel
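
For reference, a minimal sketch of Mish in PyTorch (equivalent to `torch.nn.Mish` in recent PyTorch versions), so it is clear what is being compared against:

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish(x) = x * tanh(softplus(x))
    return x * torch.tanh(F.softplus(x))
```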

Yes, we want to compare against it too; I'll upload updated graphs soon.

k4ntz avatar Mar 12 '21 09:03 k4ntz

@k4ntz It performed better than Mish in my case (also RL-like)

kayuksel avatar Apr 18 '21 17:04 kayuksel

Any tips on how to initialize? I used kaiming_uniform_ for the Linear layers.

kayuksel avatar Apr 18 '21 18:04 kayuksel
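
A minimal sketch of how kaiming_uniform_ might have been applied to the Linear layers mentioned above; the `nonlinearity` argument and bias handling here are assumptions, not details from the thread:

```python
import torch.nn as nn

def init_linear_kaiming(model: nn.Module) -> None:
    # Apply Kaiming-uniform initialization to every Linear layer's weights;
    # biases are simply zeroed here (an assumption, not from the thread).
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```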

Hi @kayuksel, thanks for this info! Could you share some graphs or a link to results (even a draft) showing that? We are also working on a comparison against GeLU in transformers. For the initialisation, you can use Xavier, as in our ImageNet classification task (https://github.com/ml-research/rational_sl/blob/main/imagenet/train_imagenet.py) and CIFAR (https://github.com/ml-research/rational_sl/blob/main/cifar/train.py). If I remember correctly, it empirically works better than Kaiming (for big nets). Please share any other results you might have!

k4ntz avatar Apr 19 '21 06:04 k4ntz
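
A minimal sketch of the suggested Xavier initialization combined with the library's PyTorch `Rational` module; the import path and default constructor follow the rational_activations README, and the network layout is purely illustrative:

```python
import torch.nn as nn
from rational.torch import Rational  # PyTorch interface of rational_activations

def build_mlp(sizes):
    # Small MLP with learnable Rational activations between layers;
    # Linear weights use Xavier-uniform initialization as suggested above.
    layers = []
    for in_f, out_f in zip(sizes[:-1], sizes[1:]):
        linear = nn.Linear(in_f, out_f)
        nn.init.xavier_uniform_(linear.weight)
        nn.init.zeros_(linear.bias)
        layers += [linear, Rational()]  # default Rational approximates Leaky ReLU
    return nn.Sequential(*layers[:-1])  # no activation after the output layer
```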

@k4ntz I can provide the learning curves for both, but I am unable to share a draft publication yet as it is patent-pending. Thanks for suggesting the initialization function; I will also try that.

kayuksel avatar Apr 21 '21 17:04 kayuksel

@k4ntz FYI, my case is an adversarial setting, so it is important that the model re-adapts itself continuously. Thus, having adaptable activation functions (besides the network weights themselves) may be beneficial in that sense.

Also, I prefer the network to overfit, so generalization is not a major concern (in case using an adaptable activation function increases the chance of over-fitting for certain tasks).

kayuksel avatar Apr 21 '21 17:04 kayuksel

Yes, using such activation functions provides the network with more modelling capacity; the transformation of the manifold through the layers can be more accurate if needed. I don't think Rational AFs overfit per se, but if they are used in a network cherry-picked to perform well at a task, this additional modelling power can lead to overfitting. We are also working on pruning rational nets. Whenever results are available, they will be provided. :)

k4ntz avatar Apr 21 '21 19:04 k4ntz